Changelog
All notable changes to GDPR Scanner are documented here.
Format follows Keep a Changelog. Version numbers follow Semantic Versioning.
[Unreleased]
Added
-
PDF redaction for local files: the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (page.apply_redactions()), which removes the underlying text data from the PDF stream rather than just painting over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract, then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.
-
Google Drive file redaction: the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via files().update(). Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the drive scope (not drive.readonly) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: get_drive_file_mime, download_drive_file_by_id, update_drive_file on both GoogleWorkspaceConnector and PersonalGoogleConnector.
-
SFTP file redaction: the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with sftp.open(path, "wb"). Source config is matched from _load_file_sources() by host + username; credentials are resolved from the keychain via _resolve_sftp_credentials. Requires the item to be in the current session's state.flagged_items (SFTP host info is not stored in the DB). New method: SFTPScanner.write_file(remote_path, content).
-
SMB file redaction: the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from full_path (//host/share/…). File is downloaded and re-uploaded using smbprotocol with CreateDisposition.FILE_SUPERSEDE so the file is atomically replaced. New function: file_scanner.write_smb_file(path, content, username, password, domain).
-
AI-enhanced NER via Claude: Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in Settings → AI / NER: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, document_scanner.py calls the Claude API (claude-haiku-4-5-20251001) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the anthropic package is not installed. API key stored in config.json under claude_api_key; toggle stored under claude_ner. Routes: GET/POST /api/settings/claude, POST /api/settings/claude/test.
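The bounded per-document cache mentioned for Claude NER can be pictured with the minimal sketch below. The class name BoundedCache, the SHA-256 content key, and the least-recently-used eviction policy are all assumptions made for illustration; the entry above states only that the cache is in-memory and bounded at 2 000 entries.

```python
import hashlib
from collections import OrderedDict


class BoundedCache:
    """Hypothetical bounded cache; LRU eviction is an assumption."""

    def __init__(self, max_entries=2000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    @staticmethod
    def key_for(content: bytes) -> str:
        # Key on a content hash so identical bytes share one entry.
        return hashlib.sha256(content).hexdigest()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)
```

A caller would look up `BoundedCache.key_for(file_bytes)` first and only call the remote API on a miss, which is what keeps repeated scans of the same file from being billed again.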
Fixed
- Settings modal too narrow for seven tabs — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.
[1.6.28] — 2026-05-28
Added
-
Date-range scoping for viewer tokens: tokens can now carry optional valid_from and valid_to scope fields (YYYY-MM-DD). When set, GET /api/db/flagged filters out items whose modified date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces valid_from ≤ valid_to. All three scope dimensions (role, user, date-range) are independent and combinable.
-
CPR-only mode: a new cpr_only scan option (sidebar toggle #optCprOnly, profile editor #peOptCprOnly) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. The flag already detected is still shown on cards when cpr_only=false (default). Gated in all three engines: file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
-
OCR language override: a new ocr_lang scan option (sidebar select #optOcrLang, profile editor #peOptOcrLang) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: dan+eng (default), dan, eng, dan+eng+deu, dan+eng+swe, dan+eng+fra. The setting flows from the UI through the profile, into all three scan engines (M365 _scan_bytes_timeout, M365 attachments _scan_bytes, M365 files _scan_bytes, Google _scan_bytes for both Gmail and Drive). The lang parameter is threaded through cpr_detector._scan_bytes → document_scanner.scan_pdf / scan_image and the spawned PDF-OCR subprocess worker. The OCR cache key already included lang, so per-language results are cached independently.
-
Built-in file redaction for local files: a scissor button (✂) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by ██████-████ (DOCX/XLSX) or █-blocks (CSV/TXT), then removes the card from the grid and logs a "redacted" disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in routes/export.py (POST /api/redact_item) using the existing document_scanner redact functions; front-end in results.js (redactItem) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
-
DELETE /api/delete_item route registration fix: the delete_item handler in routes/export.py was missing its @bp.route decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.
-
Scheduled report-only email job: scheduled jobs can now be configured as "report only" (toggle #schedReportOnly). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via get_session_items(). M365 authentication is not required for report-only jobs; email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting report_only=True in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).
-
Compliance audit log: every significant admin action is now written to an immutable audit_log table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via GET /api/audit_log (returns newest-first, max 1000 entries; filterable by ?action=). Visible in the Settings modal under a new Audit Log tab; the table refreshes whenever the tab is opened. The log_audit_event() module-level helper in gdpr_db.py silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.
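The atomic rewrite described above for local-file redaction (write a temp file in the same directory, then move it over the original) can be sketched as follows. The function name atomic_write is hypothetical and this is not the code in routes/export.py; it only illustrates the general pattern using the standard library.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename over the original
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Keeping the temp file next to the target matters: a rename across filesystems is a copy followed by a delete, which would lose the all-or-nothing property.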
Fixed
-
Stop button had no effect on Google Workspace scans: POST /api/scan/stop only set state._scan_abort (the M365/file abort event) and never touched state._google_scan_abort. Separately, _check_abort() inside _run_google_scan was checking gdpr_scanner._scan_abort (the M365 event) instead of the module-level _scan_abort alias that points to state._google_scan_abort. Both bugs combined meant neither the Stop button nor POST /api/google/scan/cancel had any effect on a running Google scan. Fixed by having scan_stop() set both events and having _check_abort() use the correct module-level alias.
-
Settings tab labels wrapping to two lines: adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry white-space:nowrap; .settings-tabs retains flex-wrap:wrap as a safety net on very small screens.
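The Stop-button fix above comes down to signalling every engine's abort event and having each engine poll its own event. A minimal sketch, assuming threading.Event objects and using illustrative names (m365_abort, google_abort, check_google_abort) rather than the project's actual attributes:

```python
import threading

# Stand-ins for the two abort events described above (names are illustrative).
m365_abort = threading.Event()
google_abort = threading.Event()


def scan_stop() -> None:
    """Signal every engine; setting an already-set event is harmless."""
    for event in (m365_abort, google_abort):
        event.set()


def check_google_abort() -> bool:
    """The Google engine must poll its own event, not the M365 one."""
    return google_abort.is_set()
```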
[1.6.27] — 2026-05-27
Added
-
Email body excerpt preserved for offline preview: when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (body_excerpt), the checkpoint JSON, and a new body_excerpt DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper _excerpt_page() in routes/database.py renders the excerpt with the same header layout as the full Graph-fetched preview.
-
Re-scan diff (resolved items in history view): when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green ✓ Resolved badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs, so no new API endpoint is needed. Implemented in history.js (loadHistorySession) and results.js (appendCard).
-
Google Workspace scan test suite: 19 new tests in tests/test_google_scan.py covering all three routes (GET /api/google/scan/users, POST /api/google/scan/start, POST /api/google/scan/cancel) and the core scan engine (_run_google_scan). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as scan_file_flagged, clean items are not, source_type is correctly set to "gmail" for Gmail items and "gdrive" for Drive items, and google_scan_done always fires with correct flagged_count / total_scanned values.
[1.6.26] — 2026-04-29
Fixed
-
Previous scan results visible when a new scan starts: two async functions (loadHistorySession and loadLastScanSummary) could resolve after startScan had already cleared the grid. loadHistorySession would re-populate the grid with old history items; loadLastScanSummary would re-show the last-scan summary card. Both functions now bail early after each await if any of the three scan-running flags (S._m365ScanRunning, S._googleScanRunning, S._fileScanRunning) is set. Those flags are written synchronously by startScan before any awaits, so the check is race-free.
-
Selected card scrolls out of view when preview panel opens: clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. openPreview() now schedules a requestAnimationFrame after removing .hidden from the panel so the card is scrolled back into view (scrollIntoView block: nearest) once the layout has settled.
-
Gmail and Google Drive preview crashed with a 404 Graph API error: _source_type was never set on Google items in routes/google_scan.py, so Gmail and Google Drive cards carried an empty source_type. The preview route in routes/database.py only checked for "local", "smb", and "email" before falling through to the M365 else-branch, which tried to call https://graph.microsoft.com/.../drive/items/gmail:{id}/preview, which is always a 404. Fixed by tagging Gmail items as _source_type = "gmail" and Google Drive items as "gdrive" at scan time. The preview route now handles both: Google Drive files get an embeddable https://drive.google.com/file/d/{id}/preview iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The state.connector (M365 auth) guard was also moved inside the email and M365 else branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.
[1.6.25] — 2026-04-25
Added
-
Checkpoint / resume for Google and File scans: stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting; if one exists the resume banner is shown regardless of whether the user reloaded the page. POST /api/scan/checkpoint returns a per-engine breakdown; POST /api/scan/clear_checkpoint wipes all checkpoint_*.json files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. checkpoint.py functions gained a prefix keyword argument (default "m365"); existing M365 call sites are unchanged.
-
CPR cross-referencing (related documents): clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects ?ref=N). Powered by a self-join on the existing cpr_index table, so no new data collection is needed. New GDPRDb.get_related_items(item_id, ref_scan_id) method and GET /api/db/related/<item_id>?ref=N endpoint in routes/database.py. Frontend: #previewRelated div in the preview panel, _loadRelated(f) in results.js, window._openRelated(id, itemData) helper (looks up live S.flaggedData first, falls back to API response for history items).
-
Email address and Danish phone number detection: all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options, Scan for email addresses and Scan for phone numbers (default off), appear in the scan options panel and profile editor. When enabled, matches are stored as email_count / phone_count on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (local@domain.tld); phone regex covers 8-digit Danish numbers with optional +45 / 0045 prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to flagged_items; applied automatically on first startup via _MIGRATIONS).
-
SFTP as a 4th file connector: SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new SFTPScanner class in sftp_connector.py implements the same iter_files() interface as FileScanner, so run_file_scan(), SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via POST /api/file_sources/upload_key and stored in ~/.gdprscanner/sftp_keys/ with chmod 600. SFTP sources appear with a 🔒 icon in the sources panel. Requires paramiko>=3.4 (optional; scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
-
POST /api/file_sources/upload_key: new endpoint that validates and stores an SSH private key file, returning a key_path for use in the source definition.
-
SFTP entry in export SOURCE_MAP: Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (EDE9F7), consistent with the existing per-source tab and summary table logic.
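For the email and Danish phone detection listed above, the following is a rough sketch of what "structurally valid address" and "8-digit number with optional +45 / 0045 prefix and common spacing" could mean in regex form. Both patterns and the helper count_contacts are assumptions written for illustration; the scanner's real expressions are not shown in this changelog and may differ.

```python
import re

# Assumed pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b"
)

# Assumed pattern: optional +45/0045 prefix, then 8 digits written as
# "12345678", "12 34 56 78" or "1234 5678"; must not sit inside a longer number.
PHONE_RE = re.compile(
    r"(?<![\d+])(?:(?:\+45|0045) ?)?"
    r"(?:\d{8}|\d{2}(?: \d{2}){3}|\d{4} \d{4})(?!\d)"
)


def count_contacts(text: str) -> tuple[int, int]:
    """Return (unique emails, unique phone numbers) found in text."""
    emails = {m.lower() for m in EMAIL_RE.findall(text)}
    # Normalise to the last 8 digits so prefix and spacing variants collapse.
    phones = {re.sub(r"\D", "", m)[-8:] for m in PHONE_RE.findall(text)}
    return len(emails), len(phones)
```

Normalising before counting is what makes the deduplication meaningful: the same number written with and without a country prefix should count once.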
Fixed
-
File source form placeholders untranslated: all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new data-i18n-placeholder keys added to en.json, da.json, and de.json; all 12 affected <input> elements now carry data-i18n-placeholder attributes.
-
"Name" and "Auth" labels untranslated in SFTP form: the source-name label and the Auth toggle label in the add-source panel had no data-i18n attributes. Added keys m365_fsrc_name (DA: "Navn") and m365_fsrc_sftp_auth (same across languages). The name label used an inner <span data-i18n> to preserve the required-field * indicator, which would have been clobbered by a data-i18n on the outer <label> element. The same clobber bug was fixed for the m365_fsrc_label usage in the edit form.
-
Password field placeholder showed "Stored in OS keychain" in English: added translation key m365_fsrc_pw_keychain_placeholder (DA: "Gemt i OS-nøglering") and applied data-i18n-placeholder to the three password inputs across both forms (SMB add, SFTP add, SMB edit).
[1.6.24] — 2026-04-25
Fixed
- Scheduler UI showed untranslated English strings: frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in scheduler.js instead of using t(). All six call sites in schedLoad(), schedRenderJobs(), and schedLoadHistory() now call t() with the appropriate key. Three new translation keys added to en.json, da.json, and de.json: m365_sched_no_jobs, m365_sched_running, m365_sched_disabled.
[1.6.23] — 2026-04-21
Added
-
Video file metadata scanning: .mp4, .mov, .m4v, .avi, .mkv, .wmv, .flv, .webm files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime ©xyz atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the gps_location special category, exactly like a geotagged photo. AVI metadata (RIFF INFO INAM/IART/ICMT) is parsed without any external library. Requires mutagen>=1.47 (added to requirements.txt).
-
Audio file metadata scanning: .mp3, .flac, .ogg, .m4a, .aac, .wma, .wav, .opus, .aiff files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses mutagen.File(easy=True) which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with exif_pii. Fixed a silent bug in _extract_audio_metadata where mutagen.File(io.BytesIO(content), filename) was passing the BytesIO as the filename positional argument; corrected to mutagen.File(fileobj=..., filename=...).
-
Audio and video test fixtures: tests/fixtures/local_files/generate_fixtures.py now generates 6 new fixtures: 14_audio_artist_pii.mp3, 15_audio_artist_pii.flac (artist name → flag), 16_audio_no_pii.mp3, 17_audio_no_pii.flac (no tags → no flag), 18_video_gps.mp4 (GPS + artist → flag), 19_video_no_pii.mp4 (no tags → no flag). Total fixtures: 19 (14 flagged, 5 negative).
Fixed
-
Audio and video files not appearing in local/SMB file scan: file_scanner.py maintained its own hardcoded DEFAULT_EXTENSIONS set that was never updated when video and audio extensions were added to cpr_detector.SUPPORTED_EXTS. Fixed by importing SUPPORTED_EXTS from cpr_detector directly; DEFAULT_EXTENSIONS is now an alias for it. cpr_detector.SUPPORTED_EXTS is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
-
Profile copy rename not reflected in left column until modal reopen: saving a renamed profile via the full editor (_pmgmtSaveFullEdit) called loadProfiles() to refresh S._profiles but never called _renderProfileMgmt(), so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling _renderProfileMgmt() immediately after loadProfiles() and re-applying the .active highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
[1.6.22] — 2026-04-21
Added
-
Auto-email after manual scan: a new Email report after manual scan toggle in Settings → Email report sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as auto_email_manual in smtp.json. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected).
-
Route integration test suite: 44 new tests in tests/test_route_integration.py covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on GET /api/db/flagged, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on run_scan() exception, and GET /api/db/sessions shape and ordering. Total test count: 172.
Fixed
-
Role scope filter silently returned nothing: GET /api/db/flagged filtered rows by row.get("role") but the column returned from the DB is user_role. Role-scoped viewer tokens ({"role": "student"} or {"role": "staff"}) therefore excluded every item and returned an empty list. Fixed in routes/database.py.
-
Historical session query included newer scans: gdpr_db.get_session_items(ref_scan_id=N) used a lower-bounded window (started_at >= ref.started_at - 300) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. Fixed by adding an upper bound (started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300).
-
Scan button stuck disabled after file scan: run_file_scan broadcast a scan_start SSE event, which the scan_start handler in _attachSchedulerListeners intercepted and set S._m365ScanRunning = true. When file_scan_done fired it checked !S._m365ScanRunning before re-enabling the button; finding it still true, the button stayed disabled permanently. No scan_done (M365) ever arrives to clear the flag. Fixed by removing the scan_start broadcast from run_file_scan; the scan_phase "Files — …" event immediately following already sets _fileScanRunning correctly via the phase-source detection in _attachScanListeners.
-
TypeError: unhashable type: 'dict' during file and M365 scans: _distinct_cprs = list(dict.fromkeys(cprs)) in both scan paths treated cprs as a list of strings, but extract_matches returns a list of dicts ({"formatted": "…", "page": …, …}). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both run_file_scan (line 251) and run_scan (line 1100) by keying on c["formatted"]: list(dict.fromkeys(c["formatted"] for c in cprs)).
-
Profile applied early lost user selection and source checkboxes, caused by two startup race conditions. (1) Profiles with user_ids = "all" applied before the M365 user list had loaded ran .forEach() on an empty array (no-op); when loadUsers() completed it defaulted all users to selected = false with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a _pendingProfileAllUsers deferred flag mirroring the existing _pendingProfileUserIds mechanism; loadUsers() applies it after populating S._allUsers. (2) If the profile was selected in the narrow window before _loadFileSources() returned and rendered the sources panel, _applyProfile() iterated zero checkboxes and the source selection was silently discarded; a subsequent renderSourcesPanel() call then re-rendered all sources as checked (their default). Fixed by calling renderSourcesPanel() in _applyProfile() when no source checkboxes are present in the DOM yet, the same guard already used in loadUsers().
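The unhashable-dict fix listed above is easy to reproduce in isolation. In this sketch the CPR-like values are made-up placeholders and distinct_cprs is a hypothetical helper name; only the dict.fromkeys expression comes from the entry itself.

```python
# Matches shaped like the dicts described above; values are made-up placeholders.
cprs = [
    {"formatted": "010190-1234", "page": 1},
    {"formatted": "010190-1234", "page": 3},
    {"formatted": "020280-5678", "page": 2},
]


def distinct_cprs(matches):
    """Order-preserving de-duplication keyed on the hashable string field."""
    return list(dict.fromkeys(m["formatted"] for m in matches))


def naive_distinct(matches):
    """The pre-fix form: dicts are unhashable, so this raises TypeError."""
    return list(dict.fromkeys(matches))
```

dict.fromkeys keeps first-seen order, which a plain set() would not guarantee.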
[1.6.21] — 2026-04-20
Added
-
Local-file scan test fixtures: tests/fixtures/local_files/ contains 13 ready-made files (.txt, .csv, .docx, .xlsx) covering every detection scenario: CPR with explicit label, mod-11–valid CPR without label, post-2007 CPR with/without context keyword, protected number (day+40), multiple CPRs in one file, mixed PII (CPR + email + Art. 9 health data), and three true-negative cases (clean content, invoice false-positive, post-2007 serial number without context). All CPR numbers are mathematically valid; false-positive fixtures are verified to produce zero hits. Run generate_fixtures.py to regenerate the binary files.
-
Interface PIN: optional session-level authentication gate for the main scanner interface. Set a 4–8 digit PIN in Settings → Security → Interface PIN; anyone reaching http://host:5100 is redirected to /login and must enter the PIN before accessing scan controls, settings, or results. Viewer tokens and the /view route are completely unaffected; reviewers continue to use their own auth chain. The PIN is stored as a salted SHA-256 hash in config.json. Brute-force protection: 5 failed attempts per IP locks out for 5 minutes. A POST /api/interface/logout endpoint clears the session. PIN management via GET/POST/DELETE /api/interface/pin.
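The Interface PIN entry above combines a salted SHA-256 hash with a per-IP lockout. The sketch below shows one way those two pieces could fit together. How the salt is combined with the PIN, the AttemptLimiter class, and the injectable clock are assumptions for illustration, not the scanner's implementation.

```python
import hashlib
import hmac
import secrets
import time

MAX_ATTEMPTS = 5        # failed attempts per IP before lockout (from the entry)
LOCKOUT_SECONDS = 300   # 5 minutes (from the entry)


def hash_pin(pin, salt=None):
    """Return (salt, hex digest); salt-prefixing is an assumed scheme."""
    salt = salt or secrets.token_hex(16)
    digest = hashlib.sha256((salt + pin).encode("utf-8")).hexdigest()
    return salt, digest


def verify_pin(pin, salt, expected):
    _, digest = hash_pin(pin, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison


class AttemptLimiter:
    """Hypothetical per-IP failure counter with a timed lockout."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}      # ip -> consecutive failure count
        self._locked_until = {}  # ip -> clock value when the lock expires

    def is_locked(self, ip):
        until = self._locked_until.get(ip)
        if until is None:
            return False
        if self._clock() >= until:
            # Lock expired: forget it and reset the counter.
            del self._locked_until[ip]
            self._failures.pop(ip, None)
            return False
        return True

    def record_failure(self, ip):
        count = self._failures.get(ip, 0) + 1
        self._failures[ip] = count
        if count >= MAX_ATTEMPTS:
            self._locked_until[ip] = self._clock() + LOCKOUT_SECONDS

    def record_success(self, ip):
        self._failures.pop(ip, None)
```

A login handler would check is_locked() first, then call verify_pin() and record the outcome.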
Fixed
-
"Vælg" (select mode) button did nothing: toggleSelectMode, toggleCardSelect, selectAllVisible, and applyBulkDisposition were defined inside an ES module but never assigned to window, so all onclick attributes calling them silently failed. Added the four missing window.* exports at the bottom of results.js.
-
Progress counter frozen at M365 total during Google/file scan: the scan_progress handler in scan.js only updated progressStats and progressEta for source === "m365". When M365 finished first, the counter stayed at its final value (e.g. "15083 / 15083 ETA 0s") for the entire duration of the Google and file scans. Fixed in two places: scan_done now clears the stats/ETA elements immediately when another scan is still running; scan_progress for Google/file sources now shows a running "X scanned" count (using the scanned field those engines already send) and clears ETA, but only while M365 is not running, since M365 stats continue to dominate during concurrent scans.
-
PDF OCR kills process on large files: document_scanner previously called convert_from_path() once for the entire PDF before the processing loop, allocating all page images in memory simultaneously. A 50-page A4 PDF at 300 DPI required ~1.3 GB in a single allocation, triggering the OS OOM killer. Fixed by rendering one page at a time with convert_from_path(first_page=N, last_page=N) inside the loop across scan_pdf, redact_fitz_pdf, and redact_pdf. Peak OCR memory is now bounded to roughly one page (~26 MB at 300 DPI) regardless of document length.
-
No bulk disposition tagging: each result card had to be opened individually to set a disposition. Added a Select mode (filter bar "Vælg" button) that reveals per-card checkboxes. Selecting one or more items shows a bulk tag bar at the bottom of the grid with a disposition dropdown and Apply button. Calls POST /api/db/disposition/bulk; updates all selected items in-memory and clears the selection. "Select all visible" / "Deselect all" toggle available in the bar. Hidden in viewer mode.
-
No disposition progress summary — added a thin stats bar between the filter bar and the grid showing total · unreviewed · retain · delete · % reviewed. Updates after every single or bulk disposition save and after each grid render. Unreviewed count is highlighted in red until everything is tagged; turns green at 100%.
-
Google Drive always did a full scan: Drive scanning in routes/google_scan.py used conn.iter_drive_files() on every run, re-downloading every file regardless of what changed. Added Google Drive delta scan using the Drive Changes API. When delta is enabled in scan options, the first run records a Changes API start page token per user (gdrive:{email} key in delta.json). Subsequent runs call conn.get_drive_changes(user_email, token) and only process files that have been added or modified since the last scan. Invalid or expired tokens fall back to a full scan automatically. Token save loads the current delta.json fresh before writing to avoid racing with concurrent M365 token saves. The google_scan_done SSE event now includes delta and delta_sources fields.
-
No memory guard before OCR page renders: added _ocr_mem_ok() check (psutil.virtual_memory().available >= 500 MB) before each page render in all three OCR paths. Pages that would exceed the threshold are skipped and recorded as "skipped" in page_methods with a printed warning rather than crashing the scan.
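The memory figures quoted in the PDF OCR entry above (~26 MB per page, ~1.3 GB for 50 pages) follow from simple arithmetic. The sketch below reproduces them, assuming an uncompressed 8-bit RGB raster (3 bytes per pixel) and A4 dimensions of 8.27 × 11.69 inches; page_bytes is a hypothetical helper and the renderer's real per-page overhead may differ.

```python
A4_INCHES = (8.27, 11.69)  # width, height of an A4 page


def page_bytes(dpi=300, channels=3, size_inches=A4_INCHES):
    """Bytes for one uncompressed page raster (assumes 1 byte per channel)."""
    width_px = round(size_inches[0] * dpi)
    height_px = round(size_inches[1] * dpi)
    return width_px * height_px * channels


per_page = page_bytes()     # one page held in memory at a time
whole_doc = per_page * 50   # all 50 pages rendered up front

print(f"per page: {per_page / 1e6:.1f} MB, 50 pages: {whole_doc / 1e9:.2f} GB")
```

Rendering inside the loop keeps only the per_page figure resident, which is why peak memory no longer grows with page count.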
[1.6.20] — 2026-04-18
Fixed
-
Graph sendMail reported as failure despite email being delivered: _post() in m365_connector.py called r.json() unconditionally after raise_for_status(). The Graph sendMail endpoint returns HTTP 202 with an empty body on success, causing json.JSONDecodeError: Expecting value: line 1 column 1 (char 0). This was caught by the smtp_test exception handler and surfaced as an error even though the email had been sent. Fixed by returning r.json() if r.content else {} so any Graph endpoint that responds with no body (sendMail, delete operations, etc.) is handled correctly.
-
Graph error hidden when SMTP host not configured: when Graph failed and no SMTP host was saved, smtp_test returned the generic "No SMTP host configured" message, swallowing the actual Graph error. The if not host branch now surfaces the Graph exception text alongside the Mail.Send permission guidance so the real cause is visible.
-
Gmail vs Google Workspace SMTP error messages: the auth failure handler now detects whether the username is a personal Gmail address (@gmail.com) or a Google Workspace custom-domain account, and shows a different message for each. Personal Gmail: existing App Password troubleshooting steps. Google Workspace: explains that SMTP access is controlled by the Workspace admin console (2-Step Verification policy, SMTP relay service), not the user's personal security settings.
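The sendMail fix above guards JSON parsing on an empty body. The sketch below demonstrates the guard with a small stand-in response class (FakeResponse is hypothetical and only mimics the content attribute and json() method of an HTTP response object); it is not the connector's actual _post().

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response; not a real client class."""
    status_code: int
    content: bytes

    def json(self):
        # Mirrors the failure mode: parsing an empty body raises JSONDecodeError.
        return json.loads(self.content.decode("utf-8"))


def parse_body(r):
    """Return parsed JSON, or an empty dict when the body is empty."""
    return r.json() if r.content else {}
```

A 202 with no body, as described for sendMail, now yields {} instead of an exception, while normal JSON responses are parsed unchanged.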
[1.6.19] — 2026-04-18
Fixed
- Gmail SMTP error message misleading when App Password already in use: the auth failure handler in both smtp_test and send_report unconditionally told the user to "create an App Password", even when they were already using one. Gmail returns the same 535 / Username and Password not accepted error for a wrong app password, a revoked app password, spaces left in the 16-character code, or a wrong username, none of which are helped by the old message. The Gmail branch now lists the three most common causes (spaces in the code, revoked password, wrong username) and still links to the App Password page to generate a new one. The Microsoft personal account branch is unchanged.
[1.6.18] — 2026-04-18
Fixed
- Art.30 and Excel exports missing GWS and local/SMB sources: two silent failures caused Google Workspace and file-scan results to be absent from all exports after a page reload.
  - routes/google_scan.py: called _db.end_scan() (method does not exist on GDPRDb; the correct name is finish_scan). The resulting AttributeError was swallowed by the bare except Exception: pass guard, so finished_at was never written on GWS scan records. Since get_session_items() requires finished_at IS NOT NULL, every GWS scan was permanently invisible to both export functions.
  - routes/google_scan.py: emitted "scan_done" at completion instead of "google_scan_done", causing the M365 done handler to fire for Google scans and breaking the SSE teardown logic.
  - scan_engine.py (run_file_scan): called _db.begin_scan(sources=…, user_count=0, options=source) with keyword arguments, but begin_scan(self, options: dict) only accepts a single positional dict. The TypeError was caught silently, leaving _db_scan_id = None; all subsequent save_item calls were skipped, so local and SMB items were never written to the database.
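The failures above share one mechanism: a broad except hides a call that can never succeed. The following sketch reproduces it with a stand-in class. StubDb and start_scan_silently are hypothetical names; only the begin_scan(self, options: dict) signature is taken from the entry.

```python
class StubDb:
    """Stand-in with the begin_scan signature quoted above."""

    def begin_scan(self, options: dict) -> int:
        return 1  # pretend scan id


def start_scan_silently(db):
    """Mimics the buggy call site: extra keywords raise TypeError, which is hidden."""
    try:
        return db.begin_scan(sources=["local"], user_count=0, options={})
    except Exception:
        return None  # the scan id is lost and later saves get skipped


def start_scan(db):
    """Corrected call: pass everything inside the single options dict."""
    return db.begin_scan({"sources": ["local"], "user_count": 0})
```

Logging the exception inside the except block, instead of discarding it, would have made both failures visible on the first run.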
[1.6.17] — 2026-04-18
Added
- Scan history browser — results from any past scan session can now be reviewed without running a new scan. On page load, when no scan is running, the last completed session is automatically loaded into the results grid. A History banner appears above the filter bar showing the session date, scanned sources, and item count. A Sessions button in the banner opens a dropdown listing all past sessions newest-first, each showing date, time, source labels, item count, and Delta / Latest badges. Clicking a session loads its items. A Latest scan button (shown only when browsing a past session) jumps back to the most recent session. Starting a new scan exits history mode and takes over the grid with live SSE results. Session cache is invalidated on each scan completion so the picker always reflects the true state of the database.
  - gdpr_db.py — new get_sessions(limit, window_seconds) groups all completed scans by the 300-second concurrent-scan window and returns session summaries newest-first. get_session_items() gains an optional ref_scan_id parameter to anchor the session window to any past scan.
  - routes/database.py — new GET /api/db/sessions; GET /api/db/flagged now accepts ?ref=<scan_id> to serve items for a specific historical session.
  - static/js/history.js (new) — loadHistorySession(refScanId), openHistoryPicker(), closeHistoryPicker(), exitHistoryMode(), invalidateHistoryCache() all exposed on window.
  - state.js — _historyRefScanId: null tracks which session is currently displayed (null = live/SSE).
  - results.js — initial status check calls loadHistorySession(null) instead of loadLastScanSummary().
  - scan.js — startScan() calls exitHistoryMode(); all three *_done handlers call invalidateHistoryCache().
- User-scoped viewer tokens (#34) — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a User option that opens a searchable name autocomplete backed by the already-loaded S._allUsers list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as {"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}. Server-side enforcement in GET /api/db/flagged filters WHERE account_id IN (list) so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (#viewerIdentityBadge); #filterRole is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (account_id = "") never appear in user-scoped views — consistent with the existing role-scope behaviour.
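The session grouping described for get_sessions(limit, window_seconds) can be pictured with a minimal sketch. It assumes each completed scan is reduced to its started_at timestamp in seconds and that a session is anchored on its newest scan; the name group_sessions and the plain-list input are illustrative and not taken from gdpr_db.py.

```python
from typing import Iterable, List, Optional


def group_sessions(
    started_ats: Iterable[float],
    window_seconds: float = 300,
    limit: Optional[int] = None,
) -> List[List[float]]:
    """Group scan start times into sessions, newest session first.

    A scan joins the current session when it started no more than
    window_seconds before that session's newest scan (the anchor).
    """
    sessions: List[List[float]] = []
    for ts in sorted(started_ats, reverse=True):
        if sessions and sessions[-1][0] - ts <= window_seconds:
            sessions[-1].append(ts)   # still inside the anchor's window
        else:
            sessions.append([ts])     # this scan anchors a new, older session
    return sessions[:limit]           # limit=None keeps every session
```

With start times 1000, 1100, 1290, 2000 and 2250 this yields two sessions: the scans at 2250 and 2000, then the three older scans anchored at 1290.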
[1.6.16] — 2026-04-18
Added
- User-scoped viewer tokens (#34) — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a User option that opens a searchable name autocomplete backed by the already-loaded S._allUsers list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as {"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}. Server-side enforcement in GET /api/db/flagged filters WHERE account_id IN (list) so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (#viewerIdentityBadge); #filterRole is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (account_id = "") never appear in user-scoped views — consistent with the existing role-scope behaviour.
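The server-side WHERE account_id IN (list) filter might look roughly like the sketch below. It assumes a SQLite table named flagged_items with id and account_id columns; both the table layout and the function name flagged_for_user are hypothetical stand-ins for whatever the real route uses.

```python
import sqlite3
from typing import List, Sequence, Tuple


def flagged_for_user(conn: sqlite3.Connection, emails: Sequence[str]) -> List[Tuple[int, str]]:
    """Return flagged rows whose account_id matches any address in the token scope."""
    if not emails:
        return []  # an empty scope matches nothing (and "IN ()" is invalid SQL)
    # One "?" per address keeps the query parameterised; values are never
    # interpolated into the SQL string.
    placeholders = ", ".join("?" for _ in emails)
    sql = (
        "SELECT id, account_id FROM flagged_items "
        f"WHERE account_id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, list(emails)).fetchall()
```

Because file-scan items carry an empty account_id, they never match a scope made of email addresses, which is consistent with the behaviour described above.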
[1.6.15] — 2026-04-12
Added
- Role-scoped viewer tokens — viewer token links can now be restricted to a single role so the recipient can only see student or staff items. A new Role scope dropdown (All roles / Ansatte / Elever) in the Share modal is selected when creating a token. The scope is stored as "scope": {"role": "student"|"staff"} in viewer_tokens.json. Enforcement is two-layered: GET /api/db/flagged filters items server-side using session["viewer_scope"].role set at token validation time; the #filterRole dropdown in the viewer is pre-set and hidden so the constraint cannot be bypassed client-side. Tokens without a scope field (existing tokens, PIN sessions) remain unrestricted. Role badge (Ansatte / Elever) shown on each scoped token row in the Active links list.
- Role filter in results + role-scoped exports — a new Role dropdown in the filter bar (All roles / Ansatte / Elever) narrows the results grid to staff or student items. Clicking Excel or Art.30 while a role is selected exports only that group — the ?role=student|staff param is forwarded to both export endpoints. _build_excel_bytes() and _build_article30_docx() now accept a role param; all internal sheets (GPS, External transfers, Art.30 staff/student tables) respect the filter. Filenames get an _elever or _ansatte suffix.
- Scan filter options for student environments — two new profile options reduce noise when scanning student accounts:
  - Ignore GPS in images (skip_gps_images) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
  - Min. CPR count per file (min_cpr_count, default 1) — a file is only flagged if it contains at least this many distinct CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
  - Both options are saved in profiles and editable in the Profile Manager editor.
- GitHub Actions CI/CD — macOS build — .github/workflows/build.yml now also builds a macOS .app bundle (macos-15, Apple Silicon ARM64) on every push to main and on v* tags. Released as GDPRScanner_macos_arm64.zip. (Originally macos-13 / Intel, changed when GitHub retired that runner.)
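The min_cpr_count rule (deduplication by value, then a threshold) can be sketched as follows. The helper names and the normalisation step that strips hyphens and spaces are assumptions made for illustration; the changelog only states that repeated values count once.

```python
from typing import Iterable


def distinct_cpr_count(cpr_hits: Iterable[str]) -> int:
    """Count distinct CPR values; '010190-1234' and '0101901234' are treated as equal.

    The hyphen/space normalisation is an assumption of this sketch.
    """
    normalised = {hit.replace("-", "").replace(" ", "") for hit in cpr_hits}
    return len(normalised)


def should_flag(cpr_hits: Iterable[str], min_cpr_count: int = 1) -> bool:
    """Flag a file only when it holds at least min_cpr_count distinct CPR numbers."""
    return distinct_cpr_count(cpr_hits) >= min_cpr_count
```

With min_cpr_count set to 2, a consent form that repeats one CPR ten times is not flagged, while a class list with two different CPRs is.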
Fixed
- OneDrive 404 errors during delta scans — GET /users/{id}/drive/root/delta returns 404 for users with no OneDrive licence, a disabled service plan, a drive that was never provisioned (account never signed in), or a suspended account. Previously these 404s fell through to requests.raise_for_status() and were caught by the generic except Exception handler in _scan_user_onedrive, broadcasting a red scan_error card. Full scans never showed the error because _iter_drive_folder_for has a bare except Exception: return. Fixed by adding M365DriveNotFound(M365Error) to m365_connector.py, raising it from _get() on HTTP 404, and handling it explicitly in _scan_user_onedrive with a scan_phase broadcast ("OneDrive (user): not provisioned — skipped") before the generic exception handler.
- CI — Windows artifact never uploaded — PyInstaller --onedir puts the exe inside dist/GDPRScanner/, not at dist/*.exe. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips dist\GDPRScanner\ into GDPRScanner_windows_x64.zip (mirroring the existing Linux step).
- EFFORT_ESTIMATE.md — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
- Settings → Security tab — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's Configure button navigates directly to the Security tab.
- Viewer mode layout — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The 🔍 GDPRScanner brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
- Share modal — Revoke / Copy buttons broken — JSON.stringify(token) produced a double-quoted string that terminated the surrounding onclick="…" HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format.
- Viewer PIN — Clear PIN rejected with "current PIN is incorrect" — clicking Clear PIN without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty.
- Share modal — all UI strings now translated — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (share_*, viewer_pin_*) in en.json, da.json, and de.json.
- Excel / ART.30 export — Gmail and Google Drive missing from summary — by_source was built from flagged items only, so sources that produced zero hits were silently skipped. Both the Excel Summary sheet and the ART.30 "Breakdown by source" table now include every source that was actually scanned, showing 0 items and 0 CPR hits where nothing was found. New GDPRDb.get_session_sources() method reads the sources JSON column from all scans in the current session window to determine which sources ran.
- Scan never finishes when M365 + Google run concurrently — scan_done (M365 finished) was closing the SSE connection immediately via S.es.close(), even when S._googleScanRunning or S._fileScanRunning was still true. The google_scan_done / file_scan_done events therefore never arrived, leaving the progress bar stuck at 100% indefinitely. SSE teardown is now deferred until the last concurrent scan completes: scan_done only closes the connection if neither Google nor File is still running; google_scan_done and file_scan_done close it when they are the final scan to finish.
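The OneDrive 404 fix above relies on a dedicated exception subclass that is caught before the generic handler. A minimal sketch of that pattern follows; fake_get, scan_user_onedrive, and the broadcast callback are simplified stand-ins invented for this example, and only the class names M365Error and M365DriveNotFound come from the entry itself.

```python
from typing import Any, Callable, List


class M365Error(Exception):
    """Base class for connector errors."""


class M365DriveNotFound(M365Error):
    """Raised when Graph answers 404 for a user's drive."""


def fake_get(status_code: int, payload: Any) -> Any:
    """Stand-in for the connector's HTTP helper: map status codes to exceptions."""
    if status_code == 404:
        raise M365DriveNotFound("drive not provisioned")
    if status_code >= 400:
        raise M365Error(f"HTTP {status_code}")
    return payload


def scan_user_onedrive(fetch: Callable[[], List[Any]],
                       broadcast: Callable[[str, str], None]) -> List[Any]:
    """Fetch drive items; treat a missing drive as a skip, anything else as an error."""
    try:
        return fetch()
    except M365DriveNotFound:
        # The specific handler must come first, otherwise the generic one wins.
        broadcast("scan_phase", "OneDrive: not provisioned, skipped")
        return []
    except Exception as exc:
        broadcast("scan_error", str(exc))
        return []
```

Ordering matters here: because M365DriveNotFound derives from Exception, placing its handler after the generic one would make it unreachable.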
[1.6.14] — 2026-04-10
Added — read-only viewer mode (#33)
A DPO, school principal, or compliance coordinator can now review scan results and tag dispositions without access to scan controls, credentials, or settings.
Token links
- New 🔗 Share button in the topbar opens a token management modal.
- Create generates a 64-char hex token (secrets.token_hex(32)) with an optional label and expiry (7 d / 30 d / 90 d / 1 yr / never).
- Copy copies the full http://host:5100/view?token=… URL to the clipboard.
- Revoke deletes the token immediately; any browser using it is locked out on next navigation.
- Tokens are stored in ~/.gdprscanner/viewer_tokens.json with created_at, expires_at, and last_used_at metadata. Expired tokens are cleaned up on each list fetch.
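Token creation and expiry clean-up along these lines can be sketched with the standard library. Only secrets.token_hex(32) and the three metadata field names come from the list above; the function names, the label field, and the ISO 8601 timestamp format are assumptions of this sketch.

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def create_token(label: str = "", expires_in_days: Optional[int] = None,
                 now: Optional[datetime] = None) -> Dict[str, Optional[str]]:
    """Build a token record; expires_in_days=None means the token never expires."""
    now = now or datetime.now(timezone.utc)
    expires_at = now + timedelta(days=expires_in_days) if expires_in_days is not None else None
    return {
        "token": secrets.token_hex(32),  # 32 random bytes -> 64 hex characters
        "label": label,
        "created_at": now.isoformat(),
        "expires_at": expires_at.isoformat() if expires_at else None,
        "last_used_at": None,
    }


def prune_expired(tokens: List[Dict[str, Optional[str]]],
                  now: Optional[datetime] = None) -> List[Dict[str, Optional[str]]]:
    """Drop records whose expires_at is in the past; keep non-expiring ones."""
    now = now or datetime.now(timezone.utc)
    return [
        t for t in tokens
        if t["expires_at"] is None or datetime.fromisoformat(t["expires_at"]) > now
    ]
```

Calling prune_expired on every list fetch mirrors the clean-up behaviour described above without needing a background job.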
PIN alternative
- A 4–8 digit numeric PIN can be set in Settings → General → Viewer PIN.
- Opening /view without a token shows a PIN entry form (templates/viewer_pin.html).
- Correct PIN sets a Flask session cookie (session["viewer_ok"]) valid for the browser session — no token needed after that.
- Brute-force guard: 5 failed attempts per 5 minutes per IP returns 429.
- PIN stored as salted SHA-256 inside viewer_tokens.json (no extra dependencies).
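The salted SHA-256 storage and the 5-attempts-per-5-minutes guard could be implemented roughly as below, using only the standard library. The salt length, the salt-then-PIN concatenation, the record layout, and the AttemptLimiter class are assumptions for illustration; the changelog does not specify them.

```python
import hashlib
import hmac
import secrets
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


def hash_pin(pin: str, salt: Optional[str] = None) -> Dict[str, str]:
    """Return a {'salt', 'hash'} record for a 4-8 digit numeric PIN."""
    if not (pin.isascii() and pin.isdigit() and 4 <= len(pin) <= 8):
        raise ValueError("PIN must be 4-8 digits")
    salt = salt or secrets.token_hex(16)
    digest = hashlib.sha256((salt + pin).encode("utf-8")).hexdigest()
    return {"salt": salt, "hash": digest}


def verify_pin(pin: str, stored: Dict[str, str]) -> bool:
    """Recompute the salted hash and compare in constant time."""
    candidate = hashlib.sha256((stored["salt"] + pin).encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored["hash"])


class AttemptLimiter:
    """Sliding-window counter of failed attempts per client IP."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 300) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, ip: str, now: Optional[float] = None) -> None:
        self._failures[ip].append(time.monotonic() if now is None else now)

    def is_blocked(self, ip: str, now: Optional[float] = None) -> bool:
        """True when the caller should answer 429 for this IP."""
        now = time.monotonic() if now is None else now
        attempts = self._failures[ip]
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()  # forget failures older than the window
        return len(attempts) >= self.max_attempts
```

A route would call is_blocked before checking the PIN and record_failure after each wrong attempt.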
/view route
- Checks ?token= first (validates + binds session), then existing session cookie, then PIN form (if a PIN is configured), then 403.
- Serves the same index.html with window.VIEWER_MODE = true injected.
- Invalid/expired tokens show templates/viewer_denied.html.
Viewer mode (JS)
- auth.js — bypasses M365 auth check entirely; adds viewer-mode class to <body>; shows scanner screen immediately.
- results.js — on DOMContentLoaded calls _loadViewerResults() which fetches GET /api/db/flagged (all items from the last completed scan session, joined with dispositions) and renders the grid directly — no SSE required.
- CSS (body.viewer-mode) hides: Sources/Options/Accounts sidebar panels; Scan/Stop buttons; profile bar; config-group buttons; resume banner; bulk-delete button; per-card delete button; data-subject delete button; Share button.
- Disposition tagging (select + Save) remains fully functional — /api/db/disposition has no auth guard.
- Filter bar, Excel export, Art.30 export, preview panel, and log remain accessible.
New files: routes/viewer.py, static/js/viewer.js, templates/viewer_pin.html, templates/viewer_denied.html
Files changed: app_config.py, gdpr_scanner.py, templates/index.html, static/style.css, static/js/auth.js, static/js/results.js, static/js/scheduler.js, routes/database.py
Fixed — memory exhaustion during large M365 scans
Addressed root causes of runaway memory growth (reported: up to 90 GB RSS) that could crash the host machine during scans of large Microsoft 365 tenants.
scan_engine.py
- Email body HTML stripped at collection time — Graph API returns the full body field (raw HTML, up to ~1 MB per message) for every email fetched. Previously, all message dicts — including the raw HTML — were accumulated in work_items before any scanning began. For 1 000 users × 2 000 emails this could mean >100 GB in work_items alone. The body is now converted to plain text immediately on collection (_precomputed_body), and the raw body and bodyPreview keys are deleted from the dict before it is queued. The processing loop reads _precomputed_body via pop() and dels it after use.
- work_items converted to deque before processing — items are now released from memory one by one via popleft() as they are processed, rather than keeping the entire list alive for the duration of the scan. gc.collect() is called immediately after conversion and after each checkpoint save.
- content bytes freed as early as possible in the file processing branch — raw download bytes are now del'd immediately after content.decode() (before the expensive NER/PII pass), and also in the no-hits else branch where they were previously kept alive until the next loop iteration.
- body_text freed after use in the email branch — del body_text added after _broadcast_card so large plain-text bodies do not linger until the next iteration.
- Memory guard before file downloads — uses psutil.virtual_memory().available to skip a file download and log a warning if fewer than 300 MB of RAM are available, preventing a single large file from pushing an already-pressured machine into OOM.
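The deque-based draining described above can be illustrated with a short sketch. The function name drain and the handle callback are hypothetical, and clearing the original list after conversion is an assumption about how the list's references are dropped; only the use of deque, popleft(), pop("_precomputed_body"), and gc.collect() comes from the entries.

```python
import gc
from collections import deque
from typing import Any, Callable, Dict, List


def drain(work_items: List[Dict[str, Any]],
          handle: Callable[[Dict[str, Any], str], None]) -> int:
    """Process items one by one so each can be garbage-collected once handled."""
    queue = deque(work_items)
    work_items.clear()  # assumption: drop the list's references so only the deque holds items
    gc.collect()
    processed = 0
    while queue:
        item = queue.popleft()                      # the deque no longer references this item
        body = item.pop("_precomputed_body", "")    # detach the large text from the dict
        handle(item, body)
        del body, item                              # release before the next iteration starts
        processed += 1
    return processed
```

The point of the pattern is that peak memory tracks the unprocessed remainder of the queue rather than the full collection for the whole scan.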
document_scanner.py
- PDF OCR page images freed page by page — convert_from_path() renders all pages at 300 DPI before scanning begins (~26 MB per A4 page; a 100-page PDF ≈ 2.6 GB). Each rendered PIL.Image is now nulled out (images[page_num-1] = None) immediately after OCR, so each page's memory is released as soon as it has been scanned instead of being held until the entire document finishes.
Changed — Sources panel is now resizable and collapsible
The KILDER sidebar panel now behaves consistently with the other sidebar sections.
- Collapsible — the ▾/▸ toggle was already wired up; collapse state is already persisted in localStorage. No change needed here.
- Resizable — a drag handle (sources-resize-handle) added at the bottom of the panel body. Dragging up shrinks the panel (scroll appears); dragging down is capped at the panel's natural content height — you cannot expand it beyond what is needed to show all sources. Height preference persisted in localStorage under gdpr_sources_h.
- Auto-fit on render — _fitSourcesPanel() is called at the end of every renderSourcesPanel() invocation. On first load and whenever sources are added or removed (e.g. connecting Google), the panel height snaps to exactly fit all visible sources. A previously saved smaller height is honoured only if it is still smaller than the new content height; dragging back to full height clears the saved preference.
- The old max-height: calc(5 * 26px) fixed cap is removed.
Files changed: templates/index.html, static/style.css, static/js/log.js (_fitSourcesPanel, _initSourcesResize), static/js/sources.js, static/js/results.js.
[1.6.13] — 2026-04-10
Added — developer tooling
- run_tests.sh — shell script to activate the venv and run the full test suite. Accepts any pytest arguments: ./run_tests.sh, ./run_tests.sh -q, ./run_tests.sh tests/test_app_config.py.
- Directory-scoped CLAUDE.md rules — routes/CLAUDE.md, static/js/CLAUDE.md, templates/CLAUDE.md, lang/CLAUDE.md replace the previous single-file context document. Each file is loaded automatically by Claude Code only when working in the relevant directory.
Fixed — documentation
- README.md project files table — removed four phantom entries (Dockerfile, docker-compose.yml, .dockerignore, scanner_audit.jsonl); corrected static/app.js description to "archived monolith — no longer loaded"; fixed manual paths (MANUAL-EN.md → docs/manuals/MANUAL-EN.md); added missing files: scan_engine.py, sse.py, checkpoint.py, app_config.py, cpr_detector.py, google_connector.py, static/style.css, static/js/*.js, routes/google_auth.py, routes/google_scan.py, run_tests.sh, docs/setup/ guides.
- docs/manuals/MANUAL-EN.md, docs/manuals/MANUAL-DA.md — version header updated from 1.6.11 → 1.6.13; footer updated from v1.6.8 → v1.6.13.
Changed — blueprint migration batch 3, 4, 5 (auth, database, export — migration complete)
All remaining direct @app.route registrations removed from gdpr_scanner.py. Flask now routes every API endpoint exclusively through its blueprint. Only GET / and GET /api/scan/stream (SSE) remain in gdpr_scanner.py.
routes/auth.py — rewritten with direct imports (batch 3, 6 routes):
- MSAL_OK, M365Connector, M365Error imported from m365_connector
- _load_config, _save_config imported from app_config
- Dead module-level globals _pending_flow and _auth_poll_result removed from gdpr_scanner.py
- Routes removed: /api/auth/status, /api/auth/start, /api/auth/poll, /api/auth/userinfo, /api/auth/signout, /api/auth/config
routes/database.py — rewritten with direct imports (batch 4, 15 routes):
- _get_db, DB_OK from gdpr_db; _set_admin_pin, _verify_admin_pin, _admin_pin_is_set from app_config; _clear_checkpoint, _DELTA_PATH from checkpoint; _extract_exif, _html_esc, _placeholder_svg from cpr_detector
- SCANNER_OK determined by local import document_scanner try/except
- db_export improved: uses NamedTemporaryFile instead of mktemp (safer for frozen apps)
- Email preview HTML: full CSS ruleset (*, *::before, *::after, img, table, scrollbar) from gdpr_scanner.py version restored
- Routes removed: /api/db/stats, /api/db/trend, /api/db/scans, /api/db/subject, /api/db/overdue, /api/db/disposition (×2), /api/db/deletion_log, /api/db/reset, /api/admin/pin (×2), /api/db/export, /api/db/import, /api/preview/<item_id>, /api/thumb
routes/export.py — rewritten with direct imports (batch 5, 3 routes):
- _get_db, DB_OK from gdpr_db; _GUID_RE, _resolve_display_name from app_config; M365PermissionError from m365_connector
- app.logger replaced with logging.getLogger(__name__)
- Dead delete_item() helper removed from gdpr_scanner.py (was unreachable; blueprint has its own copy)
- Routes removed: /api/export_excel, /api/export_article30, /api/delete_bulk
tests/test_routes.py — db_patch fixture updated: now patches routes.database._get_db / routes.database.DB_OK and routes.export._get_db / routes.export.DB_OK (was patching gdpr_scanner._get_db/gdpr_scanner.DB_OK which no longer have any effect). Two test_without_db_returns_503 tests updated to monkeypatch routes.database.DB_OK instead of gdpr_scanner.DB_OK.
[1.6.12] — 2026-04-10
Fixed — profile editor save drops users from non-active role groups
In _pmgmtSaveFullEdit (profile management editor), the save function applied the active role filter (_pmgmtRoleActive) to the list of checked checkboxes before saving. Since _pmgmtFilterAccounts hides rows via display:none but does not uncheck them, users from other role groups that remained checked (but hidden) were silently discarded on save. The role filter at save time is removed — all checked checkboxes are now captured regardless of which role tab is visible.
[1.6.11] — 2026-04-10
Changed — blueprint migration batch 1 (scan + app_routes)
15 direct @app.route registrations removed from gdpr_scanner.py. Flask now routes all of these exclusively through their blueprint counterparts, which previously existed as dead code shadowed by the direct routes.
routes/scan.py — rewritten with direct imports (was entirely non-functional as dead code due to bare-name NameErrors behind the shadow):
- Added GET /api/scan/status (new — was only in gdpr_scanner.py)
- Added GET /api/src_toggles, POST /api/src_toggles (new — was only in gdpr_scanner.py)
- scan_checkpoint_info — added missing check_only handling present in the gdpr_scanner.py version
- All state references converted from bare names to state._scan_lock / state._scan_abort; run_scan imported lazily from scan_engine inside _run to avoid circular imports
- _save_settings, _load_settings, _load_src_toggles, _save_src_toggles imported from app_config
- _checkpoint_key, _load_checkpoint, _clear_checkpoint, _load_delta_tokens, _DELTA_PATH imported from checkpoint
routes/app_routes.py — cleaned up:
- APP_VERSION now computed locally from VERSION file (was a bare-name reference to gdpr_scanner.py global)
- _LANG_DIR computed at module level; fixed sys / _sys alias mismatch in get_langs (bug in blueprint that never manifested while shadowed)
- _set_lang_override, _load_lang_forced imported directly from app_config
- get_langs — added missing langs.sort() present in the gdpr_scanner.py version
tests/test_routes.py — mock_connector fixture simplified: no longer needs to patch gdpr_scanner._connector since the direct scan/start route is gone; state.connector alone is sufficient. run_scan stub in test_authenticated_returns_started updated to target scan_engine directly.
Routes removed from gdpr_scanner.py: /api/about, /api/langs, /api/set_lang, /api/lang, /api/scan/status, /api/scan/start, /api/scan/stop, /api/scan/checkpoint, /api/scan/clear_checkpoint, /api/settings/save, /api/settings/load, /api/src_toggles, /api/delta/status, /api/delta/clear
Still in gdpr_scanner.py: GET / (root), GET /api/scan/stream (SSE — cannot be in a blueprint), and the auth, users, sources, database, export route groups (31 routes — next batches).
[1.6.10] — 2026-04-10
Fixed — Google Drive exportSizeLimitExceeded warning
Native Google Workspace files too large for Drive's export API (Google's server-side limit, distinct from the 20 MB local cap) now produce a clean skip message instead of a stray WARNING googleapiclient.http — Encountered 403 Forbidden with reason "exportSizeLimitExceeded" in the log. A logging.Filter subclass is installed on the googleapiclient.http logger at import time to suppress the duplicate external warning; the except HttpError block in _drive_iter detects the reason and logs [gdrive] skip '<name>' — file too large for Google export API (exportSizeLimitExceeded) with the file ID.
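Suppressing a specific third-party log line with a logging.Filter subclass can be sketched as follows. The class name DropExportSizeWarning and the substring match on the formatted message are assumptions of this example, not the project's actual filter.

```python
import logging


class DropExportSizeWarning(logging.Filter):
    """Reject records that mention exportSizeLimitExceeded; pass everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() applies %-style arguments, so the reason is matched even
        # when it is passed to the logger as a formatting argument.
        return "exportSizeLimitExceeded" not in record.getMessage()


# Installed once at import time on the library's logger.
logging.getLogger("googleapiclient.http").addFilter(DropExportSizeWarning())
```

A filter attached to a logger applies only to records logged directly on that logger, so other googleapiclient warnings and the application's own skip message are unaffected.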
Fixed — peak memory during large file/SMB scans (OOM risk reduction)
Three targeted buffer-lifetime fixes reduce peak RSS during large scans:
- cpr_detector.py — del content after writing the PDF bytes to a temp file in _scan_bytes_timeout. The 20 MB buffer was previously held in the main process for the entire duration of p.join(timeout) (up to 60 s), overlapping with the spawned subprocess's ~150–300 MB heap. It is now freed before the subprocess starts.
- scan_engine.py — del content after the thumbnail block in run_file_scan. The raw file buffer was kept alive through card dict construction and the start of the next loop iteration; it is now freed as soon as the thumbnail (or placeholder SVG) has been generated.
- file_scanner.py — PREFETCH_WINDOW reduced from 2 to 1. Halves the maximum number of concurrently-held SMB read buffers (from 2 × 20 MB to 1 × 20 MB).
[1.6.9] — 2026-04-10
Changed — frontend migrated to ES modules
Phase 2 complete: All 10 split JS files converted from <script defer> to <script type="module">.
- static/js/state.js introduced as the shared state module — exports a single S object holding all previously-global mutable state (flaggedData, _allUsers, _profiles, _fileSources, _srcPct, scan-running flags, etc.). All 10 modules import { S } from state.js and mutate its properties in place.
- Every function called from an inline HTML onclick= handler is explicitly exported via window.fnName = fnName at the bottom of each module (~80 exports across 10 files).
- var LANG retained in the inline <script> block (not a module) so it remains a true global accessible from all modules as a bare name.
- app.js retained as archive; no longer loaded by index.html.
Fixed — connector.js SyntaxError caused by duplicate function declarations
openFileSourcesModal and closeFileSourcesModal were declared twice at module top level in connector.js — once as redirect stubs pointing to the new unified Sources modal, and once as the old #fsrcBackdrop implementations left over from the pre-unification code. In ES module strict mode, duplicate function declarations in the same scope are a SyntaxError. The engine rejected the entire module at parse time, meaning none of its ~35 window.* exports were ever set. Symptoms:
- "Kilder" (Sources) button did nothing — window.openSourcesMgmt was never set
- Google status dot, file source loading, and sources panel re-render all silently failed — window.smGoogleRefreshStatus, window._loadFileSources etc. were undefined
- Sources panel showed only M365 sources even when Google Workspace was configured
Fix: removed the stale async function openFileSourcesModal / function closeFileSourcesModal bodies (lines 511–518). The redirect stubs at lines 505–506 (openSourcesMgmt('files')) are the correct new behaviour. Also removed the duplicate window.openFileSourcesModal and window.closeFileSourcesModal assignments that appeared twice in the exports block.
Fixed — Profiler modal did not open when _renderProfileMgmt threw
If _renderProfileMgmt() threw a runtime error (e.g. due to downstream failures from the connector.js parse error), openProfileMgmtModal would abort before reaching classList.add('open'), leaving the modal invisibly closed. The function now wraps both _renderProfileMgmt() and _pmgmtOpenEditor() in individual try-catch blocks. Any error is logged to the console; the modal opens regardless.
Fixed — blocking alert on every unhandled async error
ui.js contained a duplicate unhandledrejection listener that called alert() for every unhandled Promise rejection. Background API calls (Google status, file sources, src_toggles) could fire these alerts at page load, and browsers that had already suppressed one alert silently blocked all subsequent ones. Removed the alert() handler; the console.error handler is retained.
[1.6.8] — 2026-04-09
Fixed — memory pressure during large scans
SMB prefetch window reduced
- PREFETCH_WINDOW reduced from 5 to 2 in file_scanner.py. Peak in-flight SMB memory drops from ~250 MB to ~40 MB during large network share scans.
- MAX_FILE_BYTES reduced from 50 MB to 20 MB — files larger than 20 MB are skipped rather than buffered in full.
PDF subprocess concurrency limited
- A module-level threading.Semaphore(1) in cpr_detector.py ensures at most one PDF OCR subprocess runs at a time. Previously, multiple threads could each spawn a ~200 MB subprocess simultaneously, causing OOM under load.
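A module-level semaphore used as a concurrency cap can be sketched like this. The names _PDF_OCR_SLOT and run_pdf_job are illustrative, and the callable stands in for whatever spawns and joins the OCR subprocess.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One permit: at most one guarded job runs at any moment, regardless of
# how many scanner threads call run_pdf_job concurrently.
_PDF_OCR_SLOT = threading.Semaphore(1)


def run_pdf_job(job: Callable[[], T]) -> T:
    """Run job while holding the single OCR permit; other callers block until it is free."""
    with _PDF_OCR_SLOT:  # released automatically, even if job raises
        return job()
```

Raising the permit count would allow more subprocesses in parallel at the cost of proportionally higher peak memory.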
Google Workspace export buffer reduced
- _MAX_EXPORT_BYTES in google_connector.py reduced from 50 MB to 20 MB.
- _drive_iter now explicitly deletes the BytesIO buffer (del buf) before yielding each file's bytes, releasing the double-buffer peak immediately rather than waiting for GC.
Fixed — Excel and Article 30 exports missing sources
Gmail and Google Drive tabs added to Excel export
- SOURCE_MAP in routes/export.py was missing gmail, gdrive, local, and smb entries. Items from these sources were silently dropped — they were grouped internally but never written to a sheet.
- All eight source types now have dedicated tabs: Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network.
- The same fix applies to the inline Excel builder in gdpr_scanner.py.
Concurrent scan results captured in exports
- M365, Google Workspace, and file scans each create their own scan_id. The previous DB fallback used get_flagged_items(), which only returned results for the single most-recently-completed scan — silently dropping the other sources after page reload.
- New get_session_items(window_seconds=300) in gdpr_db.py returns items from all scans whose started_at falls within a 5-minute session window of the latest completed scan.
- Both export_excel() and export_article30() now use get_session_items() as their DB fallback. _build_article30_docx() also uses it directly.
Changed — "Email" source renamed to "Outlook"
The email source type (Microsoft Exchange mailboxes) is now consistently labelled Outlook everywhere:
- Source badges on result cards (SOURCE_BADGES.email)
- Filter bar dropdown
- _sourceLabel() in JS
- Excel tab label
- m365_src_email, m365_filter_email, m365_phase_emails in all three lang files (en.json, da.json, de.json)
- Article 30 report uses Exchange (Outlook) for the formal legal context
Rationale: with Gmail also present, "Email" was ambiguous. "Outlook" ties the source unambiguously to Microsoft 365.
Changed — progress bar moved above log panel
- #progressBar moved from below the topbar to just above #logWrap (above the activity log).
- The bar is now a permanent placeholder — always visible, never hidden. display: flex is the permanent state; display: none is no longer used.
- Background changed from var(--surface) to var(--bg) to match the log area. Border changed from border-bottom to border-top.
- New _clearProgressBar() helper resets phase, stats, ETA, and file fields on scan end, leaving the bar visually empty at idle. All previous style.display assignments removed.
Fixed — profile manager Cancel closes entire modal
- Clicking Cancel in the profile editor previously closed the editor panel but left the profile list modal open behind it. _pmgmtCloseEditor() now calls closeProfileMgmt() to dismiss the full modal.
- Dead stub function _pmgmtCancelEdit(id) {} removed.
Changed — exports available without running a new scan
- The filter bar (including Excel and Art.30 export buttons) is always visible on page load.
- Exports now use get_session_items() as the DB fallback, so the buttons produce a complete report from the previous session immediately after page reload — no new scan required.
Fixed — profile loading clobbered by scan start
- _save_settings() is called on every M365 scan start with a payload containing only M365 sources, user_ids, and options. It was writing this back via _profile_from_settings(), which has no google_sources field — permanently stripping Google and file source selections from the active profile after each scan.
- _save_settings() now preserves google_sources and file_sources from the existing profile when the payload does not include them, and rebuilds the combined sources array as M365 + google + file.
- _profile_from_settings() updated to pass through google_sources when present in the payload.
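The preserve-then-rebuild behaviour can be sketched as a pure function over dicts. The function name merge_scan_settings and the exact profile layout are assumptions for illustration; only the field names sources, google_sources, file_sources, user_ids, and options come from the entry.

```python
from typing import Any, Dict


def merge_scan_settings(existing: Dict[str, Any], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Apply an M365-only payload without losing Google and file source selections."""
    merged = dict(existing)
    merged.update(payload)  # payload wins for every key it actually carries
    for key in ("google_sources", "file_sources"):
        if key not in payload:
            # Keep the previous selection instead of silently dropping it.
            merged[key] = list(existing.get(key, []))
    # Rebuild the combined list as M365 + google + file.
    merged["sources"] = (
        list(payload.get("sources", []))
        + list(merged["google_sources"])
        + list(merged["file_sources"])
    )
    return merged
```

The function returns a new dict and leaves the existing profile untouched, which makes the merge easy to test in isolation.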
Fixed — "no results" shown during live scan after hard refresh
- Hard-refreshing the browser mid-scan caused the "Ingen CPR-numre fundet" card to appear immediately, before the SSE watchdog had detected the running scan. loadLastScanSummary() is no longer called directly on DOMContentLoaded. It is now called inside _sseWatchdog on the first status poll, only if no scan is currently running (_initialStatusChecked flag).
Fixed — progress bar source pill showing "Email" instead of "Outlook"
- _PHASE_SOURCE_MAP entry for Exchange mail phases still had label: 'Email'. Updated to 'Outlook' to match the rename applied elsewhere.
Changed — profile manager UI simplified
- Removed the redundant × close button from the list panel header — the editor panel's × already closes the entire modal.
- Removed the Luk (Close) button from the list panel footer — the footer now contains only + Ny profil.
- The editor footer Cancel/Annuller button replaced with a single Luk button that closes the entire modal (consistent with _pmgmtCloseEditor() behaviour).
Changed — log panel collapsible
- A ▾/▸ toggle button added to the left of the log header. Clicking it collapses or expands the log panel (resize handle + log body together, wrapped in #logSectionBody).
- State persists in localStorage via the existing toggleSection / restoreSectionStates mechanism (sc_logSection key).
Changed — log header buttons translated
- All, Errors, and Copy buttons in the log header now use data-i18n attributes and are fully translated in all three lang files.
- Translation keys added: btn_errors (da: Fejl, de: Fehler), log_copy (da: Kopier, de: Kopieren).
- Symbol prefix ⎘ removed from the Copy button label.
Changed — project documentation structure
- User manuals moved from project root to docs/manuals/ (MANUAL-DA.md, MANUAL-EN.md).
- Setup guides moved from project root to docs/setup/ (M365_SETUP.md, GOOGLE_SETUP.md).
- routes/app_routes.py and build_gdpr.py updated to reference the new manual paths.
- README.md links updated accordingly.
Fixed — disposition carry-forward across scans
When a previously reviewed file reappears in a new scan it now shows its prior disposition immediately on the result card — no need to open the preview panel first.
- get_prior_disposition(item_id) added to ScanDB in gdpr_db.py. Returns the stored disposition status if it differs from 'unreviewed', otherwise None.
- get_flagged_items() and get_session_items() in gdpr_db.py now LEFT JOIN dispositions and return COALESCE(d.status, 'unreviewed') as disposition on every row. Exports and the results grid therefore reflect the latest review decision without an extra round-trip.
- _with_disposition(card, db) helper added to scan_engine.py. Injects the prior disposition into a card dict before it is broadcast as scan_file_flagged. Used at all four broadcast points:
  - scan_engine.py — file scan (line ~297)
  - scan_engine.py — checkpoint resume re-emit loop (line ~357)
  - scan_engine.py — M365 scan (line ~456)
  - routes/google_scan.py — Google Workspace scan (line ~225)
- The frontend already reads f.disposition || 'unreviewed' for filter matching — no JS changes required.
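The LEFT JOIN / COALESCE pattern from this entry can be illustrated with an in-memory SQLite database. The table layout, column names, and the 'retain' status value below are assumptions made for the sketch; the real schema in gdpr_db.py is not shown in this changelog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical minimal schema, for illustration only.
    CREATE TABLE flagged_items (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE dispositions (item_id TEXT PRIMARY KEY, status TEXT);
    INSERT INTO flagged_items VALUES ('a', 'payroll.xlsx'), ('b', 'notes.docx');
    INSERT INTO dispositions VALUES ('a', 'retain');
    """
)

# Every flagged row gets a disposition; rows without a review fall back to 'unreviewed'.
rows = conn.execute(
    """
    SELECT f.id, f.name, COALESCE(d.status, 'unreviewed') AS disposition
    FROM flagged_items AS f
    LEFT JOIN dispositions AS d ON d.item_id = f.id
    ORDER BY f.id
    """
).fetchall()


def get_prior_disposition(conn, item_id):
    """Return the stored status unless it is missing or 'unreviewed'."""
    row = conn.execute(
        "SELECT status FROM dispositions WHERE item_id = ?", (item_id,)
    ).fetchone()
    if row is None or row[0] == "unreviewed":
        return None
    return row[0]


print(rows)
```

Because the fallback happens in SQL, callers never have to special-case items that were never reviewed.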
[1.6.7] — 2026-04-06
Fixed — emoji/symbol removal from all buttons and indicators
All UI buttons stripped of emoji and symbol prefixes
- Every interactive element in the topbar, filter bar, modals, and settings panels now uses plain text only. Removed: ▶, ■, 💾, ✕, ⚙, 🕐, ⬇, ⬆, 🗑, 📋, ☰, ⊞.
- Affected buttons: Scan, Stop, Save (profile), Clear (profile), Profiler/Profiles, Kilder/Sources, Indstillinger/Settings, Excel, Art.30, Slet/Delete (bulk), Liste/List, Gitter/Grid, Export (DB), Import (DB), Reset DB, scheduled scan title.
- Labels updated in templates/index.html and all three lang files (da.json, en.json, de.json).
Filter bar — Clear button standardised
- The × clear-filter button was an oversized bare symbol (font-size: 16px, no border). Replaced with a proper text button (Ryd / Clear / Löschen) matching the 26 px filter bar standard: bordered, border-radius: 5px, turns red on hover.
- Translation key m365_filter_clear added to all three lang files.
Scheduler indicator — "Next:" label translated
- The hardcoded 'Next: ' prefix in schedUpdateSidebarIndicator() is now t('m365_sched_next', 'Next'). Key added to all three lang files (da: Næste, de: Nächste).
- Clock emoji 🕐 removed from the indicator and from m365_sched_title in all lang files.
Fixed — result card badges, progress bar on browser refresh
Result card badges — standardised to 9 px pill style
- All result card badges now follow the app-wide badge standard: font-size: 9px; padding: 1px 5px; border-radius: 10px.
- .source-badge (OneDrive, Exchange, Gmail, etc.) had no CSS definition at all — it now has the correct size, padding, and border-radius.
- .cpr-badge reduced from 10px / 2px 6px to 9px / 1px 5px.
- .photo-face-badge, .special-cat-badge, .overdue-badge, .role-pill reduced from 10px / border-radius: 4px to 9px / 1px 5px / border-radius: 10px.
- Removed camera emoji (📷) from the Faces badge.
- .card-source gains flex-wrap: wrap so badges wrap on narrow cards instead of overflowing.
Progress bar — survives browser refresh
- Refreshing the browser mid-scan no longer causes the progress bar to appear without coloured segment pills.
- Four code paths now defensively set the correct running flag and call _renderProgressSegments() before the track is needed:
  - scan_start SSE handler (sets _m365ScanRunning).
  - scan_progress SSE handler (sets the flag matching the event's source field — covers mid-scan reconnects where scan_start has scrolled out of the 500-event replay buffer).
  - scan_phase SSE handler (infers source from phase text; fires before scan_progress in the replay sequence).
  - _sseWatchdog (sets _m365ScanRunning immediately on detecting a running scan via /api/scan/status, which checks the M365 lock).
Improved — scan responsiveness, UI layout, preview panel
Scan abort responsiveness
- Stop now takes effect within one Graph API round-trip across all collection phases. Previously, pressing Stop only checked the abort flag in the processing loop — the entire collection phase (email folder enumeration, OneDrive file listing, Teams channel fetching, SharePoint site iteration) ran to completion first, which could take 10+ minutes on large tenants.
- Abort checks added to: email folder loop (inside _scan_user_email), OneDrive items loop (delta and full modes in _scan_user_onedrive), Teams team loop and channel loop (inside _scan_user_teams), SharePoint site loop, and all outer per-user loops.
- Side effect: the scheduler no longer fails with "Manual scan already running" when a job fires shortly after the user pressed Stop — the lock is now released promptly.
Scheduler — graceful skip on lock contention
- When a scheduled job fires while a manual (or other scheduled) scan holds the lock, the job now logs "Skipped — a scan is already running" and returns cleanly. Previously it raised RuntimeError("Manual scan already running"), which was logged as a hard failure with a full traceback in the UI.
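A non-blocking lock attempt is one common way to get this "skip instead of fail" behaviour. The sketch below assumes a plain threading.Lock and uses hypothetical names (scan_lock, run_scheduled_job); how the scheduler actually holds its lock is not described here.

```python
import threading

scan_lock = threading.Lock()  # stands in for the shared scan lock


def run_scheduled_job(job):
    """Run job() if the lock is free; otherwise report a skip instead of raising."""
    if not scan_lock.acquire(blocking=False):
        # In the real scheduler this is where the skip message would be logged.
        return "skipped"
    try:
        job()
        return "completed"
    finally:
        scan_lock.release()


ran = []

# Simulate a manual scan holding the lock while the scheduled job fires.
scan_lock.acquire()
while_busy = run_scheduled_job(lambda: ran.append("scan"))
scan_lock.release()

# With the lock free, the same job runs normally.
when_free = run_scheduled_job(lambda: ran.append("scan"))
print(while_busy, when_free)
```

Returning a status rather than raising keeps a routine collision out of the error path, which is the behaviour change this entry describes.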
Filter bar — always visible, full-width, 26 px
- Filter bar was hidden until the first result arrived. It is now always visible.
- Moved from inside the left column to a direct child of .main, above .content-area. The preview panel's top edge now aligns with the grid's top edge rather than overlapping the filter bar.
- All filter bar controls standardised to height: 26 px (input, select, button) to match the topbar control standard. Redundant inline padding / font-size / border-radius stripped from button inline styles.
Preview panel
- Resizable: a 5 px drag handle on the left edge lets the user adjust the panel width. Handle uses pointer capture (setPointerCapture) so dragging over the iframe or releasing outside the browser window always terminates the drag cleanly. Width is persisted in sessionStorage and restored when the panel is next opened.
- Min width: 280 px; max width: 70% of window width.
- Fixed: clicking the close (×) button had no effect. Root cause: panel.style.width set by the resize logic is an inline style and overrides the CSS class .hidden { width: 0 }. Fix: closePreview() now clears panel.style.width = '' before adding .hidden; openPreview() restores the saved width when showing the panel.
- Email preview iframe: added * { max-width: 100% }, overflow-x: hidden, table { table-layout: fixed }, and img { height: auto } to prevent wide HTML emails from creating a horizontal scrollbar inside the 420 px panel.
- Email preview iframe scrollbar: matches the app's 4 px thin scrollbar style.
Thin scrollbars everywhere
- .grid-area (results grid) and .log-panel now use the same 4 px thin scrollbar style (scrollbar-width: thin; width: 4px) as #accountsList and #sourcesPanel. Previously they used the system-default wide scrollbar.
Scheduler next scan indicator
- #schedNextIndicator was a plain display: block div with no height constraint, causing it to sit taller than adjacent topbar controls. Fixed to height: 26 px; display: inline-flex; align-items: center with a border and border-radius matching the surrounding pill buttons.
Log and preview resize — pointer capture fix
- Both resize handles (logResizeHandle, previewResizeHandle) switched from mousedown + document.addEventListener('mousemove'/'mouseup') to pointerdown + setPointerCapture. The old approach lost the drag when the cursor moved over the iframe (which has its own input context) or left the browser window. Pointer capture routes all pointer events to the handle until pointerup / pointercancel regardless of cursor position.
Manuals updated (MANUAL-DA.md, MANUAL-EN.md)
- Version 1.6.4 → 1.6.6.
- Section 2: activity log description now mentions copy button, error filter, and resize handle.
- Section 4.4: progress bar description updated — source pill labels listed, old "current phase" wording removed.
- Section 8: profiles section updated for loader model, ✕ clear button, and explicit mention that Google/file sources are saved.
[1.6.6] — 2026-04-06
Improved — UX polish II (clusters, badges, log panel, progress bar)
Pill clusters
- KONTI section header: Alle / Ingen / ↻ converted from bare text links to a pill cluster (height: 22px), matching the pattern used in the Profile editor.
- Profile list rows (Profiler modal): Brug + Kopier grouped into a pill cluster; Slet kept as a separate standalone danger button.
Badge sizing
- Platform badges (M365, GWS, M365+GWS) and role badges (Ansat, Elev, Anden) standardised to font-size: 9px; padding: 1px 5px; border-radius: 10px across the main sidebar account list and the profile modal. Previously the sidebar used larger inline styles (font-size: 10px; padding: 2px 7px) that made badges visually heavier than in the modal.
Account rows
- Main sidebar account row padding reduced from 4px 0 to 2px 0, matching the compact density of the profile modal account list.
- SKU debug search icon button standardised to height: 26px to match the adjacent role filter cluster.
Log panel — full rebuild
- Color-coded log levels: .log-err (red, var(--danger)), .log-ok (green, var(--success)), .log-warn (orange, #e0922a). Level classes were already passed to log() but had no CSS — all entries appeared in the same muted colour.
- Live scanning indicator: a single italic ▶ filename line at the bottom of the log updates in place via scan_file SSE events. Never scrolls; clears automatically when the scan finishes. Avoids flooding the log with per-file entries.
- Copy button (⎘ Copy) in the log header copies all log text to clipboard; flashes ✓ Copied for 1.5 s.
- Log level filter (All / Errors) in log header — hides info lines when Errors mode is active.
- Resizable: drag handle at the top edge of the panel resizes vertically and snaps to the nearest full line (row = 18 px: 16 px line-height + 2 px margin; 2–30 lines range).
- Default height set to 8 lines exactly (height: 154px = 8 × 18 + 10 px padding).
- Persistent across page refresh: up to 300 lines saved to sessionStorage; restored on DOMContentLoaded; cleared on new scan start.
- Smart scroll: auto-scroll only triggers when already within 24 px of the bottom — scrolling up to read earlier entries stops the follow behaviour.
Progress bar — segmented multi-source
- Replaced the single progressFill bar with a dynamically segmented track (#progressTrack). One segment per active scan type (M365 / Google / Files), equal width, separated by a 1 px gap. Segments are added at scan start and removed as each source finishes.
- Color-coded: M365 = blue (var(--accent)), Google = dark green (#3a7d44), Files = purple-gray (#7a6a9e).
- Each segment fills independently — M365 at 80% and Google at 20% are shown simultaneously with no interference. Eliminates the _maxPct hack (bar stuck at 100% after first source finishes).
- Backend (scan_engine.py, routes/google_scan.py): all scan_progress SSE events now include "source": "m365" / "google" / "file". Frontend routes each event to the correct segment by d.source.
- Stats (X / Y) and ETA only update from M365 events — the only source with meaningful totals and time estimates.
Progress bar — phase display
- #progressWho replaces the plain-text phase span. Renders a colour-coded source pill ([Email], [OneDrive], [Gmail], [GDrive], [Local], etc.) followed by the user's full display name.
- Source pill uses the universal badge standard: font-size: 9px; padding: 1px 5px; border-radius: 10px; font-weight: 500.
- _setProgressPhase() identifies the source from the full phase string via _PHASE_SOURCE_MAP, then splits on — to extract the username. Phases without a dash (e.g. 📂 folder: 3 msg(s)) fall back to the last known user (_progressCurrentUser).
- _resolveDisplayName() resolves email addresses in Google phase strings to the user's display name via _allUsers. Also strips trailing count suffixes (: 3 file(s)).
- Pill labels standardised: Email, OneDrive, SharePoint, Teams, Gmail, GDrive, Local — matching the source names used elsewhere in the UI.
- All 25 scan_phase strings now produce a pill: 📂 emoji maps to Email; Google Workspace — email phases resolve to display name; file scan startup uses Files — {label}; Google per-user phase uses Google Workspace — {email}.
- Source map ordering: Google Workspace matched before Gmail so the GWS startup phase shows [Gmail] only when no broader match applies.
- Fixed: email regex was missing the i flag (/E-?mail.../u → /E-?mail.../iu), causing Danish "Indsamler e-mails" to fall through to plain text.
Scheduler — Google and file sources
- Scheduled scans now run Google Workspace sources.
- _build_options extracts google_sources from the profile (with legacy fallback for profiles that stored gmail/gdrive inside sources). A separate Google scan block runs after the file scan loop using _google_scan_lock.
Profile dropdown — loader model
- Removed the selectable "Standard (sidebar)" / "Default (sidebar)" empty option. Profiles are now loaders, not persistent modes — selecting one pushes its settings into the sidebar; the sidebar is always the live state.
- Replaced with a disabled placeholder "— Vælg profil —" shown when no profile has been loaded.
- Added a ✕ clear button (#profileClearBtn) that appears next to the dropdown when a profile is active. Clicking it clears _activeProfileId and resets the dropdown to the placeholder without touching the sidebar — the loaded settings remain.
- clearActiveProfile() function added.
- Lang keys: m365_profile_default removed, m365_profile_placeholder added (da/en/de).
Bug fixes
- Profile role filter respected at scan time: getSelectedUsers() now filters the returned list by _activeRoleFilter, preventing hidden-role users from being silently included in M365 scans and profile saves via the topbar quick-save.
- Profile editor role filter respected at save time: _pmgmtSaveFullEdit now excludes IDs whose role doesn't match _pmgmtRoleActive. Prevents "select all → filter by staff → save" from silently saving student accounts that were checked but hidden.
- Profile editor role filter state reset on open: _openEditorForProfile resets _pmgmtRoleActive = '' so a stale filter from a previous session doesn't silently hide accounts when the editor is reopened.
- Google and file sources not saved in profiles: _pmgmtSaveFullEdit now checks whether the checkboxes are actually present in #peSourcesPanel (DOM query) rather than using !!window._googleConnected and _fileSources.length > 0 as proxies. The async status fetches could complete after the editor opened, leaving the panel without checkboxes while the proxy read true, silently discarding the user's selection.
- Profile editor now re-renders #peSourcesPanel when smGoogleRefreshStatus() resolves or _loadFileSources() completes if the editor is open and the panel has no Google/file checkboxes yet.
[1.6.5] — 2026-04-04
Improved — UX polish pass (topbar, sidebar, clusters)
Topbar
- All topbar elements normalised to height: 26px: Scan/Stop buttons, profile dropdown, save button, config cluster, stats pill, icon buttons (🔍, ?, 🌙). Previously each had independent padding, making the topbar uneven.
- Config buttons (Profiler, Kilder, Indstillinger) extracted from #profileBar into a dedicated .config-group pill cluster separated by a .topbar-sep divider — visually distinct from the profile selector group.
- Data subject lookup moved from the sidebar footer into the topbar as a 🔍 icon button (left of ?). Sidebar strip removed.
Sidebar
- KILDER, INDSTILLINGER, and KONTI sections are now collapsible. Each header gets a ▾/▸ chevron (section-collapse-btn). Collapse state persists in localStorage per section. KONTI releases its flex:1 when collapsed.
- Role filter buttons (Alle / Ansat / Elev) converted to a pill cluster (.role-filter-btn) matching the topbar cluster pattern. SKU debug button stays separate.
- Date preset buttons (1 år / 2 år / 5 år / 10 år / Alle) converted to a pill cluster.
- All pill cluster buttons, input fields, and date picker set to height: 26px — the universal control height across the UI.
- Toggle size reduced from 36×20px to 32×18px with knob gap tightened from 3px to 2px. Knob-to-track ratio improved for a sleeker look.
- Role filter buttons display live counts: "Alle (277)", "Ansat (62)", "Elev (254)". Updated by updateRoleFilterCounts(), called from renderAccountList().
Empty state
- On load, fetches /api/db/stats. If a previous scan exists, shows a summary card (hits, unique CPR subjects, items scanned, date, sources) instead of the bare placeholder. The placeholder is shown below as a "start new scan" prompt. Summary hidden when a scan starts.
Added — Single-instance lock
- ~/.gdprscanner/app.lock — an exclusive process lock is acquired at startup to prevent two instances from running simultaneously against the same database and settings files.
- Desktop (build_gdpr.py launcher): lock is checked before Flask starts. If another instance holds the lock the app prints "GDPRScanner is already running." to stderr and exits immediately.
- Server (gdpr_scanner.py): same guard in interactive web-UI mode (not headless — batch runs may legitimately coexist with a live server).
- Uses fcntl.flock(LOCK_EX | LOCK_NB) on macOS/Linux and msvcrt.locking on Windows. The OS releases the lock automatically on crash or clean exit — no stale lockfiles.
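A minimal sketch of such a single-instance lock is shown below. The helper name acquire_instance_lock and the temporary lock path are hypothetical. The POSIX branch uses fcntl.flock as named above; the Windows branch, which locks one byte in non-blocking mode, is an assumption about how msvcrt.locking might be applied.

```python
import os
import tempfile

try:
    import fcntl  # available on macOS/Linux
except ImportError:  # Windows
    fcntl = None
    import msvcrt


def acquire_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if another holder exists."""
    handle = open(lock_path, "a+")
    try:
        if fcntl is not None:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        else:
            handle.seek(0)
            msvcrt.locking(handle.fileno(), msvcrt.LK_NBLCK, 1)
    except OSError:
        handle.close()
        return None
    # Keep the handle open for the life of the process; closing it releases the lock.
    return handle


# Demo against a throwaway path rather than ~/.gdprscanner/app.lock.
demo_path = os.path.join(tempfile.mkdtemp(), "app.lock")
first = acquire_instance_lock(demo_path)
second = acquire_instance_lock(demo_path)  # a second open of the same file is refused
first.close()
third = acquire_instance_lock(demo_path)   # succeeds once the first holder is gone
third.close()
print(first is not None, second is None)
```

Since the lock belongs to the open file rather than to the file's existence, a leftover app.lock after a crash does not block the next start.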
Added — Port auto-increment + stdout port signal
- gdpr_scanner.py (server mode): if the requested port (default 5100, or --port N) is already in use, the server auto-increments up to 100 ports and logs a warning: [!] Port 5100 in use — using 5101 instead.
- build_gdpr.py launcher (desktop mode): find_free_port() was already present; auto-increment was already the desktop behaviour.
- Both modes emit GDPR_PORT=<n> (flush=True) to stdout before Flask starts — a machine-readable signal parseable by any parent process or wrapper script that needs to know the actual bound port.
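The port probing and the stdout signal might look roughly like this. The helper names (pick_port, port_signal, parse_port_signal) are hypothetical and are not the project's find_free_port(); only the GDPR_PORT=<n> line format and the 100-port window come from the entry.

```python
import socket


def pick_port(start=5100, max_tries=100, host="127.0.0.1"):
    """Return the first port in [start, start + max_tries) that can be bound."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # in use, try the next one
        # The probe is closed here; another process could still grab the port
        # before the server binds it, so this is best effort.
        return port
    raise RuntimeError(f"no free port in {start}..{start + max_tries - 1}")


def port_signal(port):
    """Format the machine-readable line a parent process can look for."""
    return f"GDPR_PORT={port}"


def parse_port_signal(line):
    """Parent-side helper: extract the port from a signal line, else None."""
    key, _, value = line.strip().partition("=")
    return int(value) if key == "GDPR_PORT" and value.isdigit() else None


# Demo: occupy a port, then ask for a port starting at the occupied one.
busy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
busy.bind(("127.0.0.1", 0))
busy_port = busy.getsockname()[1]
chosen = pick_port(start=busy_port)
print(port_signal(chosen), flush=True)  # flush so a piped parent sees it immediately
busy.close()
```

A wrapper script can read the child's stdout line by line and call parse_port_signal on each line until it gets a number.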
Added — Built-in user manual (#31 ✅)
- MANUAL-EN.md / MANUAL-DA.md — standalone end-user manuals in English and Danish. 14 sections covering all major features: Getting started, Sources panel, Running a scan, Understanding results, Reviewing results, Bulk actions, Profiles, Scheduler, Export & email, Article 30 report, Data subject lookup, Settings, Retention policy, and FAQ. Written for school administrators and municipal compliance officers — no technical knowledge assumed.
- GET /manual — new Flask route in routes/app_routes.py. Reads ?lang=da|en (falls back to the current UI language). Finds the appropriate .md file relative to the project root, converts it to a fully self-contained styled HTML page, and returns it without any external dependencies.
- _md_to_html(md) — zero-dependency Markdown-to-HTML converter using only Python's re and html stdlib modules. Handles headings with anchor IDs, fenced code blocks, tables, ordered/unordered lists, blockquotes, bold, italic, inline code, links, and horizontal rules.
- ? button in the topbar (right of the theme toggle) — opens the manual in a dedicated window (960×800, resizable) using the current langSelect value. In the packaged desktop app the window is a native pywebview window (pywebview.api.open_manual()); in the browser it opens via window.open(). Repeated clicks reuse the same window rather than spawning new ones. Does not interrupt any in-progress scan.
- Manual page: 860 px max-width layout, language switcher (DA ↔ EN), 🖨 print button, @media print CSS (toolbar hidden, h2 page breaks, external link URLs appended for paper printing).
Fixed — Manual not found in packaged app
- MANUAL-DA.md and MANUAL-EN.md were missing from the PyInstaller bundle — build_gdpr.py now includes all MANUAL-*.md files as root-level data files (--add-data MANUAL-*.md:.). The route already used sys._MEIPASS for the frozen path; the files simply weren't being copied in.
- build_gdpr.py LAUNCHER_CODE — added open_manual(lang) method to the Api class. Creates a new pywebview window for the manual URL; reuses the existing window if already open.
Fixed — Email routing, profile source persistence, SMTP error messages
routes/email.py — structural rewrite
- Removed __getattr__ module-level hook. Bare-name lookups inside function bodies do not go through __getattr__ (Python resolves them via LOAD_GLOBAL directly from __dict__), so _load_smtp_config, _save_smtp_config, _build_excel_bytes, and _send_report_email all raised NameError at runtime when the blueprint route won instead of the app-level duplicate.
- _load_smtp_config, _save_smtp_config now imported directly from app_config.
- _build_excel_bytes imported from routes.export.
- _send_report_email(xl_bytes, fname, smtp_cfg, recipients) was called in three places but never defined anywhere. Now defined as a module-level helper: builds a MIMEMultipart("mixed") message with the Excel as a MIMEBase attachment and sends via the configured SMTP server.
- _send_email_graph moved into the blueprint (was only used by the duplicate app-level routes).
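The __getattr__ pitfall is easy to reproduce in isolation. The sketch below builds a throwaway module (the name email_routes_demo and the placeholder host are made up) whose module-level __getattr__ supplies _load_smtp_config: attribute access from outside goes through the hook, while a bare-name reference inside one of the module's own functions does not.

```python
import types

source = '''
def __getattr__(name):
    # Module-level hook (PEP 562): consulted only for failed attribute access
    # on the module object, e.g. some_module._load_smtp_config.
    if name == "_load_smtp_config":
        return lambda: {"host": "smtp.example.invalid"}
    raise AttributeError(name)


def route():
    # Bare global name: looked up in the module dict, then builtins.
    # The hook above is never called for this lookup.
    return _load_smtp_config()
'''

demo = types.ModuleType("email_routes_demo")
exec(source, demo.__dict__)

# Attribute access on the module goes through __getattr__ and works.
via_attribute = demo._load_smtp_config()

# The same name used inside the module's own function fails.
try:
    demo.route()
    bare_name_error = None
except NameError as exc:
    bare_name_error = exc

print(via_attribute, type(bare_name_error).__name__)
```

Importing the helpers explicitly, as the fix does, puts the names in the module's global namespace where bare-name lookup can find them.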
gdpr_scanner.py
- Removed four duplicate app-level routes that were masking the broken blueprint: GET /api/smtp/config, POST /api/smtp/config, POST /api/smtp/test, POST /api/send_report.
- from routes.email import _send_report_email added after blueprint imports so scan_scheduler.py (_m._send_report_email) and the CLI headless path both resolve the function correctly.
SMTP error messages (routes/email.py)
- All three auth/connection error handlers (smtp_test, send_report, _send_report_email) now classify errors by host type before choosing a message:
  - DNS / connection failure (nodename nor servname, getaddrinfo, Connection refused, timeout) → "Could not connect to SMTP server — check hostname and port."
  - Corporate M365 host (office365, microsoft) + auth error → M365 admin centre / enable Authenticated SMTP guidance.
  - Personal Microsoft host (outlook, live, hotmail) + auth error → App Password guidance at account.microsoft.com/security.
  - Gmail host + auth error → App Password guidance at Google Account Security.
  - Anything else → raw SMTP error, unmodified.
- Previously 530 (generic "authentication required") unconditionally triggered the M365 admin centre message even when the configured host was Gmail or a personal Outlook account.
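The classification order above can be sketched as a small pure function. The function name, the returned category labels, and the AUTH_MARKERS tuple are assumptions made for illustration; the host and connection substrings are the ones listed in the entry.

```python
CONNECTION_MARKERS = (
    "nodename nor servname", "getaddrinfo", "connection refused", "timeout", "timed out",
)
# Assumption: how an "auth error" is recognised is not spelled out in the changelog.
AUTH_MARKERS = ("530", "535", "authentication")


def classify_smtp_error(host, error_text):
    """Return a category label; the caller would map it to a user-facing message."""
    host_l, err_l = host.lower(), error_text.lower()
    # Connection problems are checked first, regardless of host.
    if any(marker in err_l for marker in CONNECTION_MARKERS):
        return "connection"
    if any(marker in err_l for marker in AUTH_MARKERS):
        # Corporate M365 is tested before personal Microsoft so that a host
        # such as outlook.office365.com lands in the corporate branch.
        if any(key in host_l for key in ("office365", "microsoft")):
            return "m365_admin"
        if any(key in host_l for key in ("outlook", "live", "hotmail")):
            return "microsoft_app_password"
        if "gmail" in host_l:
            return "gmail_app_password"
    return "raw"  # anything else: show the SMTP error unmodified


examples = {
    "gmail_530": classify_smtp_error("smtp.gmail.com", "530 Authentication Required"),
    "m365_535": classify_smtp_error("smtp.office365.com", "535 5.7.3 Authentication unsuccessful"),
    "outlook_535": classify_smtp_error("smtp-mail.outlook.com", "535 auth failed"),
    "dns": classify_smtp_error("smtp.gmail.com", "[Errno 8] nodename nor servname provided"),
    "other": classify_smtp_error("mail.example.org", "550 mailbox unavailable"),
}
print(examples)
```

With this ordering a 530 from a Gmail host no longer ends up with M365 guidance, which is the regression the entry describes.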
static/app.js — profile source persistence
- _pmgmtSaveFullEdit was overwriting google_sources and file_sources with [] whenever the editor was opened and those checkboxes weren't rendered (Google not connected / file sources not loaded). Now preserves the profile's existing google_sources when _googleConnected is false, and file_sources when _fileSources is empty.
- _applyProfile built _pendingProfileSources by filtering against _fileSources — which is empty at profile-apply time (async load not yet complete), so the pending list was always empty and file source checkboxes defaulted to checked=true regardless of the profile. Now stores profile.file_sources directly (falling back to non-M365/Google IDs from profile.sources).
- Added _pendingGoogleSources (mirrors _pendingProfileSources for Google). Set in _applyProfile from profile.google_sources; consumed in renderSourcesPanel() the first time Gmail/Drive checkboxes appear (when Google connects after the profile was applied). Previously they defaulted to checked=true.
Fixed — Progress bar and profile sources
static/app.js
- Progress bar fluctuated and ETA flickered when M365, Google, and file scans ran concurrently. Root cause: all three scan types broadcast scan_progress on the same SSE stream and their events interleave. Fixed with two changes: (1) _maxPct tracks the highest pct seen across all concurrent scans — the bar only ever moves forward; (2) ETA and stats counter are only written when the incoming event actually carries those fields (d.eta !== undefined, d.total present) — a Google/file event without ETA no longer wipes the ETA set by the M365 event a millisecond earlier.
- progressPhase was being overwritten with the current filename by scan_progress events, causing it to alternate between phase text ("Google Workspace scan…") and individual filenames. Current filename now correctly updates progressFile instead.
- Profile editor (_openEditorForProfile) only passed profile.sources (M365 IDs) to _renderEditorSources — Google and Local/SMB source checkboxes were always unchecked when reopening a saved profile. Now passes the union of sources, google_sources, and file_sources.
Added — SMB pre-fetch cache (#22 ✅)
- SMB file scans now decouple directory traversal from file reads. A 5-slot sliding-window ThreadPoolExecutor keeps up to 5 reads in flight simultaneously, with a 60-second hard timeout per file. A stalled NAS read produces an error card in the UI and the scan continues — the scan thread is never blocked.
- file_scanner.py — _smb_collect() new method walks the SMB tree (directory listing only, no reads), yielding file descriptors plus _COLLECT_SKIP / _COLLECT_ERROR sentinels for over-size files and listing failures. _iter_smb() rewritten: phase 1 collects all candidates; phase 2 resolves sentinels immediately then feeds real files through the executor window. PREFETCH_WINDOW = 5 and SMB_READ_TIMEOUT = 60 constants added. Local scanner (_iter_local) untouched.
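The sliding-window idea can be sketched with concurrent.futures as below. This is an illustration under assumptions, not the _iter_smb() implementation: iter_prefetched, fake_read, and the file names are hypothetical, and the sentinel handling from phase 2 is left out. Only the two constants mirror the entry. Requires Python 3.9+ for cancel_futures.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

PREFETCH_WINDOW = 5
SMB_READ_TIMEOUT = 60


def iter_prefetched(paths, read_file, window=PREFETCH_WINDOW, timeout=SMB_READ_TIMEOUT):
    """Yield (path, content, error) in input order while keeping `window` reads in flight."""
    todo = iter(paths)
    pending = deque()
    executor = ThreadPoolExecutor(max_workers=window)
    try:
        while True:
            # Top up the window before waiting on the oldest read.
            while len(pending) < window:
                try:
                    path = next(todo)
                except StopIteration:
                    break
                pending.append((path, executor.submit(read_file, path)))
            if not pending:
                break
            path, future = pending.popleft()
            try:
                item = (path, future.result(timeout=timeout), None)
            except FutureTimeout:
                # The stalled worker thread keeps running, but the consumer moves on.
                item = (path, None, "read timed out")
            except Exception as exc:
                item = (path, None, str(exc))
            yield item
    finally:
        executor.shutdown(wait=False, cancel_futures=True)


def fake_read(path):
    """Stand-in for an SMB read; one file is made artificially slow."""
    if path == "stalled.docx":
        time.sleep(1.0)
    return path.encode()


results = list(
    iter_prefetched(["a.txt", "stalled.docx", "b.txt"], fake_read, window=2, timeout=0.2)
)
print(results)
```

One design consequence worth noting: a timed-out read still occupies a worker until it returns, so a window wider than one is what lets the remaining files continue past it.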
Added — PDF OCR via multiprocessing (#20 ✅)
- PDF files are now scanned in local/SMB file scans. Previously excluded because Tesseract/Poppler subprocesses could hang indefinitely.
- cpr_detector.py — new _worker_scan_pdf() (module-level, required for spawn context) runs document_scanner.scan_pdf() in a fresh subprocess and returns results via a multiprocessing.Queue. New _scan_bytes_timeout() wraps PDF scanning: writes content to a temp file, spawns the worker via multiprocessing.get_context("spawn"), joins with a 60-second hard timeout, and terminates the process tree if it exceeds the limit. Non-PDF files delegate straight to _scan_bytes().
- scan_engine.py — run_file_scan() now calls _scan_bytes_timeout() instead of _scan_bytes() for all files. Stub added to module-level injected globals.
- gdpr_scanner.py — _scan_bytes_timeout imported from cpr_detector and injected into scan_engine.
- file_scanner.py — .pdf removed from FILE_SCAN_EXTENSIONS exclusion; all default extensions now included.
Fixed — Post-v1.6.4 release bugs (continued)
routes/google_scan.py
- _run_google_scan() crashed with UnboundLocalError: cannot access local variable 'data' when user_emails was not passed in the request. The fallback data.get("user_emails", []) referenced the request-handler local data which is not in scope inside the scan function — data and options are the same object. Removed the redundant fallback.
routes/export.py — Article 30 report
- SOURCE_LABELS was missing gmail, gdrive, local, and smb — all four source types rendered as raw keys in every table (inventory, Art. 9, photo, deletion audit log). Now map to "Gmail", "Google Drive", "Local files", "Network / SMB".
- Per-source breakdown table only iterated M365 sources (email, onedrive, sharepoint, teams) — Google and local/SMB findings were completely absent from the summary even when present. Loop now covers all eight source types.
- Methodology bullet (a30_method_4) only mentioned Microsoft Graph sources. Updated in en.json, da.json, de.json, and the hardcoded fallback to also mention Google Workspace (service account + domain-wide delegation) and local/SMB file shares.
scheduler.py
- Removed stale file. scan_scheduler.py fully supersedes it; routes/scheduler.py and gdpr_scanner.py both import from scan_scheduler. The old file had diverged significantly (missing UUID migration, connector auto-reconnect, file source resolution, debug SSE events).
templates/index.html
- Removed 9 unused CSS classes: .sidebar-sub, .btn-secondary, .log-ok, .log-err, .log-warn, .user-bar, .sign-out-btn, .source-badge, .srcmgmt-coming-soon.
Added — Personal Google account OAuth (#30 ✅)
- Personal Google accounts can now be scanned without a service account or Workspace admin. A device-code OAuth flow (mirrors M365 delegated mode) lets a user sign in interactively with their own Google account.
- google_connector.py — new PersonalGoogleConnector class: get_device_code_flow() / complete_device_code_flow() static methods hit Google's device-auth endpoint; _refresh_if_needed() handles transparent token refresh via google.oauth2.credentials.Credentials; list_users() returns a single-item list (the signed-in user) so the scan engine needs no changes.
- iter_gmail_messages() / iter_drive_files() share the same iteration logic as GoogleConnector via extracted _gmail_iter() / _drive_iter() module-level helpers.
- Token persisted to ~/.gdprscanner/google_token.json (chmod 600). New helpers: save_personal_token, load_personal_token, delete_personal_token.
- routes/google_auth.py — four new endpoints: GET /api/google/personal/status, POST /api/google/personal/start, POST /api/google/personal/poll, POST /api/google/personal/signout. Background thread blocks on complete_device_code_flow; frontend polls — identical pattern to M365 delegated auth.
- routes/state.py — google_pending_flow and google_poll_result added.
- templates/index.html — auth-mode toggle (Workspace / Personal account) in the Google pane; personal section with client ID/secret fields and inline device-code box (reuses .device-code-box CSS); workspace setup guide hidden in personal mode.
- static/app.js — smGoogleSetMode() switches visible sections; smGoogleRefreshStatus() now checks both /api/google/auth/status and /api/google/personal/status in parallel; smGooglePersonalStart(), smGooglePersonalPoll(), smGooglePersonalSignOut() added.
- lang/en.json, da.json, de.json — 14 new keys each.
Fixed — Post-v1.6.4 release bugs
checkpoint.py
- Scheduled scans crashed with "string indices must be integers, not 'str'" when user_ids in the profile contained plain ID strings rather than dicts. _checkpoint_key() now handles both formats: u["id"] if isinstance(u, dict) else u.
scan_engine.py
- Same root cause as above: run_scan() now normalises user_ids entries to dicts at the top of the function before any access, so both plain strings and {id, displayName, userRole} objects work correctly.
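Both fixes deal with the same mixed input shape. The sketch below shows the failure and the two defensive patterns; the helper names (checkpoint_user_id, normalise_user_ids) and the sample IDs are hypothetical, while the conditional expression is the one quoted in the checkpoint.py entry.

```python
def checkpoint_user_id(u):
    """Accept either a plain ID string or a dict carrying an "id" key."""
    return u["id"] if isinstance(u, dict) else u


def normalise_user_ids(user_ids):
    """Turn plain ID strings into dicts so later code can always index by key."""
    return [u if isinstance(u, dict) else {"id": u} for u in user_ids]


# A profile may contain both shapes.
mixed = ["a1", {"id": "b2", "displayName": "Bo", "userRole": "staff"}]

keys = [checkpoint_user_id(u) for u in mixed]
normalised = normalise_user_ids(mixed)

# The original crash: indexing a plain string with a string key.
plain = mixed[0]
try:
    plain["id"]
    original_error = None
except TypeError as exc:
    original_error = str(exc)

print(keys, normalised, original_error)
```

Normalising once at the top of a function, as run_scan() now does, avoids repeating the isinstance check at every access.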
scan_scheduler.py
- file_sources in profiles are stored as source ID strings by the JS frontend. The scheduler now resolves each ID to its full source dict via _load_file_sources() before calling run_file_scan(). Plain path strings are also handled as a fallback.
- Full traceback is now included in the scheduler_error SSE event so failures are diagnosable from the UI status panel without needing the CLI.
routes/app_routes.py
- /api/langs (language selector endpoint) only globbed *.lang files — after the v1.6.3 JSON migration the language dropdown was silently empty. Now globs both *.json and *.lang with deduplication, matching the existing logic in gdpr_scanner.py.
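Globbing two extensions and deduplicating by language code could look like the following. The function name and the rule that a .json file wins over a legacy .lang duplicate are assumptions made for this sketch; the entry states only that both patterns are globbed and deduplicated.

```python
import tempfile
from pathlib import Path


def list_language_codes(lang_dir):
    """Return sorted (code, filename) pairs, one per language code."""
    found = {}
    # JSON is globbed first so it takes precedence over a legacy .lang duplicate.
    for pattern in ("*.json", "*.lang"):
        for path in Path(lang_dir).glob(pattern):
            found.setdefault(path.stem, path.name)
    return sorted(found.items())


# Demo with a throwaway directory containing both formats.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("da.json", "en.json", "en.lang", "de.lang"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")
    languages = list_language_codes(tmp)

print(languages)
```

Keying the dictionary on the file stem is what collapses en.json and en.lang into a single dropdown entry.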
static/app.js
- Profile editor (_pmgmtSaveFullEdit) did not update file_sources or google_sources when the user changed source checkboxes — both fields were carried forward unchanged via ...profile. Now splits #peSourcesPanel checkboxes by data-source-type and writes file_sources, google_sources, and sources explicitly on every save.
gdpr_scanner.py
- /api/langs only globbed *.lang files — after migrating to JSON, the language selector showed nothing. Now globs both *.json and *.lang, deduplicates by language code, and sorts alphabetically.
- SOURCE_LABELS was missing gmail, gdrive, local, and smb entries — these sources now get correct tab names in Excel export and correct labels in the Article 30 report.
- Excel export filename changed from m365_scan_*.xlsx to gdpr_scan_*.xlsx.
- Article 30 methodology paragraph now mentions Google Workspace scanning via service account with domain-wide delegation. DA and DE lang files updated to match.
routes/google_scan.py
- Gmail and Google Drive result cards showed the email address as account name instead of the user's display name. Fixed: _user_display_map is now built from list_users() and applied to each scanned item.
- Role badge (Elev/Ansat/Anden) was missing on Google results when user_emails came from the request rather than list_users(). Fixed: role map is now populated in both cases.
- Google scan now emits google_scan_done instead of scan_done so the progress bar stays open until both M365 and Google scans finish.
scan_engine.py
- File scan now emits file_scan_done instead of scan_done so the progress bar stays open until all active scan types finish.
- pct in both Google and file scan progress events was hardcoded at 50; it now increments from 10 to a max of 90.
static/app.js
- Progress bar now tracks three independent flags (_m365ScanRunning, _googleScanRunning, _fileScanRunning) and only hides when all active scans have completed.
- google_scan_done and file_scan_done SSE event handlers added.
- Source filter dropdown (search results) and bulk delete source dropdown were missing Gmail, Google Drive, Lokal, and Netværk (SMB) options.
- Profile preset buttons (1 år / 2 år / etc.) were never highlighted when applying a profile: matching used years × 365.25 but profiles store years × 365. Fixed.
- _fileScanRunning flag set correctly at scan start from fileSources.length.
routes/state.py / routes/google_scan.py
- M365 and Google scans shared _scan_lock. Google now uses _google_scan_lock and _google_scan_abort so both platforms scan in parallel.
templates/index.html
- Sources, Settings and Schedule indicator moved from sidebar section header / footer into the topbar, to the right of the Profiles button.
- Source filter dropdown and bulk delete dropdown updated with Google and file source options.
README.md
- All emoji removed (role badges, action icons, status indicators). Plain text equivalents used throughout.
- lang/da.json and lang/de.json updated with Google Workspace methodology text for the Article 30 report.
[1.6.4] — 2026-04-03
Added — Full profile editor (#15e ✅)
- Two-panel modal (profile list left, full editor right). Click a profile row to edit it; the active row is highlighted.
- + Ny profil button in the left panel footer — creates a blank profile and opens the editor immediately, works when no profiles exist.
- Editor sections match the sidebar exactly:
- Navn — name + description fields
- Kilder — same rendering as the main KILDER panel, including M365, Google Workspace, and file/SMB sources
- Konti — role filter (Alle / Ansat / Elev), text search, Alle / Ingen select buttons, + Tilføj konto manual entry, platform badges (M365 / GWS / M365+GWS), role badges
- Indstillinger — date picker with year presets (1/2/5/10/Alle), Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse (MB), Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder — all as toggle sliders
- Opbevaringspolitik — always visible; Opbevaringsår + Regnskabsår slut dropdown
- Annuller, ×, and Gem all close the full modal. Auto-opens first profile on modal open.
- Profile editor defaults match the main window: accounts are unchecked by default; only explicitly saved user_ids are shown as checked.
Fixed — Parallel M365 + Google scanning
- M365 and Google scans shared _scan_lock, so starting both simultaneously caused "Google scan already running" immediately after scan start. Fixed: routes/state.py now defines _google_scan_lock and _google_scan_abort as separate threading primitives; routes/google_scan.py uses these instead of the M365 lock. Both platforms now scan in parallel.
Fixed — User selection defaults
- All users now default to selected: false on page load (previously true). The profile editor follows the same rule.
- "Vælg alle" button renamed to "Alle" to match the main sidebar.
[1.6.3] — 2026-04-03
Fixed — Post-v1.6.3 release bugs
static/app.js
- Source toggle state (Email, OneDrive, SharePoint, Teams, Gmail, Google Drev) not persisted across restarts. Fixed: all toggles now save to ~/.gdprscanner/src_toggles.json via a new /api/src_toggles endpoint and are restored on page load.
- Deselecting M365 sources in Source Management did not update account badges; M365 + GWS was still shown. Fixed: badge now uses hasM365Src and effectiveGws computed inside renderAccountList(), and M365 source toggles now call renderAccountList() on change.
- Google-only scans reported wrong account count in live log (e.g. "26 konto(er)" when 1 was selected). Root cause: getSelectedUsers() returned all selected users including Google-only accounts. Fixed: getSelectedUsers() now returns only M365 users; Google users are counted separately for the log message. The "select at least one account" guard no longer blocks Google-only scans.
- Cross-platform identity matching used email prefix (anne.hansen before @). Changed to displayName matching since both M365 and GWS are maintained from the same AD source.
- _onGoogleSourceToggle() and M365 source toggles did not call renderAccountList(), so account badges were not updated when toggling sources in Source Management.
routes/google_auth.py
- Removed /api/google/auth/sources endpoint and src_gmail / src_drive keys from the status response; replaced by unified /api/src_toggles endpoint in gdpr_scanner.py.
app_config.py / gdpr_db.py / checkpoint.py / google_connector.py / m365_connector.py / scan_scheduler.py / scheduler.py / gdpr_scanner.py
- All data files moved from ~/ root into ~/.gdprscanner/ subdirectory with cleaner short names (scanner.db, config.json, token.json, etc.). A migration shim runs on first startup and moves existing ~/.gdpr_scanner_* files automatically.
- MAINTAINER.md updated with new file locations.
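A sketch of what such a migration shim might do, run here against a temporary directory instead of the real home directory. The function name migrate_data_files and the exact old-to-new pairing in RENAMES are assumptions; the changelog lists the old and new names but not which maps to which.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed pairing of old dotfile names to new short names (illustrative only).
RENAMES = {
    ".gdpr_scanner.db": "scanner.db",
    ".gdpr_scanner_config.json": "config.json",
    ".gdpr_scanner_token.json": "token.json",
}


def migrate_data_files(home: Path) -> list[str]:
    """Move old ~/.gdpr_scanner_* files into ~/.gdprscanner/; return new names moved."""
    target = home / ".gdprscanner"
    target.mkdir(exist_ok=True)
    moved = []
    for old_name, new_name in RENAMES.items():
        src, dst = home / old_name, target / new_name
        # Never overwrite a file that already exists at the new location.
        if src.exists() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(new_name)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    (fake_home / ".gdpr_scanner.db").write_bytes(b"sqlite")
    first_run = migrate_data_files(fake_home)
    second_run = migrate_data_files(fake_home)  # nothing left to move
    migrated_ok = (fake_home / ".gdprscanner" / "scanner.db").read_bytes() == b"sqlite"
```

Because the existence checks make the function idempotent, it is safe to call on every startup rather than only the first.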
scan_scheduler.py
- Scheduled scans ignored file_sources from the profile because _build_options() dropped them. Fixed: file_sources now included in opts, and run_file_scan() is called for each file source in the profile during a scheduled run (#15f ✅).
static/app.js — profile save
- file_sources in profile was hardcoded to []; it now saves the actual checked file sources from buildScanPayload() (#15f).
Fixed — Post-release (continued)
routes/state.py / routes/google_scan.py
- M365 and Google scans shared _scan_lock, so starting both simultaneously caused "Google scan already running" immediately. Fixed: Google scan now uses its own _google_scan_lock and _google_scan_abort so both platforms can run in parallel.
static/app.js — profile editor (#15e ✅)
- Profile editor drawer implemented: two-panel modal (profile list left, full editor right). Click any profile to open its editor.
- Editor sections: Navn + beskrivelse, Kilder (same rendering as main KILDER panel, including Google and file sources), Konti (with Alle / Ansat / Elev role filter, text search, Alle / Ingen select buttons, + Tilføj konto manual add), Indstillinger (full mirror of sidebar — date picker with year presets, Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse, Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder, all as toggle sliders), Opbevaringspolitik (always visible — Opbevaringsår + Regnskabsår slut).
- + Ny profil button in left panel footer: creates a blank profile and opens the editor immediately, works even when no profiles exist.
- Annuller, ×, and Gem all close the full modal (not just the editor panel).
- Auto-opens first profile's editor when modal opens.
static/app.js — defaults
- All users now default to selected: false on load (were true). Profile editor follows the same rule: only explicitly saved user_ids are shown as checked.
- "Vælg alle" button renamed to "Alle" to match the main sidebar.
routes/state.py
- Added _google_scan_lock and _google_scan_abort as separate threading primitives for Google scans.
Added — Google Workspace full integration
Accounts panel
- Google Workspace users now appear in the Accounts panel alongside M365 users. Each row shows a platform badge: M365 (blue) or GWS (green).
- Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all; check none → empty.
- Role filter (All / Ansat / Elev) works across both platforms.
- _mergeGoogleUsers(): dedicated async function fetches /api/google/scan/users and merges results into _allUsers independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 loadUsers().
Scanning
- Selected Google user emails are now passed as user_emails to /api/google/scan/start, so only selected accounts are scanned, not all users in the domain.
- routes/google_scan.py: _scan_lock and _scan_abort now imported directly from routes.state (previously relied on __getattr__, which does not resolve bare names inside function bodies and caused NameError on scan start). user_emails now read from the top-level request body in addition to the nested options dict.
- Gmail scan result cards now correctly labelled "Gmail" (source_type was email → mapped to "Exchange"). Fixed in google_connector.py.
- Gmail and Google Drive cards now show styled source badges (badge-gmail red tint, badge-gdrive blue tint). Previously fell back to unstyled.
Profiles
- Google sources (gmail, gdrive) and selected Google user emails are now saved to scan profiles and correctly restored on load.
- Fixed googleSources const temporal dead zone: declaration moved before use in buildScanPayload().
Added — OU-based role classification for Google Workspace (#23 Phase 1 ✅)
- classification/google_ou_roles.json: maps Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: /Elever → student, /Personale → staff.
- google_connector.py: list_users() fetches orgUnitPath via projection=full and classifies each user via classify_ou_role().
- routes/google_scan.py: role map built from list_users() result; each scan card now carries the correct user_role.
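As an illustration of prefix-based OU classification, a sketch under stated assumptions: the real classify_ou_role() signature and the JSON layout of google_ou_roles.json are not shown in this changelog, so the flat prefix-to-role dict and the longest-prefix rule below are hypothetical. The "other" fallback follows the role value mentioned elsewhere in this file.

```python
# Assumed shape of the mapping: OU path prefix -> role.
OU_ROLES = {"/Elever": "student", "/Personale": "staff"}


def classify_ou_role(org_unit_path: str, mapping: dict[str, str] = OU_ROLES) -> str:
    """Return the role of the longest matching OU prefix, or "other"."""
    best_prefix, role = "", "other"
    for prefix, candidate in mapping.items():
        # Match the OU itself or any child OU, but not a sibling such as /Eleverne.
        is_match = (
            org_unit_path == prefix
            or org_unit_path.startswith(prefix.rstrip("/") + "/")
        )
        if is_match and len(prefix) > len(best_prefix):
            best_prefix, role = prefix, candidate
    return role


roles = [classify_ou_role(p) for p in ("/Elever/7A", "/Personale", "/Eleverne", "/")]
```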
Added — Documentation split
- M365_SETUP.md: step-by-step Microsoft 365 setup (app registration, permissions, auth modes, headless config, troubleshooting).
- GOOGLE_SETUP.md: step-by-step Google Workspace setup (service account, domain-wide delegation, scopes, OU role mapping, troubleshooting).
- README.md: trimmed from 774 to 611 lines; setup detail moved to the two new files.
Changed — i18n migrated from .lang to JSON (#27 ✅)
- lang/en.json, da.json, de.json: 709 keys each, standard flat JSON.
- app_config.py: loader now prefers .json, falls back to .lang for backward compatibility.
- Old .lang files retained as fallback; can be deleted once JSON files are confirmed working.
Changed — skus/ renamed to classification/ (#29 ✅)
- skus/education.json → classification/m365_skus.json
- skus/google_ou_roles.json → classification/google_ou_roles.json
- All path references updated in m365_connector.py, google_connector.py, routes/users.py, gdpr_scanner.py, build_gdpr.py, all lang files, and static/app.js.
Changed — UI polish (icons removed, badges added)
- Role filter buttons (Staff / Student), scan option labels (Delta scan, Scan photos, Retention policy), and account list role badges — all emoji removed, plain text only.
- Role badge on account rows changed from emoji icon button to plain outline pill (Ansat / Elev / Anden).
- Scan result cards: role icon prefix replaced with small inline badge.
- All six lang files cleaned of emoji in role, mode, option, and Art.30 inventory keys.
- Progress bar fixed at 32px height — emoji in filenames no longer push the bar taller.
- Scrollbars in Sources and Accounts panels thinned to 4px.
Fixed — Account list / source interaction
- Deselecting all sources now empties the account list.
- Deselecting M365 sources no longer disables Accounts when Google sources are still checked.
_updateAccountsVisibility()now checks all source types, not just M365.
Fixed — Role override cycling
- Role override never cleared for users loaded with a pre-existing override (roleOverride: true from a previous session) because _autoRole was never populated from the server. Fixed: replaced _autoRole comparison with a step counter, so after 3 clicks the override clears regardless of the original auto role.
- Role badge changed from <span> to <button type="button"> inside label rows. This prevents label click-forwarding to the checkbox (which caused the first user to receive the override instead of the clicked user).
[1.6.2] — 2026-03-28
Added — Google Workspace account list and source integration
- static/app.js: Google Workspace users (292 users in testing) now appear in the Accounts panel with a GWS badge (blue = M365, green = GWS). M365 users carry an M365 badge.
- Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all accounts; check none → empty list.
- Role filter buttons (All / Ansat / Elev) work across both platforms.
- _mergeGoogleUsers(): dedicated function fetches /api/google/scan/users and merges results into _allUsers independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 loadUsers().
- startScan(): selected Google user emails now passed as user_emails to /api/google/scan/start, so only the chosen accounts are scanned (previously ignored selection and scanned all users).
- routes/google_scan.py: _scan_lock and _scan_abort now imported directly from routes.state (previously relied on __getattr__, which doesn't resolve bare names inside function bodies and caused NameError on scan start). user_emails now read from the top-level request body in addition to the nested options dict.
Added — OU-based role classification for Google Workspace (#23 Phase 1)
- classification/google_ou_roles.json: new file mapping Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: /Elever → student, /Personale → staff.
- google_connector.py: list_users() now fetches orgUnitPath via projection=full and classifies each user via classify_ou_role(). Each user dict now includes userRole and orgUnitPath.
Added — Documentation split
- M365_SETUP.md: step-by-step Microsoft 365 setup guide (app registration, permissions, auth modes, headless config, role classification, troubleshooting).
- GOOGLE_SETUP.md: step-by-step Google Workspace setup guide (service account, domain-wide delegation, OAuth scopes, OU role mapping, troubleshooting).
- README.md: trimmed from 774 to 611 lines. Auth/permissions/headless detail moved to setup guides. Two new "Microsoft 365" and "Google Workspace" sections link to the respective files.
Changed — UI polish (icons removed)
- Role filter buttons (Staff / Student) — emoji removed, plain text only.
- Scan option labels (Delta scan, Scan photos for faces, Retention policy) — emoji removed.
- Account list role badge: replaced clickable emoji button (👔/🎓/👤) with plain outline pill badge (Ansat / Elev), matching the platform badge style.
- Scan result cards: role icon prefix removed from account name; replaced with small inline outline badge.
- All three lang files (en.lang, da.lang, de.lang) cleaned of emoji in m365_role_staff, m365_role_student, m365_opt_delta, m365_opt_scan_photos, m365_opt_retention, m365_mode_delegated, m365_bulk_overdue_btn, a30_inv_staff, a30_inv_students.
Fixed — Profile save/load with Google sources
- Google sources (gmail, gdrive) and selected Google user emails now saved in scan profiles and correctly restored on load.
- googleSources const declaration moved before use in buildScanPayload(); fixed temporal dead zone ReferenceError.
Fixed — Account list / source interaction
- Deselecting all sources now empties the account list (previously kept showing all users).
- Selecting only Google sources no longer disables the Accounts section (previously greyed out when no M365 sources were checked).
_updateAccountsVisibility()now checks all source types, not just M365.
Added — Google Workspace role classification via OU mapping (#23 Phase 1)
- classification/google_ou_roles.json: new file mapping Google Workspace Organizational Unit paths to roles (student / staff). Edit to match your school's OU structure; no code change required. Default prefixes: /Elever → student, /Personale → staff.
- google_connector.py: list_users() now requests orgUnitPath from the Admin Directory API and classifies each user via classify_ou_role(). Each user dict now includes userRole and orgUnitPath.
- routes/google_scan.py: role map built from list_users() result; scan cards now carry the correct user_role instead of always "other".
Fixed — Post-split and app runtime bugs (additional)
routes/database.py
- Settings panel showed "Scanned: 0, Flagged: 0, Scans: 0" because get_stats() returns {} when no scan has a finished_at timestamp (interrupted or first-run). Fixed: stats endpoint now queries flagged_items and scans tables directly so counts are always correct regardless of scan completion state. Stats populate on app start from existing DB data; no re-scan required.
- DB export produced a ZIP but nothing was downloaded in the native app because URL.createObjectURL() does not work in pywebview. Fixed: exportDB(), exportExcel(), and exportArticle30() in static/app.js now detect pywebview and call window.pywebview.api.save_db_export() / save_excel() / save_article30(), which use the native macOS/Windows save dialog. Browser fallback preserved.
- Added save_db_export() and save_article30() methods to the pywebview Api class in build_gdpr.py. Fixed save_excel filename from m365_scan_ to gdpr_scan_.
scan_engine.py
- run_file_scan() called _db.start_scan(), which does not exist; the correct method is begin_scan(). The silent exception meant _db_scan_id was always None and no file scan results were ever written to the database. Fixed.
Added — Personal use disposition value (#28)
Staff members using work equipment for private purposes will now appear in scan
results. Added personal-use as a disposition value so reviewers can explicitly
mark items as outside the organisation's compliance scope.
- New disposition: Personal use — out of scope in both UI dropdowns
- Art. 30 report labels it "Personal use — out of GDPR scope (Art. 2(2)(c))"
- Translated in EN / DA / DE
Legal basis: GDPR Article 2(2)(c) — processing by a natural person in the course of a purely personal activity is outside GDPR scope.
Added — pytest test suite (#26)
112 tests across 4 modules — all passing.
| Test module | Tests | What it covers |
|---|---|---|
| tests/test_document_scanner.py | 36 | is_valid_cpr, extract_matches, scan_docx, scan_xlsx, _scan_bytes: CPR detection, false-positive suppression, binary edge cases |
| tests/test_app_config.py | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption |
| tests/test_checkpoint.py | 18 | _checkpoint_key stability, save/load/clear, wrong-key isolation, delta token round-trip |
| tests/test_db.py | 24 | Scan lifecycle, save_item, CPR hash-only storage, lookup_data_subject, dispositions, export/import cycle |
Support files:
- tests/conftest.py: shared fixtures docx_with_cpr, docx_no_cpr, xlsx_with_cpr, xlsx_no_cpr, txt_with_art9, binary_garbage, tmp_db
- pytest.ini: test discovery config
Run with: pytest tests/ from the project root.
Fixed — Six post-split runtime bugs
All bugs introduced by the #25 module split — the pre-split code had none of these.
gdpr_scanner.py
- _current_scan_id imported as a string binding (from sse import _current_scan_id), so scan_stream() always saw "". The SSE replay filter excluded all events and the progress bar showed nothing. Fixed: reads sse._current_scan_id at call time via module reference.
- _connector assignment only updated the local module global, not _state.connector. scan_engine.py reads _state.connector, which stayed None after sign-in, so every scan reported "Not connected to M365". Fixed: all five _connector = ... assignments now dual-assign _connector = _state.connector = ....
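The first bug is standard Python import behaviour and can be reproduced in isolation. In the sketch below a throwaway in-memory module stands in for sse.py; only the attribute name _current_scan_id comes from the changelog, and the scan ID value is made up.

```python
import sys
import types

# Stand-in for sse.py, registered so that normal import statements find it.
sse = types.ModuleType("sse")
sse._current_scan_id = ""
sys.modules["sse"] = sse

# Copies the value bound at import time ("") into this namespace.
from sse import _current_scan_id

# Later, the module rebinds its own global at scan start.
sse._current_scan_id = "scan-001"

stale = _current_scan_id     # still the import-time value
live = sse._current_scan_id  # looked up on the module at call time
```

Reading the attribute through the module object is what "via module reference" means in the fix: the lookup happens at call time, so it always sees the latest rebinding.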
scan_engine.py
- _load_role_overrides, _resolve_display_name, _scan_text_direct were undefined bare names inside run_scan() and raised NameError at runtime. Fixed: proper imports from app_config and cpr_detector.
- PHOTO_EXTS and SUPPORTED_EXTS were stub empty sets at import time; injection via _se.PHOTO_EXTS = ... replaced the module attribute but function bodies still saw the empty stubs. Fixed: scan_engine.py now imports these directly from cpr_detector at module level.
- scan_progress SSE event broadcasts index and pct; the UI handler read d.completed, so the progress bar was always 0%. Fixed in static/app.js: handler now reads d.pct (pre-calculated server-side) and populates progressStats (n / total) and progressEta elements that were wired in HTML but never written.
- Source collection (OneDrive, SharePoint, Teams) completed silently with no count in the live log. Fixed: broadcasts 📁 OneDrive — user: N file(s), 🌐 SharePoint: N file(s), 💬 Teams — user: N file(s) after each successful collection.
cpr_detector.py
- _scan_text_direct() called ds.scan_text(), which internally calls extract_cpr_and_dates(), a function that does not exist in document_scanner.py (pre-existing bug in that module). Result: every email body scan returned zero CPRs. Same bug affected .txt files and the unknown-extension fallback in _scan_bytes(). Fixed: all three replaced with ds.extract_matches(text, 1, "text"), which works correctly.
static/app.js
- scan_file_flagged handler called renderCards(), which is not defined anywhere. This caused a silent ReferenceError in the browser: cards were pushed to flaggedData but never displayed. Fixed: replaced with applyFilters(), which calls renderGrid() and shows the filter bar.
- scan_done handler never showed the filter bar (containing Excel and Art.30 export buttons) when results existed; only the stats numbers updated. Fixed: scan_done now explicitly shows the filter bar and calls applyFilters() when flaggedData.length > 0.
[1.6.1] — 2026-03-28
Changed — Split gdpr_scanner.py into focused modules (#25)
gdpr_scanner.py was 5554 lines. It is now 3591 lines and delegates to five
focused modules. No behaviour changes — all existing routes, blueprints, and
imports continue to work unchanged.
New files:
| Module | Lines | Contents |
|---|---|---|
| sse.py | 52 | broadcast(), _sse_queues, _sse_buffer, _current_scan_id |
| checkpoint.py | 79 | _save_checkpoint(), _load_checkpoint(), _checkpoint_key(), delta token load/save |
| app_config.py | 553 | i18n, Article 9 keywords, config, admin PIN, profiles, settings, SMTP, file sources, Fernet encryption |
| cpr_detector.py | 381 | _scan_bytes(), _extract_exif(), _detect_photo_faces(), _make_thumb(), _get_pii_counts() |
| scan_engine.py | 1006 | run_scan(), run_file_scan(): M365 and file-system scan orchestration |
Changed files:
- gdpr_scanner.py: imports and re-exports from all five modules; keeps Flask app init, @app.route definitions, blueprint registration, and __main__ entry point
- routes/state.py: _scan_lock and _scan_abort moved here from gdpr_scanner.py so scan_engine.py can reference them without a circular import
Isolation: each new module is importable in isolation with fallback stubs,
enabling unit tests (#26) to import cpr_detector or checkpoint without
pulling in Flask, MSAL, or the full application.
[1.6.0] — 2026-03-28
Changed — Rename: M365 Scanner → GDPRScanner (#24)
The tool now scans M365, Google Workspace, local file systems, and SMB network shares. The name "M365 Scanner" was misleading. This release renames everything with no behaviour changes.
Files renamed:
| Old | New |
|---|---|
| m365_scanner.py | gdpr_scanner.py |
| m365_db.py | gdpr_db.py |
| build_m365.py | build_gdpr.py |
| build_m365.sh | build_gdpr.sh |
| start_m365.sh (created by install_macos.sh) | start_gdpr.sh |
| start_m365.bat (created by install_windows.ps1) | start_gdpr.bat |
Config files renamed on first startup (migration shim):
| Old ~/ path | New ~/ path |
|---|---|
| .m365_scanner_config.json | .gdpr_scanner_config.json |
| .m365_scanner.db | .gdpr_scanner.db |
| .m365_scanner_token.json | .gdpr_scanner_token.json |
| .m365_scanner_delta.json | .gdpr_scanner_delta.json |
| .m365_scanner_settings.json | .gdpr_scanner_settings.json |
| .m365_scanner_smtp.json | .gdpr_scanner_smtp.json |
| .m365_scanner_role_overrides.json | .gdpr_scanner_role_overrides.json |
| .m365_scanner_file_sources.json | .gdpr_scanner_file_sources.json |
| .m365_scanner_machine_id | .gdpr_scanner_machine_id |
| .m365_scanner_checkpoint.json | .gdpr_scanner_checkpoint.json |
| .m365_scanner_schedule.json | .gdpr_scanner_schedule.json |
| .m365_scanner_msal_cache.bin | .gdpr_scanner_msal_cache.bin |
| .m365_scanner_lang | .gdpr_scanner_lang |
The migration runs silently at startup — existing scan history, credentials, settings, and role overrides are preserved automatically.
Intentionally unchanged:
- m365_connector.py: kept as-is; it is the Microsoft Graph connector and the m365_ prefix accurately describes what it connects to
- i18n keys with the m365_ prefix that describe M365-specific UI elements (Azure credential fields, device code flow screens): the prefix is correct
Run with:
python gdpr_scanner.py [--port 5100]
[1.5.9] — 2026-03-28
Added — Google Workspace scanning (#10)
Organisations running mixed Microsoft/Google environments can now scan Gmail and Google Drive alongside M365 in a single tool. The Google Workspace tab in Source Management is now fully active (was "Coming soon" stub).
New files:
- google_connector.py: service account OAuth with domain-wide delegation; Gmail message + attachment iterator; Drive file iterator with automatic export of native Docs/Sheets/Slides → DOCX/XLSX/PPTX before scanning
- routes/google_auth.py: /api/google/auth/status, /connect, /disconnect
- routes/google_scan.py: /api/google/scan/start, /cancel, /users
Changed files:
- routes/state.py: google_connector slot added
- m365_scanner.py: Google blueprints registered; GOOGLE_CONNECTOR_OK / GOOGLE_AUTH_OK flags; connector auto-restored from saved key on startup
- templates/index.html: Google tab activated; full credentials pane with key file upload, admin email field, Gmail + Drive source toggles, and setup guide
- static/app.js: smGoogleRefreshStatus(), smGoogleConnect(), smGoogleDisconnect(), getGoogleScanOptions(), key file reader
- requirements.txt, install_windows.ps1, install_macos.sh: three new optional Google API dependencies
- lang/en.lang, da.lang, de.lang: 14 new i18n keys each
Dependencies (optional — scanner starts without them):
pip install google-auth google-auth-httplib2 google-api-python-client
Setup required in Google Workspace Admin Console:
- Create a Google Cloud project; enable Gmail API, Drive API, Admin SDK
- Create a service account; download the JSON key; enable domain-wide delegation
- In Workspace Admin → Security → API Controls → Domain-wide delegation add the service account client ID with scopes: gmail.readonly, drive.readonly, admin.directory.user.readonly
Scan results write to the same SQLite database with source_type = "gmail"
or "gdrive" — Article 30 reports and data subject lookups cover both platforms
automatically.
[1.5.8] — 2026-03-28
Fixed — Scheduled scans invisible in the browser (#21)
Scheduled scans now show full live progress in the browser — progress bar, phase text, flagged cards, and log entries — exactly like manual scans.
Root cause (critical): When run as python m365_scanner.py, the module
loads as __main__. The scheduler's import m365_scanner as _m loaded a
second copy of the module with its own empty _sse_queues. Events from
_m.broadcast() went nowhere — the browser's SSE connection was reading from
__main__'s queues.
Fix: sys.modules["m365_scanner"] = sys.modules[__name__] at the top of
the module ensures all imports share the single running instance.
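The effect of the sys.modules alias can be demonstrated without running a script as __main__. In this sketch a throwaway module object plays the role of the running script; the names m365_scanner_demo and running are hypothetical stand-ins for m365_scanner and sys.modules[__name__].

```python
import importlib
import sys
import types

# Pretend this object is the already-running __main__ module with live SSE queues.
running = types.ModuleType("__main_demo__")
running._sse_queues = ["browser-queue"]

# The fix: register the running instance under its importable name.
sys.modules["m365_scanner_demo"] = running

# A later import (e.g. from the scheduler) now finds the alias in sys.modules
# and returns the same object instead of loading a second copy from disk.
imported = importlib.import_module("m365_scanner_demo")
same_instance = imported is running
shared_queues = imported._sse_queues
```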
Fixed — SSE event replay for late-connecting browsers (#21)
Opening the browser mid-scan (manual or scheduled) now replays all buffered progress events so the live log and card grid are fully populated.
Additional root causes and fixes:
- _autoConnectSSEIfRunning() only attached scheduler_* listeners on page load, so replayed scan_phase, scan_file_flagged, and scan_done events were silently ignored
- Idle SSE connections died silently (Flask/Werkzeug threading); the browser had no live connection when a scheduled scan fired minutes/hours later
Changes — Python (m365_scanner.py):
- Module identity fix: sys.modules["m365_scanner"] = sys.modules[__name__]
- Added _current_scan_id global: unique timestamp-based ID set at the start of every scan (M365 and file scans) and cleared after scan_done
- broadcast() injects scan_id into every SSE event payload
- scan_stream() filters the replay buffer to only include events matching the current scan_id, preventing stale replay from previous scans
- New sse_replay / sse_replay_done marker events bracket the replayed block so the browser can distinguish replay from live events
- New GET /api/scan/status lightweight endpoint returning {running, scan_id}
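The scan_id filter and the replay markers might combine along these lines. This is a sketch under assumptions: events are modelled as plain dicts, build_replay is a hypothetical helper, and returning nothing at all when no event matches is a choice made here, not something the changelog specifies.

```python
def build_replay(buffer: list[dict], current_scan_id: str) -> list[dict]:
    """Return buffered events for the current scan, bracketed by marker events."""
    matching = [e for e in buffer if e.get("scan_id") == current_scan_id]
    if not matching:
        return []  # assumed: no markers when there is nothing to replay
    return [{"event": "sse_replay"}, *matching, {"event": "sse_replay_done"}]


buffer = [
    {"event": "scan_done", "scan_id": "old"},          # stale: previous scan
    {"event": "scan_phase", "scan_id": "new"},
    {"event": "scan_file_flagged", "scan_id": "new"},
]
replayed = [e["event"] for e in build_replay(buffer, "new")]
idle = build_replay(buffer, "")  # scan ID cleared after scan_done
```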
Changes — JavaScript (static/app.js):
- Extracted _attachScanListeners(es) and _attachSchedulerListeners(es), shared by both startScan() and _autoConnectSSEIfRunning()
- _attachSchedulerListeners now shows the progress bar on scheduler_started and hides it on scheduler_done / scheduler_error
- SSE polling watchdog (_sseWatchdog) checks /api/scan/status every 4s; reopens the SSE connection via _ensureSSE() if it has died
- _userStartedScan flag: scan_done only closes the SSE connection for user-initiated scans; scheduled scans keep it alive
- Fixed es.onerror handler: no longer silently nulls es
Fixed — File scan scan_complete → scan_done event name
run_file_scan() was broadcasting scan_complete on finish, but the JS only
listens for scan_done. Renamed to scan_done with the same total_scanned /
flagged_count payload shape as M365 scans.
Fixed — Resume scan used wrong profile
startScan() never told the server which profile was active. Settings were
always saved to the Default profile. Now profile_id is sent in the scan start
payload and _save_settings() accepts a profile_id parameter (takes
precedence over profile_name).
Fixed — install_macos.sh launcher scripts
- start_gdpr.sh and build_m365.sh templates now use exec python3 instead of exec python; fixes "not found" after removing python.org interpreter
- spaCy model install: creates a pip shim in venv/bin/ (spaCy's shutil.which("pip") couldn't find the venv's pip3), falls back to direct pip install if spacy download still fails, and prepends venv/bin to PATH explicitly
Added — Diagnostic logging
- [run_scan] prints sources, user count, app_mode, and a sample user entry at scan start; helps verify scheduled scans use the correct profile
- [SSE] console.log messages in the browser for scan_phase, scan_done, scan_file_flagged, scheduler_started, scheduler_done, scheduler_error; aids debugging SSE delivery issues
Added — i18n keys (EN / DA / DE)
- m365_sse_reconnecting: shown when page load detects a running scan
- m365_sse_replay_note: logged after replayed events finish
[1.5.7] — 2026-03-28
Fixed — Missing translations in Settings modal
Several strings in the Settings → General and Settings → Scheduler tabs were displaying in English regardless of the active language.
Missing lang keys added (EN / DA / DE):
- btn_save: Save / Gem / Speichern (used by scheduler editor Save button and others)
- m365_settings_about: About / Om / Über
- m365_settings_save_pin: Save PIN / Gem PIN / PIN speichern
- m365_sched_freq_daily / weekly / monthly: frequency labels in job list and editor
- m365_sched_dow_mon through _sun: day-of-week labels
Template fixes:
- "About" group heading now has data-i18n="m365_settings_about"
- "Save PIN" button uses dedicated key m365_settings_save_pin instead of generic btn_save
- Frequency and day-of-week <option> elements now have data-i18n attributes
- Scheduler job list (schedRenderJobs) and status update now use t() for frequency labels
Changed — Theme toggle replaced with slider
The "Toggle dark / light" text button in Settings → General is replaced with a standard toggle slider (consistent with all other toggles in the UI). The slider reflects the current theme state when the tab opens and toggles the theme on click.
[1.5.6] — 2026-03-28
Feature — SSE event replay (#21)
Opening the browser mid-scan (e.g. while a scheduled scan is running) now replays all buffered events so the live log and result cards populate immediately, rather than showing nothing until the next event fires.
m365_scanner.py:
- Added _sse_buffer: deque = deque(maxlen=500), a ring buffer that stores every broadcast() event
- broadcast() appends to the buffer before sending to SSE clients
- run_scan() clears the buffer at the start of each scan so stale events from the previous scan are not replayed
- Removed duplicate @app.route("/api/scan/stream"); route is now handled exclusively by the routes/scan.py blueprint
routes/scan.py:
- scan_stream() replays _m._sse_buffer immediately when a new client connects, then switches to live events
- All globals accessed directly via import m365_scanner as _m to avoid __getattr__ resolution failures that caused 500 errors
- A : connected comment line is sent first to confirm the stream is flowing
static/app.js:
- _autoConnectSSEIfRunning() — new function called on DOMContentLoaded that always opens the SSE connection on page load. If a scan is already running, buffered events replay immediately. If the buffer is empty, no events fire and the log stays quiet.
- Handles scan_phase, scan_progress, scan_start, scan_file_flagged, scan_done, scheduler_started, scheduler_done, scheduler_error events
- startScan() closes and reopens the SSE connection to get a clean stream for each manual scan
m365_scanner.py — CLI output when no browser connected:
- broadcast() now prints key events to the terminal when _sse_queues is empty (i.e. no browser tab is watching), so scheduled scans are visible in the CLI: scan phases, file progress, errors, and completion summary
[1.5.5] — 2026-03-28
Fixed — Scheduler: multiple bugs after multi-job implementation
scheduler.py renamed to scan_scheduler.py
Python's stdlib includes a sched/scheduler module that was being resolved
instead of the project's own scheduler.py, causing module 'scheduler' has no attribute 'load_jobs'. Renaming the project file to scan_scheduler.py eliminates
the collision entirely. All imports updated in routes/scheduler.py and
m365_scanner.py.
Jobs with missing UUID assigned on load
Jobs saved before the multi-job refactor had "id": "". load_jobs() now detects
any job with a missing or empty id and assigns a fresh UUID, then rewrites the file.
This fixed "Delete failed: id required" and silent edit failures.
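The repair-on-load step can be sketched as below. It assumes the {"jobs": [...]} file layout introduced in 1.5.4; the function body is an illustration of the idea, not the project's actual load_jobs().

```python
import json
import uuid
from pathlib import Path


def load_jobs(path):
    """Load jobs, assigning a fresh UUID to any job whose id is missing or empty."""
    path = Path(path)
    if not path.exists():
        return []
    config = json.loads(path.read_text(encoding="utf-8"))
    jobs = config.get("jobs", [])
    repaired = False
    for job in jobs:
        if not job.get("id"):            # covers both a missing key and "id": ""
            job["id"] = str(uuid.uuid4())
            repaired = True
    if repaired:
        # Rewrite the file once so the new ids stay stable across restarts.
        path.write_text(json.dumps({"jobs": jobs}, indent=2), encoding="utf-8")
    return jobs
```

Writing the file back only when something was repaired keeps the common path read-only.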
Enabled toggle added to each scheduler row
Each job row now has an inline toggle switch instead of a static ✓/— indicator.
Toggling saves the change immediately via /api/scheduler/jobs/save. The job
description also shows "Next: [date]" after the status fetch resolves.
Edit no longer duplicates the job
_sched().reload() inside the save route was not wrapped in its own try/except.
If APScheduler threw (e.g. not yet started), the exception propagated and caused
the save to fall through to the "create new" path. Both reload() calls (save and
delete) are now wrapped in try/except: pass.
Delete button now works
The delete button was passing the HTML-escaped job name through the onclick
attribute — names with apostrophes or special characters broke the JS string.
Fixed by passing only id and looking up the name from _schedJobs inside
schedDeleteJob(). The route and JS both have proper error handling now.
"Not authenticated" on scheduled run
state.connector is assigned once at startup (_state.connector = _connector)
and never updated when the user authenticates later. The scheduler now reads
_m._connector directly from the live m365_scanner module at run time,
guaranteeing it sees the current authenticated connector.
flagged_items and scan_meta reads also updated to use _m.flagged_items
and _m.scan_meta directly.
[1.5.4] — 2026-03-28
Feature — Multiple scheduled scans
The Settings → Scheduler tab now supports multiple independent named scan jobs, replacing the previous single-job form.
scheduler.py
- Config format changed from a single dict to {"jobs": [...]}. Each job has its own id (UUID), name, and all existing fields (frequency, time, profile, auto-email, auto-retention).
- Old single-job ~/.m365_scanner_schedule.json files are automatically migrated to the new format on first load — no manual changes needed.
- ScanScheduler registers one APScheduler job per enabled scan and tracks running state and last-run info per job independently.
- Backward-compat shims (load_schedule_config, save_schedule_config) kept for any existing integrations.
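The single-dict to {"jobs": [...]} migration could look roughly like this. The function name and the default job name are hypothetical; only the target shape comes from the notes above.

```python
import uuid


def migrate_schedule_config(config):
    """Return config in the {"jobs": [...]} shape, wrapping a legacy single-job dict."""
    if isinstance(config.get("jobs"), list):
        return config                          # already in the new format
    if not config:
        return {"jobs": []}                    # nothing configured yet
    job = dict(config)                         # copy; leave the caller's dict untouched
    job.setdefault("id", str(uuid.uuid4()))
    job.setdefault("name", "Scheduled scan")   # hypothetical default name
    return {"jobs": [job]}
```

Because a config that is already in the new shape is returned unchanged, the migration is safe to run on every load.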
routes/scheduler.py — new CRUD endpoints:
- GET /api/scheduler/jobs — list all jobs
- POST /api/scheduler/jobs/save — create or update a job (by id)
- POST /api/scheduler/jobs/delete — delete a job by id
- POST /api/scheduler/jobs/run_now — run a specific job immediately by id
- Old /api/scheduler/config and /api/scheduler/run_now kept as backward-compat shims
templates/index.html — scheduler pane replaced with a job list (styled like
File sources) and an inline editor that slides open when adding or editing. Each
row shows the job name, frequency summary, enabled/running status pill, and
▶ Run / ✏ Edit / ✕ Delete buttons. Schedule configuration lives exclusively in
the editor — nothing schedule-related appears in the sidebar except the existing
"Next: …" indicator.
static/app.js — all sched* functions rewritten for multi-job:
schedLoad, schedRenderJobs, schedAddJob, schedEditJob, schedSaveJob,
schedDeleteJob, schedRunJob, schedCancelEdit, schedLoadHistory,
schedUpdateSidebarIndicator.
Lang keys added: m365_sched_add, m365_sched_name, m365_sched_editor_new,
m365_sched_editor_edit, m365_sched_name_required, m365_sched_no_runs,
btn_cancel (da/en/de).
[1.5.3] — 2026-03-27
Added — Suggestion #19: Scheduled / automatic scans
In-process scheduler using APScheduler so GDPR scans run automatically on a configurable cadence — no cron or Task Scheduler setup required.
Backend:
- New scheduler.py module wrapping APScheduler BackgroundScheduler with a single coalescing job; misfire grace time 1 hour.
- Config stored in ~/.m365_scanner_schedule.json (daily/weekly/monthly, time-of-day, profile selector, auto-email, auto-retention).
- Run history persisted in new schedule_runs DB table (migration #7).
- routes/scheduler.py blueprint — GET/POST /api/scheduler/config, GET /api/scheduler/status, POST /api/scheduler/run_now, GET /api/scheduler/history.
- Scheduler starts automatically on app.run; status printed at boot.
- Scheduled scans reuse the full run_scan() pipeline (checkpoints, delta, broadcast, DB) — identical to interactive scans.
- Auto-email sends the Excel report via Graph or SMTP after each scheduled scan.
- Auto-retention optionally enforces the retention policy on overdue items.
UI:
- Settings → Scheduler tab — enable/disable toggle, frequency picker (daily/weekly/monthly), time-of-day, profile selector, auto-email and auto-retention toggles, status display, run history, "Run now" button.
- Sidebar — 🕐 next-scan indicator near the settings button; click to open scheduler config. Polls every 60 s.
- Scan log — scheduled scans appear with 🕐 prefix via SSE events (scheduler_started, scheduler_done, scheduler_error).
Build / deps:
- APScheduler>=3.10 added to requirements.txt.
- scheduler.py and APScheduler hidden imports added to build_m365.py.
- Schedule config added to --purge cleanup list.
- Lang keys added for DA / EN / DE.
[1.5.2] — 2026-03-27
Fixed — File/SMB scan: image-only PDFs no longer hang the scanner
scan_pdf() in document_scanner launches Tesseract OCR and Poppler subprocesses
when a PDF has no text layer. These subprocesses cannot be killed from a Python thread,
causing the scanner to hang indefinitely on scanned documents (e.g. ESTA applications,
invoice scans).
Fix: Before calling scan_pdf(), _scan_bytes() now opens the PDF with pdfplumber
(pure Python, no subprocesses) and checks whether any page has a text layer using the
existing is_text_page() helper. If all pages are image-only, the file is skipped
immediately with no CPR hits — which is correct, since machine-readable CPR numbers
cannot exist in an image-only PDF.
Text-layer PDFs (the majority) pass the check and are scanned normally. Only image-only PDFs (scanned documents) are skipped.
This replaces multiple failed approaches (ThreadPoolExecutor timeouts,
shutdown(wait=False), extension-based skipping) that either blocked on context manager
exit or removed legitimate file types from scanning.
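The text-layer pre-check from this fix can be sketched as follows. has_text_layer() and pdf_has_text_layer() are hypothetical names; the project's is_text_page() helper may apply different criteria. The only pdfplumber calls used are pdfplumber.open(), pdf.pages, and page.extract_text().

```python
import io


def has_text_layer(pages, min_chars=1):
    """Return True as soon as any page yields extractable text."""
    for page in pages:
        text = page.extract_text() or ""     # treat a None result as empty text
        if len(text.strip()) >= min_chars:
            return True
    return False


def pdf_has_text_layer(content):
    """Check PDF bytes for a text layer without spawning OCR subprocesses."""
    import pdfplumber                         # third-party; imported lazily
    with pdfplumber.open(io.BytesIO(content)) as pdf:
        return has_text_layer(pdf.pages)
```

A caller would skip the file when pdf_has_text_layer() returns False and hand it to the normal PDF scan otherwise. Returning on the first page with text keeps the check cheap for ordinary documents.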
Fixed — SMB scanning: multiple smbprotocol 1.14+ API changes
See v1.5.1 for details. Additional fix in this release:
- smb_host is now auto-derived from the path (//host/share → host) when not explicitly stored in the source JSON, so SMB sources saved without an explicit host field still connect correctly.
Fixed — Routes blueprint: globals resolved lazily to prevent circular imports
Each route blueprint (routes/*.py) now uses Python's module __getattr__ hook to
lazily resolve globals from m365_scanner at call time, not at import time. This
prevents the circular import that caused double blueprint registration on startup.
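A minimal sketch of the module-level __getattr__ hook (PEP 562) is shown below. To stay self-contained it builds two throwaway modules in memory: core_demo stands in for m365_scanner and routes_demo for a blueprint module, both hypothetical names. In a real blueprint file the hook would simply be a top-level def __getattr__(name).

```python
import importlib
import sys
import types

# Stand-in for m365_scanner: a module whose globals change after import.
core = types.ModuleType("core_demo")
core.connector = None
sys.modules["core_demo"] = core

# Stand-in for a routes/*.py blueprint module.
routes = types.ModuleType("routes_demo")


def _lazy_getattr(name):
    """Called only when `name` is not found on the module itself."""
    module = importlib.import_module("core_demo")    # resolved at call time
    try:
        return getattr(module, name)
    except AttributeError:
        raise AttributeError(
            f"module 'routes_demo' has no attribute {name!r}"
        ) from None


# Equivalent to defining __getattr__ at the top level of routes_demo.py.
routes.__getattr__ = _lazy_getattr
```

Note that the hook fires for attribute access on the module object (routes.connector), not for bare global-name lookups inside the module's own functions, so code relying on it has to go through the module object.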
Added — File source Edit button
See v1.5.1.
[1.5.1] — 2026-03-27
Fixed — SMB scanning: multiple smbprotocol 1.14+ API incompatibilities
Several functions in file_scanner.py used deprecated or renamed smbprotocol APIs:
- uuid4_str() removed — Connection() now requires a uuid.UUID object, not a string. Changed to uuid.uuid4() directly; added import uuid at module level.
- RequestedOpcodes removed from smbprotocol.open — was imported but never used; removed.
- FilePipePrinterAccessMask.FILE_LIST_DIRECTORY → DirectoryAccessMask.FILE_LIST_DIRECTORY — directory listing requires DirectoryAccessMask, not the file/pipe mask.
- FileDirectoryInformation moved — from smbprotocol.query_info to smbprotocol.file_info; import updated.
- FileInformationClass enum — query_directory() expects FileInformationClass.FILE_DIRECTORY_INFORMATION (int enum), not a class instance.
- query_directory() kwargs renamed — file_name= → pattern=, output_buffer_length= → max_output=.
- Filename bytes — file_name field now returns UTF-16-LE bytes; decoded to str with error handling.
- smb_host auto-derivation — if smb_host is not explicitly stored in the source JSON, it is now extracted from the path (//host/share → host). is_smb no longer requires smb_host to be pre-set.
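The smb_host auto-derivation amounts to taking the first path segment after the leading slashes. A small sketch, with derive_smb_host() as a hypothetical name and backslash paths accepted as an assumption:

```python
def derive_smb_host(path):
    """Return the host part of a //host/share style path, or None for local paths."""
    normalised = path.replace("\\", "/")     # accept Windows-style UNC paths too
    if not normalised.startswith("//"):
        return None                          # not a network path
    host = normalised[2:].split("/", 1)[0]
    return host or None
```

For example, derive_smb_host("//nas.school.dk/shares") gives "nas.school.dk", while a local path such as "/home/user/docs" gives None.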
Fixed — SMB scanning: junk directories skipped
Added SKIP_DIRS constant — a set of folder names silently skipped in both local and SMB walks:
.recycle .recycler $recycle.bin .trash .trashes
.sync .btsync .syncthing
.git .svn .hg
__pycache__ node_modules
.spotlight-v100 .fseventsd .temporaryitems
system volume information lost+found
Local walker prunes these from _dirs[:] before os.walk descends. SMB walker checks before recursing. Hidden directories (. prefix) are also skipped in both.
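The local pruning step relies on os.walk honouring in-place edits to its directory list. A minimal sketch, using a subset of the names above and assuming a case-insensitive comparison (the real matching rule is not stated here):

```python
import os

# Subset of the skip list, for illustration only.
SKIP_DIRS = {".git", "__pycache__", "node_modules", "$recycle.bin", "lost+found"}


def walk_files(root):
    """Yield file paths under root, never descending into skipped or hidden dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list os.walk holds, so it only
        # descends into the directories that remain.
        dirnames[:] = [
            d for d in dirnames
            if d.lower() not in SKIP_DIRS and not d.startswith(".")
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Rebinding the name (dirnames = [...]) would not work, because os.walk would still iterate its original list.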
STATUS_END_OF_FILE errors (zero-byte placeholder files from Bittorrent Sync, .sync/stream_test.txt etc.) are now silently skipped instead of logged as warnings.
Fixed — SMB/local file scans: OCR disabled, per-file timeout added
PDF scanning via document_scanner.scan_pdf() would trigger Tesseract OCR on image-based PDFs (scanned forms, photos) causing single files to hang for minutes.
_scan_bytes_timeout() — new wrapper around _scan_bytes using ThreadPoolExecutor with a 30-second deadline per file. Timed-out files are logged as errors and scanning continues.
skip_ocr=True — file scan loop now passes skip_ocr=True to _scan_bytes, disabling OCR and reducing DPI to 150. Only the text layer is extracted from PDFs. This is appropriate for bulk compliance scanning where image-only PDFs rarely contain machine-readable CPR numbers.
Added — File source Edit button
Each file source row in ⚙ Sources → File sources now has an ✏ Edit button between Scan and Delete. Clicking it pre-fills the add form with the existing name, path, SMB host, and username (password shown as placeholder dots). Saving with an existing ID updates the source in-place. The Add button label changes to Save changes while editing and reverts on save.
[1.5.0] — 2026-03-27
Refactor — HTML template and JavaScript extracted from m365_scanner.py
m365_scanner.py was a ~9600-line monolith containing HTML, CSS, JavaScript,
and Python all in one string. This made frontend edits unsafe (no linting,
no syntax highlighting, string escaping hazards) and diffs unreadable.
What changed:
- templates/index.html — the full HTML/CSS template (1418 lines), served via Flask's render_template() with two Jinja2 variables: app_version and lang_json
- static/app.js — all JavaScript (2832 lines), served by Flask's built-in static file handler at /static/app.js
- m365_scanner.py — reduced from 9586 to 5334 lines (44% smaller); now contains only Python: business logic, API routes, and configuration
Flask configuration updated:
app = Flask(__name__,
template_folder=os.path.join(BASE_DIR, "templates"),
static_folder=os.path.join(BASE_DIR, "static"))
BASE_DIR resolves to sys._MEIPASS when running as a PyInstaller bundle,
or to the directory containing m365_scanner.py otherwise — the same pattern
already used for lang/, keywords/, and classification/.
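The BASE_DIR resolution described here is commonly written along these lines. resolve_base_dir() is a hypothetical helper; sys._MEIPASS is the attribute PyInstaller's bootloader sets in a bundled app.

```python
import os
import sys


def resolve_base_dir(module_file):
    """Bundle directory under PyInstaller, otherwise the module's own directory."""
    bundle_dir = getattr(sys, "_MEIPASS", None)   # absent in a normal interpreter
    if bundle_dir:
        return bundle_dir
    return os.path.dirname(os.path.abspath(module_file))
```

In the application module this would be called as BASE_DIR = resolve_base_dir(__file__) before the Flask app is constructed.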
Build script updated:
build_m365.py now bundles templates/ and static/ alongside the existing
lang/, keywords/, and classification/ directories.
Zero behaviour change — the app works identically. Only the file organisation changed.
[1.5.0] — 2026-03-27
Refactor — HTML template and JavaScript extracted from m365_scanner.py
m365_scanner.py was a ~9600-line monolith containing HTML, CSS, JavaScript,
and Python all in one string. This made frontend edits unsafe (no linting,
no syntax highlighting, string-escaping hazards) and diffs unreadable.
New files:
- templates/index.html — full HTML/CSS template (1452 lines) served via Flask render_template(). Two Jinja2 variables: {{ app_version }} and {{ lang_json | safe }}.
- static/app.js — all JavaScript (2832 lines) served by Flask's built-in static file handler at /static/app.js.
Flask app updated:
app = Flask(__name__,
template_folder=os.path.join(BASE_DIR, "templates"),
static_folder=os.path.join(BASE_DIR, "static"))
BASE_DIR resolves to sys._MEIPASS when running as a PyInstaller bundle,
or the directory of m365_scanner.py otherwise — the same pattern already
used for lang/, keywords/, and classification/. build_m365.py updated to bundle
both new directories.
Result: m365_scanner.py reduced from 9586 to ~2100 lines of pure Python.
Zero behaviour change.
Refactor — Routes split into Flask Blueprints
All 55 API routes extracted from m365_scanner.py into a routes/ package.
Shared mutable state lives in routes/state.py; blueprints import from there
to avoid circular imports.
routes/
__init__.py package marker
state.py shared globals: connector, flagged_items, LANG, …
auth.py /api/auth/* 174 lines
users.py /api/users/* + role overrides 222 lines
scan.py /api/scan/* + /api/settings/* 123 lines
sources.py /api/file_sources/* + /api/file_scan 93 lines
profiles.py /api/profiles/* 48 lines
email.py /api/smtp/* + /api/send_report 210 lines
database.py /api/db/* + /api/admin/* + preview 536 lines
export.py Excel + Art.30 export + bulk delete 1177 lines
app_routes.py /api/about + /api/langs + /api/lang 67 lines
Housekeeping — Document Scanner files removed
The following files belonged to the standalone Document Scanner product and have been removed from this repository:
- server.py — Document Scanner web app
- scanner_worker.py — Document Scanner process-pool worker
- build.py — Document Scanner build script
- build_app.sh — Document Scanner shell build script
- Dockerfile — Document Scanner Docker image
- docker-compose.yml — Document Scanner Docker Compose file
- doc_scanner_icon.png — Document Scanner app icon
requirements.txt rewritten for the M365 Scanner only. Removed
pdf2image, pytesseract, pypdf, reportlab, img2pdf, and py7zr
(Document Scanner dependencies). Added cryptography>=42.0 (SMTP password
encryption, already in use since v1.4.7).
[1.4.8] — 2026-03-27
Changed — Email: Microsoft Graph API preferred over SMTP
Both the Test and "Send now" actions now try the Microsoft Graph API first when the scanner is authenticated to Microsoft 365. This avoids SMTP AUTH entirely — no port 587, no app password, no admin centre changes needed.
New _send_email_graph() helper — sends via /me/sendMail (delegated mode)
or /users/{sender}/sendMail (app mode). Supports optional Excel attachment for
the full report. Requires the Mail.Send Graph permission on the Azure app
registration (Application or Delegated, depending on auth mode).
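For orientation, the request that a helper like _send_email_graph() has to build can be sketched as below. build_sendmail_request() and its parameters are hypothetical; the payload fields follow the Microsoft Graph sendMail message format, and actually sending it (for example with requests.post and a bearer token) is left out.

```python
import base64

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_sendmail_request(subject, html_body, recipients, sender=None,
                           attachment=None, attachment_name="report.xlsx"):
    """Return (url, json_payload) for a Microsoft Graph sendMail call."""
    message = {
        "subject": subject,
        "body": {"contentType": "HTML", "content": html_body},
        "toRecipients": [{"emailAddress": {"address": a}} for a in recipients],
    }
    if attachment is not None:
        message["attachments"] = [{
            "@odata.type": "#microsoft.graph.fileAttachment",
            "name": attachment_name,
            "contentType": XLSX_MIME,
            "contentBytes": base64.b64encode(attachment).decode("ascii"),
        }]
    # Delegated mode posts as the signed-in user; app mode must name a mailbox.
    if sender:
        url = f"{GRAPH_BASE}/users/{sender}/sendMail"
    else:
        url = f"{GRAPH_BASE}/me/sendMail"
    return url, {"message": message, "saveToSentItems": True}
```

Keeping payload construction separate from the HTTP call makes the delegated and app-mode branches easy to test without network access.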
Priority order:
- Microsoft Graph API — used when connected to M365
- SMTP — fallback if not connected or Graph fails
Error surfacing — Graph permission errors (403 / Forbidden / Mail.Send /
insufficient privileges) are now returned directly with a clear actionable
message: add Mail.Send permission to the Azure app registration and grant
admin consent. Previously the error was silently swallowed and the scanner
fell through to SMTP, masking the real problem.
SMTP AUTH error — if SMTP is used and Microsoft 365 returns error 530 5.7.57 ("Client not authenticated"), the error message now includes a plain-English tip explaining how to enable SMTP AUTH in the M365 admin centre, or how to use Graph instead.
Changed — Test button sends a real email to configured recipients
The SMTP Test button previously only verified connectivity (EHLO/STARTTLS handshake). It now sends an actual HTML test email to the configured recipients, making it easy to verify end-to-end delivery including spam filtering.
[1.4.7] — 2026-03-27
Security — SMTP password encrypted at rest
Previously the SMTP password was stored as plaintext in ~/.m365_scanner_smtp.json.
It is now encrypted using Fernet symmetric encryption (cryptography library,
already a dependency).
Implementation:
- A random Fernet key is generated on first use and saved to ~/.m365_scanner_machine_id (chmod 0o600 — owner read/write only)
- Passwords are stored as enc:<ciphertext> in the JSON file
- _encrypt_password() / _decrypt_password() handle the encode/decode cycle
- _load_smtp_config() transparently decrypts on load; _save_smtp_config() encrypts on save
- Legacy plaintext passwords (no enc: prefix) are read as-is and re-encrypted next time settings are saved — no migration step required
- Encrypted blobs are machine-specific — the ciphertext cannot be decrypted on another machine without the key file
- Graceful fallback to plaintext if cryptography is unavailable (rare)
- The GET /api/smtp/config endpoint never returns the password to the browser; it returns only has_password: true/false
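The enc: prefix scheme and the legacy-plaintext path can be sketched as follows. encode_password(), decode_password(), and fernet_codec() are hypothetical names, not the project's _encrypt_password() / _decrypt_password(); the cipher is passed in as a pair of bytes-to-bytes callables so the prefix logic stays independent of the cryptography package.

```python
ENC_PREFIX = "enc:"


def encode_password(plain, encrypt):
    """Encrypt a password and mark the stored value with the enc: prefix."""
    token = encrypt(plain.encode("utf-8"))
    return ENC_PREFIX + token.decode("ascii")


def decode_password(stored, decrypt):
    """Decrypt an enc: value; return legacy plaintext values unchanged."""
    if not stored.startswith(ENC_PREFIX):
        return stored                          # legacy plaintext, read as-is
    token = stored[len(ENC_PREFIX):].encode("ascii")
    return decrypt(token).decode("utf-8")


def fernet_codec(key):
    """Build the (encrypt, decrypt) pair from a Fernet key."""
    from cryptography.fernet import Fernet     # third-party; imported lazily
    fernet = Fernet(key)
    return fernet.encrypt, fernet.decrypt
```

Because decode_password() passes unprefixed values through, an old config file keeps working and is upgraded the next time the settings are saved through encode_password().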
Fixed — EXIF has_pii false positives on screenshots
_EXIF_PII_TAGS previously included HostComputer, DocumentName, and PageName.
These are set automatically by macOS/Windows on every screenshot (machine name, app
name) and contain no personal data about an individual. Removed from the tag set.
Minimum content length of 3 characters added — a field must contain at least 3
non-whitespace characters to trigger a has_pii flag. Prevents empty or
single-character values from causing false positives.
Affected fields retained: Artist, Copyright, ImageDescription,
UserComment, XPAuthor, XPSubject, XPComment, XPKeywords — all fields
a human would deliberately fill with personal information.
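The two rules above (a reduced tag set plus a 3-character minimum) can be sketched like this. It assumes the EXIF data has already been turned into a dict keyed by tag name; exif_has_pii() is a hypothetical name and the byte decoding is deliberately simplified.

```python
EXIF_PII_TAGS = {
    "Artist", "Copyright", "ImageDescription", "UserComment",
    "XPAuthor", "XPSubject", "XPComment", "XPKeywords",
}
MIN_CONTENT_CHARS = 3


def exif_has_pii(exif):
    """True if any retained tag holds at least 3 non-whitespace characters."""
    for tag, value in exif.items():
        if tag not in EXIF_PII_TAGS:
            continue                           # e.g. HostComputer is ignored
        if isinstance(value, bytes):
            text = value.decode("utf-8", "ignore")   # simplified decoding
        else:
            text = str(value)
        if len("".join(text.split())) >= MIN_CONTENT_CHARS:
            return True
    return False
```

With this check a screenshot carrying only HostComputer is not flagged, while an image whose Artist tag contains a name is.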
Fixed — Accounts section not greyed out when switching to a file-only profile
_applyProfile() restores source checkboxes but did not call
_updateAccountsVisibility() afterwards. Switching to a profile with no M365
sources selected left the accounts section fully interactive. Fixed by calling
_updateAccountsVisibility() immediately after the checkbox restore loop.
[1.4.6] — 2026-03-27
Changed — Excel export updated for EXIF, GPS, and file sources
New columns in all source sheets:
- GPS — ✔ tick when GPS coordinates are present in the item's EXIF data
- EXIF author — author/artist name extracted from EXIF metadata
- Special category column now filters out gps_location and exif_pii (represented by the dedicated GPS column instead)
New source types in SOURCE_MAP:
- local — 📁 Local (green tab), for files from local folder scans
- smb — 🌐 Network (blue tab), for files from SMB/CIFS network shares
- Both get their own sheet when results exist; skipped silently if empty
Summary sheet:
- Row 4: "Items with GPS data" count (shown only when non-zero)
- Summary table shifted to row 7 to accommodate (was row 6)
- Source rows now skipped when a source has zero items
New GPS locations sheet:
- Teal tab — created only when GPS items exist
- Columns: Name, Latitude, Longitude, Maps link (blue hyperlink), Account, Date Modified
- Auto-filter enabled; alternating row colours
Bug fix: dead old function body (164 lines after the return) removed — the previous str_replace only replaced the docstring, leaving unreachable code in the file.
[1.4.5] — 2026-03-26
Fixed — _detect_photo_faces missing after EXIF insertion
The str_replace that added _extract_exif() accidentally consumed the
def _detect_photo_faces function definition (it was part of the replaced
string). All image scans raised NameError: name '_detect_photo_faces' is not defined. Function restored at its original position before _scan_bytes().
Fixed — Progress bar shows "undefined / undefined" during file scans
The M365 scan_progress SSE event sends {index, total, pct, file, eta}.
The file scanner sent only {scanned, flagged}. The JS handler blindly read
d.index and d.total, producing undefined / undefined.
Fixes:
- run_file_scan() now broadcasts {scanned, flagged, file, pct} so the current filename and a progress indicator are shown while scanning.
- The scan_progress JS handler now checks which fields are present and renders accordingly: index / total for M365 scans, N · M flagged for file scans.
Fixed — Local file preview: PDF, XLSX, DOCX now render content
/api/preview/<id> for source_type=local previously showed only a metadata
placeholder for PDF and Office files. Now:
| Type | Preview |
|---|---|
| PDF | First 5 pages extracted via pdfplumber, CPR numbers highlighted in red |
| XLSX / XLSM | First 50 rows of up to 3 sheets as a styled table |
| CSV | First 50 rows as a table |
| DOCX / DOC | First 80 paragraphs as text, CPR numbers highlighted |
All fall back to a metadata card if the library is unavailable or the file
cannot be parsed. document_scanner (already imported) provides access to
pdfplumber and openpyxl.
[1.4.4] — 2026-03-26
Added — #18 EXIF metadata extraction from images
New function _extract_exif(content, filename) — extracts structured EXIF data from JPEG, PNG, TIFF, WEBP, and HEIC images using Pillow (already a dependency). No new packages required.
Extracted fields:
- GPS coordinates — converted from DMS rational values to decimal degrees; Google Maps link generated
- Author / Artist / Copyright / Description / UserComment / Keywords — checked for PII content
- Device — camera make and model
- Datetime — DateTimeOriginal or DateTime
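The DMS-to-decimal conversion for GPS coordinates follows the usual formula: degrees + minutes/60 + seconds/3600, negated for southern and western hemispheres. In this sketch the function names and the exact Maps URL format are assumptions; each DMS component only needs to be convertible with float(), which holds for fractions and for Pillow's rational values.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus an N/S/E/W reference to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West are negative by convention.
    return -value if ref.upper() in ("S", "W") else value


def maps_link(lat, lon):
    """Build a Google Maps query link (URL format assumed for illustration)."""
    return f"https://www.google.com/maps?q={lat:.6f},{lon:.6f}"
```

For instance, 12° 30' 0" E converts to 12.5, and the same triple with reference W converts to -12.5.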
Behaviour changes:
- EXIF extraction runs on all scanned images regardless of the "Scan photos" toggle — it is lightweight (no CV processing) and always relevant
- Images with GPS or PII-bearing EXIF fields are flagged even without CPR hits
- special_category gains "gps_location" and/or "exif_pii" entries as appropriate
- Face detection (_detect_photo_faces) still requires the "🖼 Scan photos for faces" opt-in
UI:
- 🌍 GPS badge — teal pill on result cards (grid and list view) when GPS coordinates are present
- Preview panel — local image previews now show a collapsible "EXIF data" section beneath the image with GPS (clickable Google Maps link), author, date, device, and any other PII-bearing fields
Applies to both M365 and file system scans — OneDrive/SharePoint images and local/SMB files go through the same extraction path.
[1.4.3] — 2026-03-26
Added — General Settings modal
Three sidebar sections (✉ Email report, 🗄 Database, and the language selector + About link) have been removed from the sidebar and consolidated into a single ⚙ Settings modal, opened via a button in the sidebar footer.
General tab — language selector (mirrors the hidden langSelect), theme toggle, and About info (version, Python, MSAL, Requests, openpyxl versions).
Email report tab — full SMTP configuration (host, port, username, password, from address, STARTTLS, recipients), Save, and Send now. Pre-fills from saved config. openSmtpModal() now redirects to this tab for backward compatibility.
Database tab — DB stats (total items, flagged items, scan count), ⬇ Export, ⬆ Import, and 🗑 Reset DB. exportDB() and openImportDBModal() work unchanged.
🔍 Data subject lookup remains as a sidebar shortcut since it is part of the active compliance workflow.
[1.4.2] — 2026-03-26
Added — Dynamic sources panel in sidebar
The sidebar sources panel is now fully dynamic. Previously the four M365 sources (Email, OneDrive, SharePoint, Teams) were hardcoded checkboxes. Now:
- renderSourcesPanel() builds the list at runtime from _M365_SOURCES (the four fixed M365 entries) and _fileSources (saved local/SMB sources). A "File sources" group header appears automatically when any file sources are configured.
- Per-source visibility toggles in the ⚙ Sources modal (Microsoft 365 tab) control which M365 sources appear in the panel. Toggling one off removes it from the panel immediately.
- File sources added in the Sources modal appear as checkboxes in the panel alongside the M365 sources, with 📁 (local) or 🌐 (SMB) icons.
- The panel shows up to 5 rows before scrolling (max-height: calc(5 * 26px)).
- Profile save/restore — file source selections are now included when saving a profile. buildScanPayload() merges M365 and file source IDs into allSources; _applyProfile() restores all of them. A _pendingProfileSources mechanism handles the async case where file sources load after the profile is applied.
Added — Hint tooltips on Delta scan, Scan photos, Retention policy toggles
Each of the three advanced option toggles now has a circled ? icon to the right of the label. Clicking it shows a speech bubble (fixed-positioned, z-index: 9999) with the hint text, positioned to the right of the icon and visible above the main content area. Only one bubble can be open at a time; clicking anywhere outside closes it.
Changed — ⚙ Profiles button moved to topbar
The accent-coloured ⚙ Profiles button was removed from the Database section in the sidebar. A plain ⚙ Profiles button (matching the style of ⚙ Sources) now appears to the right of the 💾 save button in the topbar profile bar.
Changed — App mode badge (modeBadge) removed
The modeBadge button and userBar div have been removed from the sidebar. Connection status and mode (App / Delegated) are now shown exclusively in the Sources modal (Microsoft 365 tab) — connection info row with green/grey status dot, display name, email, and mode label.
Fixed — Sources modal: credentials pre-filled from saved config
smRefreshStatus() now calls /api/auth/status (correct endpoint) and pre-fills Client ID, Tenant ID, and Client Secret fields from the saved config. Connects via /api/auth/config + /api/auth/start; disconnects via /api/auth/signout + signOut().
Fixed — File source naming: Name field required; auto-suggest from path
The "Label" field renamed to "Name" and marked required (red asterisk). fsrcAutoName() suggests a name as the user types the path — last path segment for local paths, host / share for SMB paths. The user's own name is never overwritten once typed.
Fixed — Sources panel fixed height with scroll
#sourcesPanel in the sidebar now has max-height: calc(5 * 26px); overflow-y: auto so it shows exactly 5 rows before scrolling, regardless of how many sources are configured.
Fixed — Fiscal year end dropdown alignment
The "Fiscal year end" label and select were previously side-by-side, causing the label to wrap on long translations (e.g. "Regnskabsårs afslutning"). Now stacked vertically (flex-direction: column) with width: 100% on the select.
Fixed — ⚙ cog size inconsistency between Sources and Profiles buttons
Both buttons previously used ⚙️ (U+2699 + variation selector U+FE0F), which can render at emoji size rather than text size. Replaced with plain ⚙ (U+2699) in both so they render at identical size.
Fixed — MB label removed from max attachment size picker
The "MB" text span to the right of the attachment size number input has been removed.
Fixed — File source selections included in profiles
buildScanPayload() now collects both M365 and file source IDs and merges them into allSources, which is saved as profile.sources. Previously only M365 source IDs were saved.
[1.4.1] — 2026-03-26
Added — #17 Unified source management modal
Replaced the fragmented sidebar source configuration with a single ⚙️ Sources button above the sources panel. This opens a tabbed modal:
Microsoft 365 tab: Azure credentials (Client ID, Tenant ID, Client Secret) moved from the auth screen into the modal — can be updated or cleared post-connect. Per-source toggles (Email, OneDrive, SharePoint, Teams) control which sources appear in the sidebar panel. Disconnect button signs out without leaving the page.
Google Workspace tab: Stub with "Coming soon" — placeholder for Gmail and Google Drive when implemented.
File sources tab: Full file source management (list, add, delete, scan) moved from the standalone "📁 File sources" sidebar row into this tab. The separate sidebar row is removed.
Sidebar change: The "📁 File sources" sidebar section is removed. The sources panel now has a compact ⚙️ Sources button in its header row. The panel itself respects the per-source visibility toggles set in the modal — if a user disables OneDrive, it disappears from the panel immediately.
Backward compatibility: openFileSourcesModal() redirects to openSourcesMgmt('files') so any existing call sites continue to work.
[1.4.0] — 2026-03-26
Added — #8 File system scanning (local folders and SMB/CIFS network shares)
New file: file_scanner.py — unified local + network file iterator.
FileScanner.iter_files() yields (relative_path, bytes, metadata) regardless
of whether the source is a local path or a network share. All CPR scanning, card
streaming, and DB persistence stay in m365_scanner.py — file_scanner.py only
handles how files are accessed.
Local scanning uses os.walk() on any path (workstation, USB drive, or
already-mounted network share). SMB/CIFS scanning uses smbprotocol directly
without requiring a mount — supports SMB2/3 with NTLM or domain credentials.
smbprotocol is optional: if not installed, the scanner falls back to local-only
mode with a logged warning.
Credential storage priority (SMB):
- OS keychain via keyring (recommended — password never touches the filesystem)
- NAS_PASSWORD environment variable
- .env file (chmod 600) via python-dotenv
All three optional dependencies (smbprotocol, keyring, python-dotenv) are added
to requirements.txt as opt-in extras.
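The credential lookup order can be sketched as a simple fall-through. This is an illustration under several assumptions: resolve_smb_password() and its parameters are hypothetical, the .env file is assumed to use the same NAS_PASSWORD key, and a tiny KEY=VALUE parser stands in for python-dotenv. keyring.get_password(service, username) is the real keyring call used when no lookup function is injected.

```python
import os


def _read_dotenv(path):
    """Tiny KEY=VALUE parser, a stand-in for python-dotenv."""
    values = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip().strip("'\"")
    except OSError:
        pass                                   # a missing .env is not an error
    return values


def resolve_smb_password(service, username, keychain_get=None,
                         env=None, dotenv_path=".env"):
    """Try the OS keychain, then NAS_PASSWORD, then the .env file."""
    if keychain_get is None:
        try:
            import keyring                     # optional dependency
            keychain_get = keyring.get_password
        except ImportError:
            keychain_get = lambda _service, _user: None
    env = os.environ if env is None else env
    try:
        password = keychain_get(service, username)
    except Exception:
        password = None                        # e.g. no keychain backend available
    return (password
            or env.get("NAS_PASSWORD")
            or _read_dotenv(dotenv_path).get("NAS_PASSWORD"))
```

The function returns None when no source yields a password, leaving it to the caller to decide whether to prompt or fail.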
Results write to the same SQLite DB as M365 items with
source_type = "local" or "smb", so the Article 30 report and data subject
lookup cover all sources in a single view. File/network cards use 📁 and 🌐
source badges respectively.
UI — 📁 File sources sidebar section:
- Manage button → opens the File Sources modal
- Add source form — label, path; SMB fields (host, user, password) appear automatically when the path starts with // or \; host is auto-filled from the path
- Per-source ▶ Scan button — starts a scan immediately; results stream into the main grid via SSE exactly like an M365 scan
- Delete — removes a source definition (does not affect scan results already in the DB)
- Sources persist in ~/.m365_scanner_file_sources.json
New API routes:
| Route | Method | Description |
|---|---|---|
| /api/file_sources | GET | List all file source definitions |
| /api/file_sources/save | POST | Add or update a source |
| /api/file_sources/delete | POST | Remove a source by id |
| /api/file_sources/store_creds | POST | Store SMB password in OS keychain |
| /api/file_scan/start | POST | Start a file scan (non-blocking) |
New CLI flags:
# Scan a local folder
python m365_scanner.py --scan-path ~/Documents
# Scan an SMB share (password from OS keychain)
python m365_scanner.py --scan-path //nas.school.dk/shares \
--smb-user "DOMAIN\\henrik" --smb-keychain-key gdpr-scanner-nas
# One-time credential storage
python m365_scanner.py --smb-store-creds --smb-host nas.school.dk \
--smb-user "DOMAIN\\henrik"
# With photo scanning and file size limit
python m365_scanner.py --scan-path //nas/staff --scan-photos --max-file-mb 100
build_m365.py — file_scanner.py added to PyInstaller datas bundle.
[1.3.11] — 2026-03-26
Fixed — Face detection: excessive false positives on background elements
Haar cascade detection with minNeighbors=5 and min_size=40px was triggering
on background textures, bottle labels, artwork, and out-of-focus persons,
reporting up to 16 faces for a photo containing 1–2 actual subjects.
Changes in _detect_photo_faces() (m365_scanner.py):
- min_size raised 40 → 80 px — eliminates detections on small background features; out-of-focus background persons and objects are too small in pixels to exceed this threshold
- minNeighbors raised 5 → 8 — each candidate region must be confirmed by 8 overlapping scale-pyramid detections instead of 5; random texture patterns rarely survive this many confirmations
If over-detection persists on a specific image, minNeighbors=10 and
min_size=100 are reasonable next steps before genuine faces are missed.
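As an illustration of where these two thresholds are applied, here is a minimal sketch. It assumes `cascade` is an OpenCV `cv2.CascadeClassifier` already loaded by the caller and `gray_image` is a greyscale array. The function name `count_faces` and the `scaleFactor` value are assumptions, not taken from the scanner.

```python
def count_faces(gray_image, cascade, min_neighbors=8, min_size=80):
    """Count detections from an OpenCV-style cascade using stricter thresholds.

    `cascade` is expected to provide detectMultiScale(), as
    cv2.CascadeClassifier does. The scaleFactor of 1.1 is an assumed value.
    """
    faces = cascade.detectMultiScale(
        gray_image,
        scaleFactor=1.1,
        minNeighbors=min_neighbors,    # confirmations required per candidate
        minSize=(min_size, min_size),  # ignore regions smaller than this, in px
    )
    return len(faces)
```

Passing `min_neighbors=10, min_size=100` would correspond to the stricter settings suggested above.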
Fixed — Result cards: replaced 👤 + separate role-pill with unified role icon
The account-pill (showing the owner's display name) previously prepended a
static 👤 via CSS ::before and rendered a separate role-pill span
(🎓/👔) alongside it. Both elements have been merged: the account-pill now
prefixes the display name directly with the role icon — 🎓 name for
students, 👔 name for staff, 👤 name for unclassified — removing the
redundant separate badge and saving horizontal space in both grid and list view.
[1.3.10] — 2026-03-26
Changed — Role classification: fragment-first, ID-second
Motivation: Microsoft has reissued new UUIDs for the same licence multiple
times over the past 5–6 years (EA → A1/A3/A5 → new commerce/CSP → benefit
variants). skuPartNumber strings like STANDARDWOFFPACK_FACULTY have been
stable across all those generations while UUIDs change with every new issuance.
New classify_user_role() order:
- Fragment match on skuPartNumber (runs first when sku_map available) — staff fragments checked before student across all licences, so a STUDENT_BENEFIT add-on cannot mask a FACULTY licence.
- SKU ID lookup from m365_skus.json — fallback when sku_map is empty or when a licence has no recognisable fragment (e.g. Power Automate Free assigned to faculty).
Any future Microsoft SKU re-issuance is classified correctly without updating m365_skus.json, as long as the part number still contains FACULTY or STUDENT.
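The fragment-first, ID-second order can be sketched as below. This is an illustration under stated assumptions, not the code in m365_connector.py: the function name, its parameters, and the single-fragment tuples are hypothetical.

```python
STAFF_FRAGMENTS = ("FACULTY",)    # assumed fragment list
STUDENT_FRAGMENTS = ("STUDENT",)  # assumed fragment list


def classify_role_sketch(sku_ids, sku_map, staff_ids, student_ids):
    """Classify a user from licence SKU IDs: fragments first, then ID lookup.

    sku_map maps SKU ID -> skuPartNumber; staff_ids and student_ids are the
    ID sets loaded from the SKU file.
    """
    part_numbers = [sku_map.get(sku, "").upper() for sku in sku_ids]

    # Step 1: fragment match, staff before student, across ALL licences.
    if any(frag in part for part in part_numbers for frag in STAFF_FRAGMENTS):
        return "staff"
    if any(frag in part for part in part_numbers for frag in STUDENT_FRAGMENTS):
        return "student"

    # Step 2: fall back to the SKU ID sets, again staff before student.
    if any(sku in staff_ids for sku in sku_ids):
        return "staff"
    if any(sku in student_ids for sku in sku_ids):
        return "student"
    return "other"
```

Checking every licence for staff fragments before looking at student fragments is what keeps a student add-on from masking a faculty licence.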
Fixed — m365_skus.json: added two missing faculty SKUs
- c2273bd0-dff7-4215-9ef5-2c7bcfb06425 — Microsoft 365 Apps for Faculty (primary licence at Gudenåskolen, absent from all previous versions)
- f30db892-07e9-47e9-837c-80727f46fd3d — relabelled Microsoft Power Automate Free (assigned to faculty)
[1.3.9] — 2026-03-26
Fixed — m365_skus.json not deployed; build_sku_map_from_users sampled wrong users
File missing: m365_skus.json was never copied into classification/ on disk. _load_sku_data() fell back to empty sets (staff_ids_count: 0). Students still classified via STUDENT fragment; staff always "other". Fix: file now shipped. Place in GDPRScanner/classification/m365_skus.json.
Wrong sample: build_sku_map_from_users took the first 20 alphabetical users — all students at Gudenåskolen — so it never fetched a staff part number. Fixed to sample evenly across the full list and always include the last 5 users.
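One way to sample evenly and still include the tail of the list is sketched below. The function name `sample_users` is hypothetical, and treating the last 5 users as part of a 20-user budget is an assumption.

```python
def sample_users(users, sample_size=20, tail=5):
    """Pick users spread evenly over the list, always keeping the last `tail`.

    Returns the whole list when it is no longer than sample_size.
    """
    if len(users) <= sample_size:
        return list(users)
    spread = sample_size - tail
    # Evenly spaced indices over the full list (integer arithmetic only).
    picked = {i * len(users) // spread for i in range(spread)}
    # Always include the final `tail` users.
    picked.update(range(len(users) - tail, len(users)))
    return [users[i] for i in sorted(picked)]
```

With an alphabetically sorted list this draws from the whole range of names instead of only the first entries.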
[1.3.8] — 2026-03-26
Fixed — m365_skus.json not found in PyInstaller bundle; 🔍 SKU debug modal
_SKU_FILE = Path(__file__).parent / ... evaluated at class-definition time, before sys._MEIPASS is set in a frozen build. Replaced with _sku_file_path() classmethod that checks _MEIPASS at call time.
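The difference between resolving the path at class-definition time and at call time can be illustrated with a small sketch. The function name, the `source_dir` parameter, and the `classification/` sub-path are assumptions made for the example.

```python
import sys
from pathlib import Path


def sku_file_path(source_dir: Path) -> Path:
    """Resolve the SKU file location each time it is needed.

    sys._MEIPASS is set by PyInstaller in a frozen build. Reading it inside
    the function means it is looked up at call time, not at import time.
    """
    bundle_root = getattr(sys, "_MEIPASS", None)
    root = Path(bundle_root) if bundle_root else source_dir
    return root / "classification" / "m365_skus.json"
```

In a normal (non-frozen) run the caller would pass something like `Path(__file__).parent` as `source_dir`.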
Added 🔍 SKU debug button to the accounts panel role-filter row. Opens a modal showing every tenant SKU ID colour-coded as 🎓 student / 👔 staff / ❓ unknown, with selectable text for pasting unknowns into m365_skus.json.
/api/users/license_debug extended: now returns student_ids, staff_ids, runtime block (set sizes, fragment lists, file path, sku_map entry count), and per-licence in_staff/in_student/frag_staff/frag_student trace for every user — sufficient to diagnose any classification failure without reading server logs.
[1.3.7] — 2026-03-26
Fixed — license_debug extended for full runtime diagnostics
/api/users/license_debug rewritten to expose all runtime state: staff_ids_count, student_ids_count, fragment lists, sku_file_path, sku_map_entries, and a step-by-step per-licence classification trace for every user (in_staff, in_student, frag_staff, frag_student, skuName).
[1.3.6] — 2026-03-26
Fixed — Staff misclassified as student: two-pass classify_user_role
Root cause: f30db892-07e9-47e9-837c-80727f46fd3d is a Microsoft Student
Use Benefit add-on that Microsoft automatically assigns alongside faculty
licences in Education tenants. Its skuPartNumber contains "STUDENT". Because
the old single-pass loop checked student and staff in per-licence order, the
fragment match on this add-on fired before the authoritative faculty ID
(94763226) was ever reached, returning "student" instead of "staff".
Fix — classify_user_role() now uses a strict two-pass approach:
Pass 1 — authoritative ID match (m365_skus.json), staff before student: All licences are scanned for staff IDs first, then student IDs. A single faculty SKU ID anywhere in the licence list wins regardless of what other add-on licences appear before it.
Pass 2 — skuPartNumber fragment match, staff before student:
Only reached if no ID match was found. Staff fragments are checked across every
licence before student fragments — preventing a STUDENT_BENEFIT add-on from
masking a FACULTY licence later in the list.
Result: A staff member holding [STUDENT_BENEFIT_ADDON, FACULTY_A1, STUDENT_DEVICE]
is now correctly classified as "staff" in all cases, whether sku_map is
populated or not.
[1.3.5] — 2026-03-26
Fixed — Staff not recognised: always merge per-user SKU map
Root cause: build_sku_map_from_users() (which calls /users/{id}/licenseDetails
for up to 20 sampled users) was only called when sku_map was completely empty.
In practice get_subscribed_skus() tier 2 (/me/licenseDetails) always succeeds
in delegated mode, returning the signed-in admin's own license — making sku_map
non-empty and silently skipping the per-user sampling.
If the admin's license happened to be a faculty A1 and other staff held A3 or an
unlisted variant, those A3 users were never added to sku_map and fragment
matching could not fire for them, leaving them as "other".
Fix: build_sku_map_from_users() is now always called and its results
merged into sku_map, regardless of whether get_subscribed_skus() already
returned entries. This guarantees that every distinct SKU ID actually in use by
any of the first 20 users gets a skuPartNumber entry, enabling fragment matching
for all staff variants — including those not yet listed in m365_skus.json.
Same merge applied in license_debug so the 🔍 modal also sees complete data.
[1.3.4] — 2026-03-26
Fixed — Role classification: three-tier SKU map fallback
Root cause: get_subscribed_skus() requires Directory.Read.All or
Organization.Read.All. If the Azure app registration does not have that
permission (typical delegated/device-code setups), it silently returned {}
and the fragment fallback never ran, leaving every user as "other".
Fix — get_subscribed_skus() now tries three endpoints in order:
| Tier | Endpoint | Permission needed |
|---|---|---|
| 1 | /subscribedSkus | Directory.Read.All (admin) |
| 2 | /me/licenseDetails | User.Read only |
| 3 | build_sku_map_from_users() via /users/{id}/licenseDetails (up to 20 users) | User.Read.All |
Each tier logs how many SKU entries it found. Tier 2 always works in delegated mode and covers the signed-in user's licenses. Tier 3 covers the distinct SKUs held by a sample of up to 20 users. If any tier returns results, the remaining tiers are skipped.
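The tiered fallback can be sketched generically as follows. The helper name `first_non_empty` is hypothetical, and each tier is represented by a plain callable, not a real Graph request. Treating an exception as an empty result is also an assumption.

```python
def first_non_empty(tiers):
    """Try each (name, fetch) pair in order and return the first non-empty map.

    A tier that raises (for example on a missing permission) is logged and
    skipped. Later tiers are not called once one has returned results.
    """
    for name, fetch in tiers:
        try:
            result = fetch()
        except Exception as exc:  # broad on purpose: any failure means "try next"
            print(f"{name}: failed ({exc})")
            continue
        print(f"{name}: {len(result)} SKU entries")
        if result:
            return result
    return {}
```

A caller would pass three callables in the order shown in the table, for example `first_non_empty([("tier 1", fetch_subscribed), ("tier 2", fetch_me), ("tier 3", fetch_sampled)])`, where the three fetch functions are placeholders.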
UI warning banner — when every fetched user resolves to "other", a red
banner appears above the accounts list: "No users classified — click 🔍 to
diagnose." It disappears automatically once classification succeeds.
[1.3.3] — 2026-03-26
Fixed — Role classification: SKU debug modal + path resolution
Problem: Even with classification/m365_skus.json loading correctly, users showed as
unclassified because the tenant's actual SKU IDs were not in the file. There was
no easy way to discover which IDs to add.
Changes:
- 🔍 SKU debug button — a small magnifying-glass button added to the role filter row (next to 🎓 Elev). Clicking it opens a modal that calls GET /api/users/license_debug and lists every unique SKU ID in the tenant, colour-coded: 🎓 student / 👔 staff / ❓ unknown. Unknown IDs can be selected and copied directly into classification/m365_skus.json.
- /api/users/license_debug extended — now also returns student_ids and staff_ids arrays from the loaded SKU file so the frontend can mark each tenant SKU as known or unknown without a second round-trip.
- _sku_file_path() classmethod — replaced the static _SKU_FILE class attribute with a method that checks sys._MEIPASS first (PyInstaller bundle) then falls back to Path(__file__).parent / "skus" / "m365_skus.json". The static attribute evaluated at class-definition time before _MEIPASS was set, causing the frozen app to look in the wrong directory.
- Server-side warning — GET /api/users now logs a WARNING to stdout when 0 out of N users are classified, including a sample of the unrecognised SKU IDs seen in the first 20 users.
- Translated — EN / DA / DE (3 new keys)
[1.3.2] — 2026-03-26
Fixed — Student/Staff misclassification: incomplete SKU lists + no override (#1.3.2)
Root cause: The hardcoded SKU lists introduced in v1.0.0 covered only ~8 student
and 6 staff SKUs. Microsoft publishes 100+ Education SKU IDs; any tenant using a SKU
not in those lists silently fell through to "other", leaving users unclassified
or relying solely on the skuPartNumber fragment fallback — which itself was too
specific (STANDARDWOFFPACK_STUDENT instead of just STUDENT).
m365_connector.py — Expanded SKU lists and broader fragment matching
Student set expanded from 8 → 12 SKUs:
- Added 46c119d4 (M365 A1 for Students — student use benefit)
- Added 8fc2205d (O365 A5 for Students)
- Added 160d616a (O365 A3 for Students device)
- Added a4e376bd (M365 A1 for Students new commerce)
Staff set expanded from 6 → 9 SKUs:
- Added 2d61d025 (M365 A1 for Faculty — faculty use benefit)
- Added 15b1d32e (O365 A3 for Faculty device)
- Added ba04c29e (M365 A1 for Faculty new commerce)
Fragment patterns broadened — "STUDENT" and "FACULTY" now catch all
part-number variants (_STUDENT, STUDENT_, STUDENT_BENEFIT, _FAC, etc.)
without needing to enumerate every Microsoft naming permutation.
m365_scanner.py — Manual role overrides
Because no SKU list can ever be complete, admins can now correct individual users directly from the accounts panel:
- 🎓/👔/❓ role badge on every user row — click to cycle: auto → student → staff → other → (clear, back to auto)
- Overridden rows show the badge in accent colour with a ✎ indicator
- Overrides persisted to ~/.m365_scanner_role_overrides.json — survive restarts and re-authentication
- Applied at both display time (/api/users) and scan time (_user_role_map) so card badges, filter buttons, Excel Role column, and Article 30 inventory split all reflect the corrected role
- GET /api/users/role_override — returns all current overrides
- POST /api/users/role_override — sets or clears one override
- Override file added to --purge file list
- Translated — EN / DA / DE (3 new keys)
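The click-to-cycle behaviour and the way an override takes precedence over the detected role could be modelled as in the sketch below. All names are hypothetical, and the override store is assumed to be a plain mapping from user ID to role.

```python
CYCLE = ["auto", "student", "staff", "other"]  # "auto" means no override


def next_role(current):
    """Return the next state in the badge click cycle."""
    return CYCLE[(CYCLE.index(current) + 1) % len(CYCLE)]


def apply_override(overrides, user_id, role):
    """Return a new overrides mapping with one entry set or cleared."""
    updated = dict(overrides)
    if role == "auto":
        updated.pop(user_id, None)  # clearing falls back to auto detection
    else:
        updated[user_id] = role
    return updated


def effective_role(user_id, detected_role, overrides):
    """An override, when present, wins over the licence-based classification."""
    return overrides.get(user_id, detected_role)
```

Using the same `effective_role` lookup at display time and at scan time is what keeps badges, filters, and exports consistent.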
[1.3.1] — 2026-03-26
Fixed — Student/Staff role misclassification (m365_connector.py)
Two SKU ID collisions in _STUDENT_SKU_IDS / _STAFF_SKU_IDS caused Faculty
users to be shown as Students (and vice versa) for any tenant using A5 or A3
Education licenses:
| SKU ID | Correct role | Bug |
|---|---|---|
| e578b273-6db4-4691-bba0-8d691f4da603 | Staff (M365 Education A5 for Faculty) | Was also in _STUDENT_SKU_IDS as "O365 A5 for Students" — Faculty A5 users always showed as 🎓 Student |
| 78e66a63-337a-4a9a-8959-41c6654dfb56 | Student (Office 365 A3 for Students) | Was also in _STAFF_SKU_IDS as "M365 A1 for Faculty (device)" — this had no effect because student is checked first, but the comment was wrong and the duplicate entry was confusing |
classify_user_role() checks student first, so any overlap resolves to student,
silently misclassifying all affected Faculty accounts.
Fix: removed e578b273 from _STUDENT_SKU_IDS and 78e66a63 from
_STAFF_SKU_IDS. Also removed a stale duplicate of e578b273 that appeared
twice in _STAFF_SKU_IDS. Added a RuntimeWarning guard inside
classify_user_role() that logs any future collision between the two sets.
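A collision guard of this kind can be written with the standard `warnings` module. The sketch below is an assumption about its shape, and the function name `warn_on_sku_collision` is hypothetical.

```python
import warnings


def warn_on_sku_collision(student_ids, staff_ids):
    """Emit a RuntimeWarning if any SKU ID appears in both sets.

    Returns the overlapping IDs so callers can also log or inspect them.
    """
    overlap = set(student_ids) & set(staff_ids)
    if overlap:
        warnings.warn(
            f"SKU IDs listed as both student and staff: {sorted(overlap)}",
            RuntimeWarning,
            stacklevel=2,
        )
    return overlap
```

Because the check is a set intersection, it costs very little to run on every classification call.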
Impact: Article 30 staff/student inventory split, role filter buttons (👔 / 🎓), role badges on cards, and Excel Role column are all now correct for A5 and A3 Education tenants.
Workaround until update: use GET /api/users/license_debug to see the raw
SKU IDs and current classification for each user.
[1.3.0] — 2026-03-26
Added — Biometric photo scanning (#9)
GDPR reference: Article 9 (special categories — biometric data), Article 5(1)(b) and (e), Recital 38, Databeskyttelsesloven §6
- PHOTO_EXTS — new constant covering .jpg .jpeg .png .bmp .tiff .tif .webp .heic .heif
- _detect_photo_faces(content, filename) — calls ds._get_cv2() + ds.detect_faces_cv2() (already in document_scanner.py); PIL fallback for HEIC/HEIF; minNeighbors=5 for conservative detection; returns face count or 0 on any failure; entirely safe — exceptions swallowed silently
- scan_photos option — new boolean scan option (default False — opt-in); extracted from scan_opts alongside delta and email_body
- 🖼 Scan photos for faces toggle in the Options panel, with hint: "Slower — opt in"
- Photo items flagged even without CPRs — a file is added to results if face_count > 0, even if no CPR number is found; photographs of identifiable people are Art. 9 data regardless of CPR content
- "biometric" auto-injected into special_category when faces are detected and "biometric" is not already present
- face_count field added to card payload, DB, Excel, and Article 30 report
DB (migration #4):
- face_count INTEGER NOT NULL DEFAULT 0 added to flagged_items via auto-migration
- save_item() updated to persist face_count
UI:
- 📷 N faces badge — teal photo-face-badge pill shown on cards in both grid and list view when face_count > 0
- 📷 Photos / biometric filter added to the Special dropdown in the filter bar; applyFilters() handles specialVal === 'photo'
- buildScanPayload() includes scan_photos; _applyProfile() restores it when loading a profile
Excel export:
- Face count column added as column 3 (between CPR Hits and Special category); URL column index updated from 10 → 11 for hyperlink styling
Article 30 report:
- Summary section: Photos with detected faces (Art. 9 biometric) row with item + face count; explanatory note on legal basis and parental consent (Databeskyttelsesloven §6)
- New dedicated section: Photographs and Biometric Data (Article 9) — intro paragraph, 4-bullet retention guidance (purpose limitation, pupil consent, website removal, archiving), item table (name, account, source, faces, modified date), capped at 50 rows
- Methodology section: bullet added describing OpenCV Haar cascade detection
Translated — EN / DA / DE (16 new keys per language)
[1.2.3] — 2026-03-26
Added — Profile management modal (#15d)
- ⚙ Profiles button in the sidebar Database row opens a modal listing all saved profiles
- Each profile row shows name (with ● active indicator), sources summary, description, and last run timestamp
- Use — loads the profile into the sidebar and updates the topbar dropdown; closes the modal
- Edit — expands an inline edit form directly in the row; saves name and description via POST /api/profiles/save
- Duplicate — creates a copy with a unique (copy) / (copy 2) suffix; reloads the list
- Delete — confirms, removes via POST /api/profiles/delete, clears _activeProfileId if the deleted profile was active
- Empty state shown when no profiles have been saved yet
- Translated — EN / DA / DE (14 new keys per language)
Added — Database export/import UI (#11)
- 🗄 Database sidebar section with Export and Import buttons (always visible; sits between Email report and User info)
- Export button — calls GET /api/db/export; triggers a browser download of a timestamped ZIP (gdpr_export_YYYYMMDD_HHmmss.zip) containing 8 JSON files; CPR hashes only, thumbnails stripped
- Import modal — file picker (.zip only), mode selector (Merge / Replace), replace warning panel, status line, and Import button; calls POST /api/db/import with multipart form data
- GET /api/db/export Flask route — generates ZIP in a temp file, streams bytes as application/zip attachment
- POST /api/db/import Flask route — accepts multipart file, mode, confirm; validates replace confirmation server-side; returns {ok, mode, imported: {table: count}}
- Translated — EN / DA / DE (17 new keys per language)
Changed — Article 9 keyword matching compiled to regex (#13)
- _load_keywords() now compiles one re.Pattern per Article 9 category at startup using a longest-first alternation: (?:keyword_a|keyword_b|…) with re.IGNORECASE
- Short keywords (≤ 4 chars) retain (?<!\w)…(?!\w) word-boundary anchors to prevent substring false positives
- _check_special_category() uses the compiled patterns via pattern.finditer() instead of a sequential str.find() loop over up to 459 entries
- Startup log now reports compiled category count: Loaded 459 keywords (9 categories compiled)
- Performance: ~10–50× faster for large tenants; negligible difference for typical school tenants (~100 flagged items); meaningful saving at 1 000+ items
[1.2.2] — 2026-03-21
Added — Profile selector in topbar (15c)
- Profile dropdown in the topbar, between the Scan button and the spacer — shows "Default (sidebar)" plus all saved profiles with their last run date
- 💾 Save button next to the dropdown — prompts for a name and saves the current sidebar state (sources, options, user selection, retention settings) as a named profile via POST /api/profiles/save
- onProfileChange() — fires when the dropdown changes; calls _applyProfile() to populate the sidebar controls from the selected profile
- _applyProfile(profile) — sets all source checkboxes, scan options, retention fields, and queues user selection for when the accounts list is loaded
- _applyPendingProfileUsers() — applies a profile's user_ids to the accounts list after loadUsers() completes; safe to call multiple times
- loadProfiles() — fetches /api/profiles and populates the dropdown; called on onAuthenticated()
- saveCurrentAsProfile() — collects the full buildScanPayload() state and posts it as a new or updated profile
- Profiles with a description show it as a tooltip on the dropdown option
- Selecting "Default (sidebar)" clears _activeProfileId so the sidebar is used directly with no profile applied
- Translated — EN / DA / DE (6 new keys)
[1.2.1] — 2026-03-21
Added — Scan profiles 15a + 15b
15a — Backend profile storage
- _profiles_load() — reads all profiles from ~/.m365_scanner_settings.json
- _profiles_write() — atomic write of the full settings dict
- _profile_from_settings() — wraps a flat settings dict as a profile object
- _profile_get(name_or_id) — case-insensitive lookup by name or UUID
- _profile_save(profile) — insert or update a profile
- _profile_delete(name_or_id) — delete by name or UUID
- _profile_touch(id, scan_id) — updates last_run and last_scan_id after a successful scan
- Automatic migration — on first run, existing flat ~/.m365_scanner_settings.json is silently wrapped into a profile named "Default"; no user action required
- Legacy shim — _save_settings() and _load_settings() continue to work unchanged; all existing headless setups are unaffected
- Profile API routes — GET /api/profiles, POST /api/profiles/save, POST /api/profiles/delete, GET /api/profiles/get for future UI use (15c/15d)
15b — CLI profile support
- --list-profiles — tabular listing of all profiles with name, sources, last run, and scan ID
- --save-profile NAME — saves current CLI options as a named profile; updates existing if name matches
- --delete-profile NAME — removes a profile by name
- --profile NAME — loads a named profile for --headless runs; populates sources, retention, fiscal year end, and email recipients from the profile; prints profile name, description, and last run before scanning
- After a successful headless scan, the active profile's last_run and last_scan_id are updated automatically
[1.2.0] — 2026-03-20
Added — Article 9 sensitive category detection (#3)
- keywords/da.json — 459 Danish keywords across 9 Article 9 categories: health, mental health, criminal (Art. 10), trade union, religion, ethnicity, political, biometric, and sexual orientation. Includes _false_positive_guidance for ambiguous terms and _proximity_note explaining the matching strategy
- keywords/ subfolder — mirrors the lang/ pattern; keywords/en.json and keywords/de.json can be added without code changes
- _load_keywords() — loads the keyword file at startup matching the active UI language; falls back to da.json
- _check_special_category(text, cprs) — returns a sorted list of matched Article 9 category keys; a keyword only triggers when within 150 characters of a CPR number (proximity filter); if no CPRs are present in the text, any keyword occurrence triggers
- Card badge — purple ⚠ Art.9 — health, criminal pill on flagged cards showing all detected categories
- Filter bar dropdown — "All risk levels / Art. 9 special category" quick filter in the results grid
- DB migration #3 — special_category TEXT NOT NULL DEFAULT '[]' added to flagged_items via auto-migration; stored as JSON array
- finish_scan() — counts special category items per scan and writes to scan_history.special_category for trend tracking
- Excel export — "Special category" column added as column 3 on all per-source sheets
- Article 30 report — special category item count and DPIA warning added to the summary section; "Art. 9" column added to the per-source breakdown table with purple highlighting on non-zero values
- Translated — EN / DA / DE (6 new keys per language)
- Build scripts — keywords/ folder bundled into PyInstaller app alongside lang/
- .gitignore — !keywords/*.json added to prevent keyword files being excluded by the *.json catch-all
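The proximity filter for keyword matches can be illustrated with a short sketch. It is not the scanner's `_check_special_category()`: the function name is hypothetical, CPR occurrences are passed as character offsets, and distance is measured between start offsets, all of which are assumptions.

```python
import re


def keyword_hits_near_cpr(text, pattern, cpr_positions, window=150):
    """Return keyword matches that lie within `window` characters of a CPR.

    cpr_positions holds the start offsets of CPR numbers found in `text`.
    When it is empty, every keyword match counts, mirroring the rule that
    any occurrence triggers if the text contains no CPR numbers.
    """
    hits = []
    for match in pattern.finditer(text):
        near = any(abs(match.start() - pos) <= window for pos in cpr_positions)
        if not cpr_positions or near:
            hits.append(match.group(0))
    return hits
```

The window keeps a keyword far away from any CPR number, for example in an unrelated paragraph, from flagging the item.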
[1.1.3] — 2026-03-20
Fixed
- Stray duplicate _get_bytes body — dead code block left after delete_drive_item_for_user from a previous edit has been removed
Changed — m365_connector.py
- Split timeouts — replaced all hardcoded timeout=30 / timeout=60 with two tuned constants:
  - _TIMEOUT_API = (10, 45) — 10s connect, 45s read for JSON API calls
  - _TIMEOUT_BYTES = (10, 120) — 10s connect, 120s read for file/attachment downloads
  - The 10s connect timeout makes hung connections fail fast; the read timeout allows slow wireless links to complete a transfer without aborting
- Exponential backoff with retry — all four core request methods (_get, _post, _get_bytes, _delete) now retry up to 4 times on transient network errors:
  - Retried: ConnectionError, Timeout, ChunkedEncodingError, ReadTimeout, HTTP 429, HTTP 503, HTTP 504
  - Not retried: HTTP 403 (permission), HTTP 410 (delta token expired) — raised immediately
  - Backoff: 2s → 4s → 8s between attempts (capped at 30s); 429 responses use the Retry-After header value
  - Intermittent wireless dropouts and brief gateway errors are now absorbed transparently without interrupting a scan
- Streaming file downloads — _get_bytes now uses stream=True and iter_content(65536) so large attachments are received in 64 KB chunks rather than one blocking read; prevents read timeouts on slow connections for large files
- list_users inline timeout — the _fetch helper inside list_users was using its own hardcoded timeout=30; updated to use _TIMEOUT_API
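The retry policy can be sketched without any HTTP library by injecting the request as a callable. This is an illustration under stated assumptions: `call_with_retry` and `backoff_delay` are hypothetical names, "up to 4 times" is read as four attempts in total, the built-in `ConnectionError` and `TimeoutError` stand in for the HTTP client's exceptions, and `Retry-After` is assumed to be given in seconds.

```python
import time

RETRY_STATUS = {429, 503, 504}  # transient statuses worth retrying
MAX_ATTEMPTS = 4                # assumed: four attempts in total


def backoff_delay(attempt, retry_after=None, base=2.0, cap=30.0):
    """Delay before the next attempt: 2s, 4s, 8s, ... capped at `cap`."""
    if retry_after is not None:
        return float(retry_after)  # server-provided wait for HTTP 429
    return min(base * (2 ** attempt), cap)


def call_with_retry(send, sleep=time.sleep):
    """Call send() until it succeeds, hits a non-retryable status, or runs out.

    send() must return (status, headers, body) or raise ConnectionError or
    TimeoutError. Statuses such as 403 or 410 are returned immediately.
    """
    for attempt in range(MAX_ATTEMPTS):
        is_last = attempt == MAX_ATTEMPTS - 1
        try:
            status, headers, body = send()
        except (ConnectionError, TimeoutError):
            if is_last:
                raise
            sleep(backoff_delay(attempt))
            continue
        if status in RETRY_STATUS and not is_last:
            retry_after = headers.get("Retry-After") if status == 429 else None
            sleep(backoff_delay(attempt, retry_after))
            continue
        return status, body
```

Passing `sleep` as a parameter makes the policy easy to exercise in tests without real waiting.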
[1.1.2] — 2026-03-20
Fixed
- App does not start after build — m365_db.py, scanner_worker.py, and VERSION were missing from PyInstaller datas in build_m365.py; the app crashed immediately on launch because these files could not be found inside the bundle
- _read_app_version() broken in both build scripts — still searched for APP_VERSION = "..." as a string literal in the scanner source, but both scanners now read from the VERSION file; build scripts updated to read VERSION directly
- VERSION not bundled — build.py (Document Scanner) also missing the VERSION file in datas
Added
- --purge CLI flag — permanently deletes all data files created by the scanner (SQLite database, Azure credentials, SMTP credentials, settings, checkpoint, delta tokens, language preference, OCR cache, MSAL token cache); prompts for yes confirmation; --yes skips prompt for scripted use
- --export-db FILE — exports the database to a structured ZIP archive containing 8 JSON files; thumbnails excluded; CPR stored as hashes only
- --import-db FILE — imports a previously exported ZIP; --import-mode merge (default) adds dispositions and deletion log only; --import-mode replace wipes and restores all tables; --yes skips confirmation on replace
[1.1.1] — 2026-03-19
Fixed
- Layout collapse in light mode — .topbar CSS rule was broken by an earlier edit; border-bottom and background properties were orphaned onto a dangling line, causing the topbar to render with no background and the Scan button to be nearly invisible
- Sidebar missing — .layout used height: 100vh which ignored body padding, causing the flex layout to overflow and the sidebar to disappear
- macOS pywebview titlebar overlap — content rendered behind the traffic-light buttons; fixed with padding-top: 30px on body when running inside pywebview on macOS, combined with box-sizing: border-box and height: 100% on .layout
- <option> elements not translated — applyI18n() used el.innerHTML on <option> elements; some browsers do not re-render the select's visible text when innerHTML is set on an already-mounted option; switched to el.textContent for option elements
- Disposition filter dropdown not translated on load — filter bar is hidden until first scan result arrives so applyI18n() on DOMContentLoaded missed it; applyI18n() is now called when the filter bar is first shown
- Card delete button z-index — added z-index: 1 to .card-delete-btn so it stacks correctly within its card context
Added
- --reset-db CLI flag — permanently drops and recreates all database tables; shows a summary of what will be deleted and requires typing yes to confirm
- --yes flag — skips confirmation prompts; use with --reset-db for scripted/automated resets
- ScanDB.reset() — new method in m365_db.py that drops all tables in correct foreign-key order, resets user_version to 0, and reopens the connection with a fresh schema
[1.1.0] — 2026-03-19
Added — M365 Scanner
- Student / staff role classification — O365 license SKU IDs used to classify users as 🎓 Student or 👔 Staff with no extra Azure permissions required. Hardcoded known Microsoft Education SKU IDs cover M365/Office 365 A1/A3/A5 for Students and Faculty. Fragment fallback for future SKUs.
- Role filter in accounts panel — All / 👔 Ansat / 🎓 Elev buttons filter the user list before selecting accounts to scan
- Role badge on result cards — 🎓/👔 pill shown on every card in grid and list view
- user_role in SQLite DB — stored in flagged_items table; DB migration applied automatically on first run
- Licensed users only — accounts without an assigned O365 license are excluded from the user list
- Disposition filter in filter bar — filter results grid by compliance disposition status
- Headless auto-delete of delete-scheduled items — items tagged for deletion are removed automatically after each headless scan
- Deletion audit log — every deletion logged to deletion_log table with timestamp, actor, reason, and legal basis
- GET /api/db/deletion_log — API endpoint for the deletion log
- Deletion log in Article 30 report — dedicated section with summary-by-reason table and full 7-column log
- Article 30 — student/staff split — Section 3 (Data Inventory) now shows Staff and Student tables separately; parental consent note added for student items (Databeskyttelsesloven §6)
- GET /api/users/license_debug — diagnostic endpoint showing raw SKU IDs and classified roles for each user
- _resolve_display_name() — resolves GUIDs and "Microsoft Konto" guest account placeholders to email address throughout UI and Article 30 report
- Account name in Article 30 — resolved via user_ids stored in scan options; GUID no longer shown in any column
- All Article 30 strings translated — deletion log section now uses L() throughout; 19 new keys in EN/DA/DE
- VERSION file — single source of truth; both scanners read version at startup via Path(__file__).parent / "VERSION"
- CHANGELOG.md — release history and versioning policy
- SECURITY.md — responsible disclosure process
- CONTRIBUTING.md — development setup, code style, PR process
- LICENSE — AGPL-3.0 with commercial licensing note and GDPR disclaimer
- .gitignore — covers credentials, databases, audit logs, venv, build artefacts
Fixed — M365 Scanner
- Language switching no longer reloads the page — translations applied in-place, scan results preserved
- Connect screen freeze — duplicate renderAccountList function definition caused a JavaScript syntax error that prevented onAuthenticated() from firing
- Account column in Article 30 report showing GUIDs — resolved via _acct_map built from stored user_ids
- "Microsoft Konto" / GUID display names on cards and in reports — resolved to email address
Changed — M365 Scanner
- Excel export — 9 columns (was 7): added Account (display name), Role, and Disposition; URL hyperlink column index updated accordingly
- Accounts list — licensed users only; assignedLicenses post-filter applied
[1.0.0] — 2026-03-19 — Initial public release
Document Scanner (server.py)
- Scan PDFs, Word, Excel, CSV, and image files for Danish CPR numbers
- OCR support via Tesseract for scanned/image-based PDFs
- NER-based detection of names, addresses, phone numbers, emails, IBANs, and bank accounts via spaCy
- CPR validation: strict Modulus 11 check + century-digit verification
- Redaction modes: mask CPR only, or full anonymisation of all personal data
- Face detection and blurring in image files via OpenCV
- Risk scoring per file based on CPR count, age, and PII density
- Dry-run mode — scan without writing any output files
- JSON audit log (scanner_audit.jsonl) — append-only, records every action
- SQLite OCR cache (~/.document_scanner_ocr_cache.db) — avoids re-OCR of unchanged pages
- Web UI on port 5000 with grid and list view, live progress, drag-and-drop upload
- Standalone macOS .app and Windows .exe via PyInstaller + pywebview
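The Modulus 11 part of the CPR validation listed above follows the standard weighting scheme and can be sketched as below. The function name is hypothetical, and the century-digit verification that the scanner also performs is not shown here.

```python
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)  # standard CPR modulus-11 weights


def passes_modulus_11(cpr: str) -> bool:
    """Check the modulus-11 control digit of a 10-digit CPR number.

    Separators such as "-" are ignored. The weighted digit sum must be
    divisible by 11. Date and century-digit checks are NOT performed here.
    """
    digits = [int(ch) for ch in cpr if ch.isdigit()]
    if len(digits) != 10:
        return False
    return sum(d * w for d, w in zip(digits, CPR_WEIGHTS)) % 11 == 0
```

A strict validator would combine this check with date validation before treating a 10-digit sequence as a CPR number.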
M365 Scanner (m365_scanner.py)
Scanning
- Exchange mailboxes: all folders and subfolders, recursive, language-independent using wellKnownName identifiers
- OneDrive, SharePoint, Teams file scanning via Microsoft Graph API
- Attachment scanning: PDF, Word, Excel inside emails
- CPR detection with the same strict validator as the Document Scanner
- NER-based PII detection (phone, IBAN, bank account, name, address, org)
- Progressive streaming — results appear card-by-card via Server-Sent Events
- Incremental / resumable scans — checkpoint saved on interruption, resume on next run
- Delta scan — Graph /delta endpoints fetch only changed items since last scan
- Per-item thumbnail generation — image previews and placeholder SVGs
Results
- Results grid with grid and list view, search, source filter, and disposition filter
- Account name and role (🎓 Student / 👔 Staff) badge on every card
- 🗓 Overdue badge on items exceeding the retention cutoff
- Preview panel with iframe preview, metadata strip, and disposition dropdown
Compliance features
- Retention policy enforcement (GDPR Art. 5(1)(e)): rolling or fiscal-year cutoff (e.g. Bogføringsloven Dec 31), 🗓 Overdue badge, bulk-delete quick filter, headless auto-delete via --retention-years and --fiscal-year-end
- Data subject lookup (Art. 15/17): modal, CPR hashed before query, bulk delete with audit logging
- Disposition tagging (Art. 5(1)(a)): Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted — filter bar, preview panel, Excel export, headless auto-delete of scheduled items
- Deletion audit log (Art. 5(2)): every deletion logged with timestamp, actor, reason, legal basis
- Article 30 report (Art. 30): structured .docx export — summary, data categories, data inventory (staff and student sections), retention analysis, compliance trend, deletion audit log, methodology
User management
- Application mode (service account) and Delegated mode (device code flow)
- License-based role classification: 🎓 Student / 👔 Staff detected from O365 SKU IDs — no extra permissions needed
- Role filter buttons in accounts panel (All / 👔 Ansat / 🎓 Elev)
- Licensed users only — accounts without an assigned license are excluded
- Display name resolution: GUIDs and "Microsoft Konto" guest placeholders resolved to email address
Database (m365_db.py)
- SQLite persistence layer alongside JSON session cache
- Tables: scans, flagged_items, cpr_index, pii_hits, dispositions, scan_history, deletion_log
- CPR numbers stored as SHA-256 hashes only — never in plaintext
- Schema migration support via _MIGRATIONS + user_version pragma
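Hashing before storage and versioned migrations via the `user_version` pragma can be sketched with the standard library. This is a simplified illustration and not the schema in m365_db.py: the two migration statements, the digit normalisation, and the unsalted hash are all assumptions.

```python
import hashlib
import sqlite3


def cpr_hash(cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number (digits only, assumed)."""
    digits = "".join(ch for ch in cpr if ch.isdigit())
    return hashlib.sha256(digits.encode("utf-8")).hexdigest()


# Hypothetical migration list: entry N brings the schema to user_version N+1.
MIGRATIONS = [
    "CREATE TABLE flagged_items (id TEXT PRIMARY KEY, name TEXT)",
    "ALTER TABLE flagged_items ADD COLUMN face_count INTEGER NOT NULL DEFAULT 0",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations and return the resulting schema version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        # PRAGMA values cannot be bound as parameters; version is an int here.
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running `migrate()` again on an up-to-date database applies nothing, which is what makes it safe to call on every startup.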
Exports
- Excel export: 9 columns including Account, Role, Disposition; per-source sheets with auto-filter
- Article 30 Word document export
- Email report via SMTP (STARTTLS / SMTPS / plain); headless --email-to flag
Headless / scheduled mode
- --headless --output DIR --settings FILE for cron / Task Scheduler
- --retention-years N --fiscal-year-end MM-DD for automated retention enforcement
- --email-to for automated report delivery
- Non-interactive: deletes automatically; interactive (TTY): prompts for confirmation
Internationalisation
- Language files: English (en), Danish (da), German (de)
- Language switching applies in-place — no page reload, scan results preserved
Installation
- install_windows.ps1: Python, Tesseract, Poppler, venv — all local to project folder, no system PATH changes; all downloads via curl.exe
- install_macos.sh: Homebrew, Python 3.12, Tesseract, Poppler, spaCy model
- Dockerfile + docker-compose.yml for containerised deployment
- GitHub Actions: 4 parallel build jobs (Document Scanner + M365 × Windows + Linux), auto-release on v* tags
Versioning policy
- PATCH (1.0.x) — bug fixes, translation updates, minor UI tweaks
- MINOR (1.x.0) — new feature, new suggestion from SUGGESTIONS.md implemented
- MAJOR (x.0.0) — breaking change: DB migration required, config format change, or Azure permission requirement change
To release a new version:
# 1. Update VERSION
echo "1.1.0" > VERSION
# 2. Update CHANGELOG (add new section above [1.0.0])
# 3. Commit and tag
git commit -am "Release 1.1.0"
git tag v1.1.0
git push && git push --tags
# GitHub Actions builds and publishes automatically