GDPRScanner

Author	SHA1	Message	Date
StyxX65	874c3ccec1	Add "prefer SMTP" toggle to skip Microsoft Graph for email When the M365 connector is connected the app always tries Graph first, and a Graph 202 ends the send — so report mail to recipients Exchange silently drops (Google-hosted subdomains of the O365 domain) never reaches them, even with working SMTP configured. New prefer_smtp flag gates all three Graph branches (smtp_test, send_report, _maybe_send_auto_email) so they go straight to SMTP. UI toggle #st-smtpPreferSmtp in Settings → E-mailrapport, saved/loaded by scheduler.js, with da/de/en strings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 11:30:45 +02:00
StyxX65	526e2b0b78	Fix SMTP auth: settings tab saved wrong config keys The Settings → E-mailrapport tab (scheduler.js) saved the SMTP username as `user` and TLS flag as `starttls`, but every backend reader expects `username`/`use_tls` (routes/email.py). Result: username was always empty, server.login() was skipped, and the SMTP server rejected the send — surfacing as a misleading "authentication failed" message even with a valid App Password. The bug was latent because Graph is preferred whenever M365 is connected, so the SMTP path was rarely exercised. - scheduler.js: send/load canonical keys (username, use_tls). The send-report modal (scan.js) already used these. - _load_smtp_config(): normalise legacy user→username / starttls→use_tls so configs saved before the fix work without re-entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 11:25:15 +02:00
StyxX65	b661a94f98	Restore user/group badges on DB-loaded result cards The card badge only rendered when f.account_name was set, and the group (role) badge was nested inside that same check. But save_item never persisted account_name — only account_id (a GUID) and user_role. Live SSE cards carried account_name so badges showed during a scan; now that the grid loads finalized scans from the DB, the gap is exposed and both badges vanish for earlier scans. - Persist account_name (migration 11 + save_item) so future scans show the user badge. Both M365 and Google cards already carry it. - _accountPill() in results.js drives the group badge off user_role alone (shows for legacy rows) and resolves a best-effort user label: account_name → S._allUsers (id/email) → email-style account_id → omit. Both card layouts share the one helper. Legacy rows still lack account_name (never captured), but now show the group badge and a resolved/email user label where possible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:15:19 +02:00
StyxX65	68076eba52	Show all open (unactioned) items by default, not just the last scan The default results view loaded only the latest scan session (±300s window), so items dropped out of sight once a newer scan started — and a long scheduled scan could show little or nothing on browser open. Add get_open_items(): every flagged item with no disposition (or status 'unreviewed') across all scans, deduped by id to the latest finished scan. GET /api/db/flagged now serves it when no ?ref is given; ?ref=N still loads a specific past session. Frontend loadHistorySession(null) routes to a new loadOpenItems() loader. Rename the banner button to "Open items" (da/de/en). get_session_items() default is unchanged — export.py and scan_scheduler.py still rely on latest-session for the current scan's report/email. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 09:19:55 +02:00
StyxX65	f84c8516df	Reliably restore last session on refresh after a server restart The page-load restore was one-shot and bailed when a completed scan's replayed scan_phase left a running flag set; sse_replay_done (the other retry) only fires for a non-empty replay buffer, which is empty after a restart — so refreshing post-update showed a blank grid despite the results being in the DB. The watchdog now retries the restore on each 4s poll while nothing is shown and no scan runs, clearing stale flags first. /api/scan/status also reports google_running separately so a refresh during a live Google scan is no longer treated as idle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-16 11:53:07 +02:00
StyxX65	bdba80e72d	Remove stale link preview from share modal after create The generated-link "Copy link:" row stayed visible after creating, looking like the form hadn't reset — but the new link was already in the Active links list with its own Copy button. Drop the redundant preview row; on create, reset the form and briefly highlight the new entry in the active list. Removes the now-dead shareNewLinkRow markup and copyShareLink(). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 10:11:03 +02:00
StyxX65	9cbd93e1f5	Reset all share modal fields after creating a link Create only cleared the label; scope type, user email, date range, and expiry carried over, so the next link silently inherited the previous link's scope. Extracted openShareModal's reset logic into _resetShareForm() and call it after every successful create. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:32:03 +02:00
StyxX65	679f91da2c	Use page origin for share links except when browsing at localhost The LAN-IP rewrite in _getShareBaseUrl() exists to fix unusable 127.0.0.1 links; applying it to every origin meant links copied behind a reverse proxy pointed at http://<LAN-IP>:5100, bypassing TLS. HTTPS and non-localhost origins are now used as-is. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:14:33 +02:00
StyxX65	35e767b506	Fix copy buttons doing nothing over plain HTTP navigator.clipboard is undefined in non-secure contexts, so the direct writeText() call threw synchronously and the execCommand fallback in its .catch() never ran. _copyText() now feature-detects the API, falls back to execCommand('copy'), then to a prompt() for manual copying. log.js reuses the helper; _getShareBaseUrl() caches the LAN-IP lookup so token Copy buttons stay within the click gesture execCommand requires. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:09:34 +02:00
StyxX65	a325349ecd	Fix stale ~/.gdpr_scanner_* paths in help text, docs, and UI strings Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:41:23 +02:00
StyxX65	6a4b0e1706	Show delta token source count, add hint bubble, fix README data paths Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:27:14 +02:00
StyxX65	c0e45df440	Add software update from Settings GUI and update_gdpr.sh script Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 12:54:29 +02:00
StyxX65	95f1f39a1f	Keep data-subject-deleted cards in grid until next scan Apply the keep-until-next-scan behaviour to deleteSubjectItems: mark the deleted items _deleted (using deleted_ids from the response) and keep them greyed in the grid instead of filtering them out. Also fixes a latent bug where renderGrid() was called with no argument and threw on files.forEach, which the surrounding try/catch swallowed as a false "Delete failed" after a successful erasure. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:47:52 +02:00
StyxX65	386831c423	Keep bulk-deleted cards in grid until next scan Extend the keep-until-next-scan behaviour to the bulk delete modal: instead of removing matched cards on success, mark them _deleted and keep them greyed with a "🗑 Deleted" badge and hidden buttons. /api/delete_bulk now returns deleted_ids so the grid marks exactly the items the server actually deleted — partial failures stay active and re-deletable. Already-handled (_deleted / _redacted) items are excluded from the bulk-delete match set so they aren't re-counted or re-processed. 201 tests pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:46:14 +02:00
StyxX65	ed3c3a80d6	Keep deleted cards in grid until next scan Mirror the redact behaviour for the card delete button (🗑): instead of removing the card on success, mark the item _deleted and keep it in the grid — greyed via card-resolved, shown with a red "🗑 Deleted" badge, action buttons hidden so it can't be re-processed. The grid is rebuilt on the next scan run, clearing the markers. results.js only — no server change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:44:10 +02:00
StyxX65	7c1c2b390d	Keep selected card in view when opening preview Opening the preview panel narrows .grid-area and reflows the auto-fill grid to fewer columns, moving the clicked card to a new row. The single-frame scrollIntoView ran while the browser's scroll-anchoring re-adjusted scrollTop mid-reflow, so the card scrolled out of view. Disable scroll anchoring on .grid-area (overflow-anchor:none) and defer the scroll by two animation frames against the settled layout, centring the card (block:'center'). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:35:04 +02:00
StyxX65	d82a0d6004	Keep redacted cards in grid until next scan Redacting a card (✏) previously removed it from the grid and from S.flaggedData/S.filteredData immediately. Now the item is marked _redacted and kept: greyed via card-resolved styling, shown with a "✏ Redacted" badge, and its delete/redact buttons hidden so it can't be re-processed. The grid is rebuilt on the next scan run, which clears the markers. results.js only — no server change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:30:41 +02:00
StyxX65	c39d68ca19	Document XSS escaping + secret-encryption hardening - CHANGELOG: add Unreleased ### Security section covering the stored XSS in the results grid, the reflected XSS in /api/thumb, and the Claude API key now being encrypted at rest. - CLAUDE.md / static/js/CLAUDE.md: add the esc() / _html_esc escaping rule for scan-derived strings and the onclick-JSON &quot; pattern. - CLAUDE.md / routes/CLAUDE.md: note that secret config fields use the machine-keyed Fernet and must be read via a decrypting accessor (get_claude_api_key()), never config.json directly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:15:39 +02:00
StyxX65	b6d2915d49	Harden XSS escaping and encrypt Claude API key at rest - results.js: add esc() helper and apply to all scan-derived fields (name, account_name, folder, source, modified, label, img alt) across card/list/preview/subject-lookup/related views. Scan-derived strings can carry attacker-controlled markup (e.g. a OneDrive file named with HTML), so they must be escaped before innerHTML/attribute embedding. Also escape the related-docs onclick JSON to match the delete/redact &quot; pattern. - cpr_detector._placeholder_svg: escape label/name before embedding; the SVG is served as image/svg+xml via /api/thumb?name=, so an unescaped value was a reflected-XSS vector when the URL is opened directly. - cpr_detector: remove 44-line unreachable duplicate of the face-detection body left inside _extract_audio_metadata after its return. - app_config: encrypt claude_api_key at rest with the machine-keyed Fernet (same as the SMTP password); add get_claude_api_key() for decryption. Legacy plaintext keys still read and are re-encrypted on next save. Update readers in document_scanner.py and routes/app_routes.py. 201 tests pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 11:06:36 +02:00
StyxX65	1903115e02	CLAUDE.md restructured	2026-06-08 14:44:37 +02:00
StyxX65	f845a2f686	Fix cards not shown after browser refresh. When the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay.	2026-06-08 14:28:24 +02:00
StyxX65	fa6601ffdd	Bugfixes	2026-06-01 15:15:43 +02:00
StyxX65	4e5a8934d7	Fix Google scan not stopping cleanly before a new scan starts	2026-05-29 04:53:42 +02:00
StyxX65	034ced943e	Extended document redaction to Google Drive, SFTP, SMB, and local PDFs Extends the ✂ in-place redaction feature beyond local DOCX/XLSX/CSV/TXT files to cover all remaining file source types and adds PDF support for local files.	2026-05-28 17:47:02 +02:00
StyxX65	6ce7583b26	Added NER/AI integration	2026-05-28 11:50:10 +02:00
StyxX65	26c45165b9	v1.6.28 — Scheduled report-only jobs, compliance audit log, and documentation update - Scheduled jobs can now run in report-only mode (skip scan, email latest DB results) - Compliance audit log records all significant admin actions in an immutable DB table - VERSION bumped to 1.6.28; CHANGELOG [Unreleased] sealed as [1.6.28] — 2026-05-28 - Both manuals updated: CPR-only mode, OCR language, file redaction, related documents, date-range token scoping, report-only jobs, audit log tab, two new FAQ entries - TODO.md updated with all completed tasks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 11:08:52 +02:00
StyxX65	744813f4ac	Add compliance audit log Immutable audit_log table in the scanner DB records every significant admin action (profile save/delete, token create/revoke, PIN changes, source add/update/delete, scheduler job changes, scan start/stop, SMTP save, dispositions, item delete/redact). GET /api/audit_log exposes entries newest-first. New Audit Log tab in the Settings modal renders the table on demand. Settings modal widened 540→640 px and tab labels set to white-space:nowrap so the six-tab row fits on one line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 10:51:23 +02:00
StyxX65	4ef2dfb352	Date-range scoping for viewer tokens	2026-05-28 10:34:55 +02:00
StyxX65	c820d6f6db	Two bugs in the abort mechanism: 1. POST /api/scan/stop only set state._scan_abort (M365/file abort event) but never touched state._google_scan_abort. Now sets both. 2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort (= state._scan_abort, the M365 event) instead of using the module-level _scan_abort alias (= state._google_scan_abort). This meant the dedicated /api/google/scan/cancel endpoint — which correctly sets _google_scan_abort — was silently ignored by the scan loop. Fixed to use the module-level alias consistently. Also aligned the end-of-scan checkpoint-clear check.	2026-05-28 10:20:22 +02:00
StyxX65	2c5f5d3283	Add OCR language override setting Operators can now choose Tesseract language pack(s) per profile via a sidebar select (#optOcrLang) and profile editor (#peOptOcrLang). Presets: dan+eng (default), dan, eng, dan+eng+deu, dan+eng+swe, dan+eng+fra. The ocr_lang option flows from the UI through all three scan engines (M365 files/attachments, Google Drive, Gmail) down to document_scanner.scan_pdf and scan_image — including the spawned PDF-OCR subprocess worker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 09:59:40 +02:00
StyxX65	23b9555dcf	Built-in file redaction for local files	2026-05-27 14:49:06 +02:00
StyxX65	78fb406422	Fixed two bugs: selected cards staying visible after preview opens, and stale history results showing when a new scan starts.	2026-04-29 15:18:58 +02:00
StyxX65	d84e57239a	Add CPR cross-referencing (related documents) Clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel, listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and scan history mode. Implementation - GDPRDb.get_related_items() — SQL self-join on the existing cpr_index table using the same symmetric 300 s session window as get_session_items. No new data collection needed. - GET /api/db/related/<item_id>?ref=N — new endpoint in routes/database.py, consistent with the ?ref convention used by /api/db/flagged. - #previewRelated div injected between the metadata block and disposition row in the preview panel. - _loadRelated(f) in results.js fetches and renders the list; window._openRelated() resolves items from the live grid or falls back to the API response for history-mode items. Also - Added keyword/FTS5 search as a deferred idea in SUGGESTIONS.md - Updated CHANGELOG.md, README.md, and CLAUDE.md	2026-04-25 21:15:50 +02:00
StyxX65	8b55e9d933	Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items.	2026-04-25 20:30:59 +02:00
StyxX65	2254e00481	Added email and phone number detection as opt-in scan options across all three engines, plus translation fixes. CHANGELOG and SUGGESTIONS are both updated.	2026-04-25 19:33:28 +02:00
StyxX65	e35bbe78a5	Added SFTP to sources	2026-04-25 08:48:54 +02:00
StyxX65	f7f1194d63	Fix: Profile copy rename not reflected in left column until modal reopen	2026-04-21 20:33:16 +02:00
StyxX65	c350014b16	fix: scan button stuck, CPR dedup crash, role scope filter, profile race conditions; add auto-email toggle and route integration tests	2026-04-21 18:43:25 +02:00
StyxX65	7c1afca80b	Bugfixes fix: select mode onclick exports, multi-source progress counter, OCR page-by-page	2026-04-21 13:12:54 +02:00
StyxX65	d8083eb0c0	feat: interface PIN, bulk disposition tagging, Google Drive delta scan, OCR memory fixes - Interface PIN: optional session-level auth gate for the main scanner UI (Settings → Security → Interface PIN). Salted SHA-256 in config.json, rate-limited (5 attempts/5 min per IP). /view and viewer auth exempt. New /login page, before_request hook, GET/POST/DELETE /api/interface/pin, POST /api/interface/pin/verify, POST /api/interface/logout. - Bulk disposition tagging: Select mode (filter bar "Vælg" button) reveals per-card checkboxes. Bulk tag bar at bottom of grid; POST /api/db/disposition/bulk. Disposition stats bar (total · unreviewed · retain · delete · % reviewed) updates after every save. - Google Drive delta scan: uses Drive Changes API when delta is enabled. Per-user token stored as gdrive:{email} in delta.json. Load-then-merge save avoids racing with concurrent M365 token writes. - PDF OCR OOM fix: render one page at a time with convert_from_path (first_page=N, last_page=N). Added _ocr_mem_ok() psutil guard (500 MB threshold) before each page render across scan_pdf, redact_fitz_pdf, redact_pdf. - Email test message translation fix: routes/email.py returns structured {ok, method, recipients} instead of a hardcoded English string; scheduler.js builds the translated message client-side. - Docs: CHANGELOG, README, TODO, MANUAL-EN, MANUAL-DA all updated. Lang files (en/da/de) extended with bulk, interface PIN, and SMTP keys. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 18:46:45 +02:00
StyxX65	c9aab19a97	feat: scan history browser, user-scoped viewer tokens, export fixes, email fixes (v1.6.20) - Scan history browser (history.js, GET /api/db/sessions, get_sessions(), get_session_items(ref_scan_id)) — review any past session without rescanning - User-scoped viewer tokens (#34) — scope by individual employee across M365 and GWS; autocomplete from Accounts list; dual-email support - Fix: GWS scan never marked finished (end_scan → finish_scan) and emitted wrong SSE event (scan_done → google_scan_done), excluding GWS items from all exports - Fix: file scan begin_scan called with wrong keyword args (TypeError swallowed), so local/SMB items were never written to DB - Fix: Graph sendMail reported failure on success — _post() now returns {} on empty 202 response instead of raising JSONDecodeError - Fix: Graph error hidden behind generic "No SMTP host" message when both Graph and SMTP were unavailable - Fix: Gmail vs Google Workspace SMTP error messages distinguished by username domain; Workspace errors point to admin console, not personal security settings - Docs: update README, MANUAL-EN, MANUAL-DA, CLAUDE.md, TODO.md, CHANGELOG.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 13:57:54 +02:00
StyxX65	1aaf400771	feat: role-scoped viewer tokens that restrict shared links to student or staff items Add a Role scope dropdown to the Share modal (All roles / Ansatte / Elever). Scope is stored as {"role": "student"|"staff"} in viewer_tokens.json and enforced server-side in GET /api/db/flagged via session["viewer_scope"]. Client-side, #filterRole is pre-set and hidden for scoped viewers so the constraint cannot be bypassed. Existing tokens and PIN sessions remain unrestricted. Role badge shown on each scoped token row in the Active links list. Files: app_config.py, routes/viewer.py, routes/database.py, gdpr_scanner.py, templates/index.html, static/js/viewer.js, static/js/auth.js, lang/en.json, lang/da.json, lang/de.json, CLAUDE.md, CHANGELOG.md, README.md, MANUAL-EN.md, MANUAL-DA.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 09:30:38 +02:00
StyxX65	0c35a7a83d	feat: role filter in results grid + role-scoped Excel and Art.30 exports - New Role dropdown in filter bar (All / Ansatte / Elever); filters the results grid client-side via applyFilters() and clearFilters(). - Exports respect the active role: exportExcel() and exportArticle30() append ?role=student|staff to the fetch URL when a role is selected. - _build_excel_bytes(role='') and _build_article30_docx(role='') filter to a local _items list at the top; all internal sheets (Summary, GPS, External transfers, Art.30 staff/student tables) see only the filtered subset. Filenames get _elever or _ansatte suffix. - i18n: m365_filter_all_roles / m365_filter_staff / m365_filter_student added to en/da/de.json. - CLAUDE.md, README.md, CHANGELOG.md, MANUAL-EN.md, MANUAL-DA.md updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 09:02:52 +02:00
StyxX65	28c9effd17	feat: student scan filters — skip GPS images and min CPR threshold New profile options to reduce noise when scanning student accounts: - skip_gps_images: images flagged solely by GPS coordinates are suppressed. GPS data is still extracted and shown in the detail card when the item is flagged by another signal (faces, EXIF author/comment). - min_cpr_count (default 1): only flag a file if it contains at least N distinct CPR numbers. Deduplication is by value. Faces and EXIF PII still trigger flags regardless of CPR count. Both options apply to M365, Google, and file scan paths. Saved in profiles and editable in the Profile Manager editor. Docs, manuals, i18n (DA/EN/DE), CHANGELOG, and VERSION (1.6.14 → 1.6.15) updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 08:48:12 +02:00
StyxX65	6e0aab788a	Fix: macOS runner, scan hang, export sources, profile role filter/badge	2026-04-12 07:48:26 +02:00
Henrik Højmark	3ad68b45f7	Fix viewer share links to use LAN IP; bind Flask to 0.0.0.0 Share links copied from the Share modal were built with window.location.origin, producing 127.0.0.1 URLs that remote viewers could never reach. - Bind Flask to 0.0.0.0 in gdpr_scanner.py (--host default), m365_launcher.py, and build_gdpr.py so the server is reachable on the local network. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep 127.0.0.1. - Add /api/local_ip endpoint: UDP probe to 8.8.8.8 discovers the active LAN IP without sending real traffic. - Add _getShareBaseUrl() in viewer.js: fetches /api/local_ip and substitutes the LAN IP; falls back to window.location.origin. - createShareLink and copyTokenLink are now async and await _getShareBaseUrl() before building the viewer URL. - Update CLAUDE.md and static/js/CLAUDE.md with the new invariants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 06:14:17 +02:00
Henrik Højmark	9c7df76fbd	Initial commit	2026-04-11 04:38:11 +02:00
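
Commit 3ad68b45f7 describes discovering the active LAN IP with a UDP probe to 8.8.8.8 that sends no real traffic. The sketch below shows one way that technique can work in Python. It is an illustration, not the project's code: the function name `discover_lan_ip`, the probe port 80, and the loopback fallback are all assumptions.

```python
import socket


def discover_lan_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Best-effort LAN IPv4 address of this machine (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket transmits nothing; it only asks the OS
        # to choose the outgoing interface for this destination.
        sock.connect((probe_host, probe_port))
        # The local end of the "connection" is the address of that interface.
        return sock.getsockname()[0]
    except OSError:
        # Assumption: with no usable route, fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(discover_lan_ip())
```

Because the probe never sends a datagram, the helper also works when 8.8.8.8 itself is unreachable, as long as the machine has a default route.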

47 Commits
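
Commit 526e2b0b78 mentions normalising legacy SMTP config keys (`user` to `username`, `starttls` to `use_tls`) so that configs saved before the fix keep working. A minimal sketch of that idea follows. The function name `normalise_smtp_config` is hypothetical, and letting an existing canonical key win over its legacy counterpart is an assumption; the commit does not say how `_load_smtp_config()` resolves that case.

```python
# Legacy key -> canonical key, as named in the commit message.
LEGACY_TO_CANONICAL = {"user": "username", "starttls": "use_tls"}


def normalise_smtp_config(cfg: dict) -> dict:
    """Return a copy of cfg with legacy SMTP keys mapped to canonical ones."""
    out = dict(cfg)  # never mutate the caller's dict
    for old, new in LEGACY_TO_CANONICAL.items():
        if old in out:
            value = out.pop(old)
            # Assumption: a canonical value that is already present wins.
            out.setdefault(new, value)
    return out


if __name__ == "__main__":
    legacy = {"host": "smtp.example.com", "user": "reports", "starttls": True}
    print(normalise_smtp_config(legacy))
```

Normalising at load time keeps every reader on a single key set, so old config files need no separate migration step.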