Compare commits

96 Commits

Author SHA1 Message Date
StyxX65
efbbeb7306 Restore M365Connector.delete_message (was an orphaned method body)
The def line for delete_message had been lost, leaving its body as
unreachable dead code at the end of _delete() and no delete_message
attribute on the connector. Deleting an Outlook message therefore failed
with "'M365Connector' object has no attribute 'delete_message'". Restored
the method (soft-delete: move to Deleted Items, fall back to DELETE).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:43:46 +02:00
StyxX65
54f8848e30 Document renderGrid landing-card hiding in static/js/CLAUDE.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:49:43 +02:00
StyxX65
8a446509c6 Hide landing/last-scan card whenever results render
The live scan_file_flagged handler showed the grid but never hid
#emptyState / #lastScanSummary, so when a scan ran with the landing
card visible, results appeared underneath it until a manual refresh
(which re-ran loadOpenItems and cleared it). Hide both panels in
renderGrid whenever files are present, covering every render path
(live SSE, open-items load, history, filters).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:45:50 +02:00
StyxX65
d55778ab35 Release 1.7.9: changelog + manual updates
Document this cycle's changes: open-items default results view,
interrupted-scan recovery, restored user/group badges, the SMTP
username-key fix, and the new "always send via SMTP" toggle. Stamp
manuals (EN/DA) to 1.7.9.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:36:41 +02:00
StyxX65
874c3ccec1 Add "prefer SMTP" toggle to skip Microsoft Graph for email
When the M365 connector is connected the app always tries Graph first,
and a Graph 202 ends the send — so report mail to recipients Exchange
silently drops (Google-hosted subdomains of the O365 domain) never
reaches them, even with working SMTP configured.

New prefer_smtp flag gates all three Graph branches (smtp_test,
send_report, _maybe_send_auto_email) so they go straight to SMTP. UI
toggle #st-smtpPreferSmtp in Settings → E-mailrapport, saved/loaded by
scheduler.js, with da/de/en strings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:30:45 +02:00
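The gating described in 874c3ccec1 can be pictured as one small decision function. This is a minimal sketch, not the project's code: choose_transport and its parameters are hypothetical names, only the prefer_smtp flag comes from the commit message, and falling back to Graph when SMTP is not configured is an assumption.

```python
def choose_transport(m365_connected: bool, smtp_configured: bool,
                     prefer_smtp: bool = False) -> str:
    """Return "graph", "smtp", or "none" for an outgoing report mail."""
    # With prefer_smtp set, skip Graph entirely when SMTP is usable.
    if prefer_smtp and smtp_configured:
        return "smtp"
    # Otherwise keep the old order: Graph first whenever M365 is connected.
    if m365_connected:
        return "graph"
    # No Graph available: use SMTP if there is a configuration for it.
    return "smtp" if smtp_configured else "none"
```

Keeping the choice in one helper means the three call sites named above (smtp_test, send_report, _maybe_send_auto_email) cannot drift apart.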
StyxX65
526e2b0b78 Fix SMTP auth: settings tab saved wrong config keys
The Settings → E-mailrapport tab (scheduler.js) saved the SMTP username
as `user` and TLS flag as `starttls`, but every backend reader expects
`username`/`use_tls` (routes/email.py). Result: username was always
empty, server.login() was skipped, and the SMTP server rejected the
send — surfacing as a misleading "authentication failed" message even
with a valid App Password. The bug was latent because Graph is preferred
whenever M365 is connected, so the SMTP path was rarely exercised.

- scheduler.js: send/load canonical keys (username, use_tls). The
  send-report modal (scan.js) already used these.
- _load_smtp_config(): normalise legacy user→username / starttls→use_tls
  so configs saved before the fix work without re-entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:25:15 +02:00
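The legacy-key normalisation in 526e2b0b78 might look roughly like the sketch below. The function name normalise_smtp_config is hypothetical (the commit only names _load_smtp_config()), and letting an existing canonical key win over a legacy one is an assumption.

```python
def normalise_smtp_config(cfg: dict) -> dict:
    """Return a copy of cfg that uses the canonical username/use_tls keys."""
    out = dict(cfg)
    # Legacy key -> canonical key, as named in the commit message.
    for old, new in (("user", "username"), ("starttls", "use_tls")):
        if old in out:
            legacy_value = out.pop(old)
            # Assumption: a value already stored under the canonical key wins.
            out.setdefault(new, legacy_value)
    return out
```

For example, a config saved before the fix as {"host": "h", "user": "u", "starttls": True} comes back with username and use_tls set and the legacy keys removed.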
StyxX65
b661a94f98 Restore user/group badges on DB-loaded result cards
The card badge only rendered when f.account_name was set, and the
group (role) badge was nested inside that same check. But save_item
never persisted account_name — only account_id (a GUID) and user_role.
Live SSE cards carried account_name so badges showed during a scan;
now that the grid loads finalized scans from the DB, the gap is exposed
and both badges vanish for earlier scans.

- Persist account_name (migration 11 + save_item) so future scans show
  the user badge. Both M365 and Google cards already carry it.
- _accountPill() in results.js drives the group badge off user_role
  alone (shows for legacy rows) and resolves a best-effort user label:
  account_name → S._allUsers (id/email) → email-style account_id → omit.
  Both card layouts share the one helper.

Legacy rows still lack account_name (never captured), but now show the
group badge and a resolved/email user label where possible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:15:19 +02:00
StyxX65
29d9168643 Recover unfinished scans so their items aren't stranded
get_session_items / get_open_items / latest_scan_id all require
finished_at IS NOT NULL, but the M365 and Google engines return early
on abort (skipping finish_scan) and a process kill mid-scan (deploy,
OOM, crash) never reaches it either. Result on prod: 41/42 scans had
finished_at NULL, so 291 already-saved flagged items were invisible —
the grid showed nothing.

- finalize_orphan_scans(): finalises every finished_at-NULL scan; runs
  once at startup before the scheduler (nothing is scanning at boot, so
  any unfinished scan is dead). Recovers existing stranded items and
  guards against future mid-scan restarts.
- run_scan: finalise the DB scan on the abort early-return too, so a
  stopped scan's items stay visible without waiting for a restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:51:22 +02:00
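As an illustration of the startup recovery in 29d9168643, here is a minimal sqlite3 sketch. The scans table layout is an assumption; only the finished_at column and the finalize_orphan_scans name appear in the commit message.

```python
import sqlite3
import time
from typing import Optional

def finalize_orphan_scans(conn: sqlite3.Connection,
                          now: Optional[float] = None) -> int:
    """Stamp finished_at on every scan that never reached finish_scan."""
    stamp = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (stamp,),
    )
    conn.commit()
    return cur.rowcount  # number of scans recovered

# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, finished_at REAL)")
conn.executemany("INSERT INTO scans (finished_at) VALUES (?)",
                 [(None,), (100.0,), (None,)])
recovered = finalize_orphan_scans(conn, now=200.0)
```

Running this once at boot, before anything can start a scan, is what makes it safe to treat every unfinished row as dead.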
StyxX65
7bf589bf7a Update ZORAXY_SETUP.md 2026-06-22 09:21:08 +02:00
StyxX65
68076eba52 Show all open (unactioned) items by default, not just the last scan
The default results view loaded only the latest scan session (±300s
window), so items dropped out of sight once a newer scan started — and
a long scheduled scan could show little or nothing on browser open.

Add get_open_items(): every flagged item with no disposition (or status
'unreviewed') across all scans, deduped by id to the latest finished
scan. GET /api/db/flagged now serves it when no ?ref is given; ?ref=N
still loads a specific past session. Frontend loadHistorySession(null)
routes to a new loadOpenItems() loader. Rename the banner button to
"Open items" (da/de/en).

get_session_items() default is unchanged — export.py and
scan_scheduler.py still rely on latest-session for the current scan's
report/email.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:19:55 +02:00
StyxX65
67f66c8441 Document self-update system and related changes in CLAUDE.md
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 12:16:14 +02:00
StyxX65
8bb482925f Release 1.7.8
- CHANGELOG: cut the 1.7.8 release (dated 2026-06-16); reset Unreleased.
- VERSION: 1.7.7 -> 1.7.8.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 11:56:12 +02:00
StyxX65
f84c8516df Reliably restore last session on refresh after a server restart
The page-load restore was one-shot and bailed when a completed scan's
replayed scan_phase left a running flag set; sse_replay_done (the other
retry) only fires for a non-empty replay buffer, which is empty after a
restart — so refreshing post-update showed a blank grid despite the
results being in the DB. The watchdog now retries the restore on each
4s poll while nothing is shown and no scan runs, clearing stale flags
first. /api/scan/status also reports google_running separately so a
refresh during a live Google scan is no longer treated as idle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 11:53:07 +02:00
StyxX65
9fd1aa1f8a Manuals: describe new share-link create flow
After Create the form clears and the new link appears highlighted in
the Active links list, copied from there — not from a preview row.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:52:22 +02:00
StyxX65
da356fb310 Release 1.7.7
- CHANGELOG: cut the 1.7.7 release (dated 2026-06-15); reset Unreleased.
- VERSION: 1.7.6 -> 1.7.7.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:12:00 +02:00
StyxX65
bdba80e72d Remove stale link preview from share modal after create
The generated-link "Copy link:" row stayed visible after creating,
looking like the form hadn't reset — but the new link was already in
the Active links list with its own Copy button. Drop the redundant
preview row; on create, reset the form and briefly highlight the new
entry in the active list. Removes the now-dead shareNewLinkRow markup
and copyShareLink().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:11:03 +02:00
StyxX65
c26dd7d320 Add Zoraxy HTTPS setup guide, correct SECURITY.md bind address
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:20:33 +02:00
StyxX65
841311a6bd Release 1.7.6
- CHANGELOG: cut the 1.7.6 release (dated 2026-06-11); reset Unreleased.
- VERSION: 1.7.5 -> 1.7.6.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:02:44 +02:00
StyxX65
dd19be8bbf Close leaked listening socket on update restart
Werkzeug sets its server socket inheritable unconditionally, so the
os.execv restart carried it into the new process as a zombie listener:
one PID listening on both 5100 (never accepted) and 5101 (the real
server). Mark all fds above stderr close-on-exec before exec'ing so
the old socket dies and the new server rebinds the original port.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:01:17 +02:00
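The close-on-exec step in dd19be8bbf can be sketched with the standard library alone. mark_fds_cloexec and the max_fd bound are hypothetical; the project may enumerate descriptors differently.

```python
import os

def mark_fds_cloexec(max_fd: int = 1024) -> None:
    """Make every open descriptor above stderr non-inheritable."""
    for fd in range(3, max_fd):
        try:
            # On POSIX this sets FD_CLOEXEC, so the descriptor is closed
            # automatically when the process calls exec.
            os.set_inheritable(fd, False)
        except OSError:
            # Not an open descriptor; nothing to do.
            pass
```

In a restart path this would be called immediately before os.execv(sys.executable, [sys.executable, *sys.argv]), so the old listening socket does not survive into the new process.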
StyxX65
c43725ca7f Release 1.7.5
- CHANGELOG: cut the 1.7.5 release (dated 2026-06-11); reset Unreleased.
- VERSION: 1.7.4 -> 1.7.5.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 14:42:06 +02:00
StyxX65
a1712ae178 Make static files revalidate so the UI is fresh after updates
No Cache-Control header meant browsers cached JS/CSS heuristically for
days; after a server update (including the in-app self-update reload)
the backend was new but the frontend stayed stale.
SEND_FILE_MAX_AGE_DEFAULT=0 forces ETag revalidation: 304 when
unchanged, fresh file immediately after an update.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 14:39:45 +02:00
StyxX65
c1cddb8ea7 Release 1.7.4
- CHANGELOG: cut the 1.7.4 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.3 -> 1.7.4.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:33:16 +02:00
StyxX65
9cbd93e1f5 Reset all share modal fields after creating a link
Create only cleared the label; scope type, user email, date range, and
expiry carried over, so the next link silently inherited the previous
link's scope. Extracted openShareModal's reset logic into
_resetShareForm() and call it after every successful create.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:32:03 +02:00
StyxX65
d4cf2db347 Release 1.7.3
- CHANGELOG: cut the 1.7.3 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.2 -> 1.7.3.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:24:27 +02:00
StyxX65
d6bf80a68a Keep the same port across app restarts
The port probe did a plain bind() without SO_REUSEADDR, so TIME_WAIT
connections left by the previous instance (e.g. the in-app update
restart) made the port look occupied and the app hopped to the next
one. Probe with SO_REUSEADDR like Werkzeug binds, and give the
requested port a 10-second grace period before auto-incrementing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:23:18 +02:00
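A rough sketch of the probe-and-grace logic from d6bf80a68a follows. The names port_is_free and pick_port, the polling interval, and the max_hops limit are assumptions; the SO_REUSEADDR probe and the 10-second grace period come from the commit message. The behaviour shown is the Linux one; SO_REUSEADDR has different semantics on Windows.

```python
import socket
import time

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Probe a port with SO_REUSEADDR set, so TIME_WAIT leftovers don't count."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True

def pick_port(requested: int, grace: float = 10.0, max_hops: int = 20,
              host: str = "127.0.0.1") -> int:
    """Wait up to `grace` seconds for `requested`, then try the next ports."""
    deadline = time.monotonic() + grace
    while True:
        if port_is_free(requested, host):
            return requested
        if time.monotonic() >= deadline:
            break
        time.sleep(0.25)  # hypothetical polling interval
    # Grace period expired: fall back to auto-incrementing.
    for candidate in range(requested + 1, min(requested + 1 + max_hops, 65536)):
        if port_is_free(candidate, host):
            return candidate
    raise OSError(f"no free port found after {requested}")
```

The grace period covers the short window in which the previous instance is still shutting down, which a single probe would misread as the port being taken.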
StyxX65
679f91da2c Use page origin for share links except when browsing at localhost
The LAN-IP rewrite in _getShareBaseUrl() exists to fix unusable
127.0.0.1 links; applying it to every origin meant links copied behind
a reverse proxy pointed at http://<LAN-IP>:5100, bypassing TLS. HTTPS
and non-localhost origins are now used as-is.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:14:33 +02:00
StyxX65
c79e7097ea Release 1.7.2
- CHANGELOG: cut the 1.7.2 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.1 -> 1.7.2.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:10:59 +02:00
StyxX65
35e767b506 Fix copy buttons doing nothing over plain HTTP
navigator.clipboard is undefined in non-secure contexts, so the direct
writeText() call threw synchronously and the execCommand fallback in its
.catch() never ran. _copyText() now feature-detects the API, falls back
to execCommand('copy'), then to a prompt() for manual copying. log.js
reuses the helper; _getShareBaseUrl() caches the LAN-IP lookup so token
Copy buttons stay within the click gesture execCommand requires.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:09:34 +02:00
StyxX65
652031b31d Release 1.7.1
- CHANGELOG: cut the 1.7.1 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.0 -> 1.7.1.
- Manuals (DA + EN): bump version stamps; document the new
  Settings -> General -> Software update group (check/install/auto-update,
  git-checkout-only, self-restart, refused during scans).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:50:31 +02:00
StyxX65
df54b20735 Document software updates in README, refresh test suite table
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:47:58 +02:00
StyxX65
a325349ecd Fix stale ~/.gdpr_scanner_* paths in help text, docs, and UI strings
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:41:23 +02:00
StyxX65
6a4b0e1706 Show delta token source count, add hint bubble, fix README data paths
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:27:14 +02:00
StyxX65
c0e45df440 Add software update from Settings GUI and update_gdpr.sh script
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:54:29 +02:00
StyxX65
fcf32f3751 Release 1.7.0
- CHANGELOG: cut the 1.7.0 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.6.28 → 1.7.0.
- Manuals (DA + EN): bump version stamps; correct the redaction section
  (cards are now kept/greyed until the next scan, not removed) and add the
  same keep-until-next-scan note to the deletion section, including the
  partial-failure behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:06:36 +02:00
StyxX65
95f1f39a1f Keep data-subject-deleted cards in grid until next scan
Apply the keep-until-next-scan behaviour to deleteSubjectItems: mark the
deleted items _deleted (using deleted_ids from the response) and keep them
greyed in the grid instead of filtering them out. Also fixes a latent bug
where renderGrid() was called with no argument and threw on files.forEach,
which the surrounding try/catch swallowed as a false "Delete failed" after a
successful erasure.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:47:52 +02:00
StyxX65
386831c423 Keep bulk-deleted cards in grid until next scan
Extend the keep-until-next-scan behaviour to the bulk delete modal: instead
of removing matched cards on success, mark them _deleted and keep them greyed
with a "🗑 Deleted" badge and hidden buttons. /api/delete_bulk now returns
deleted_ids so the grid marks exactly the items the server actually deleted —
partial failures stay active and re-deletable. Already-handled (_deleted /
_redacted) items are excluded from the bulk-delete match set so they aren't
re-counted or re-processed.

201 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:46:14 +02:00
StyxX65
ed3c3a80d6 Keep deleted cards in grid until next scan
Mirror the redact behaviour for the card delete button (🗑): instead of
removing the card on success, mark the item _deleted and keep it in the grid
— greyed via card-resolved, shown with a red "🗑 Deleted" badge, action
buttons hidden so it can't be re-processed. The grid is rebuilt on the next
scan run, clearing the markers. results.js only — no server change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:44:10 +02:00
StyxX65
7c1c2b390d Keep selected card in view when opening preview
Opening the preview panel narrows .grid-area and reflows the auto-fill grid
to fewer columns, moving the clicked card to a new row. The single-frame
scrollIntoView ran while the browser's scroll-anchoring re-adjusted scrollTop
mid-reflow, so the card scrolled out of view. Disable scroll anchoring on
.grid-area (overflow-anchor:none) and defer the scroll by two animation
frames against the settled layout, centring the card (block:'center').

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:35:04 +02:00
StyxX65
d82a0d6004 Keep redacted cards in grid until next scan
Redacting a card (✏) previously removed it from the grid and from
S.flaggedData/S.filteredData immediately. Now the item is marked _redacted
and kept: greyed via card-resolved styling, shown with a "✏ Redacted" badge,
and its delete/redact buttons hidden so it can't be re-processed. The grid is
rebuilt on the next scan run, which clears the markers. results.js only — no
server change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:30:41 +02:00
StyxX65
1b3d7f5698 Fix card action buttons clipped in grid view (missing position:relative)
The real cause behind the invisible redact/delete buttons: .card lacked
position:relative, so the position:absolute action buttons (delete, redact)
and the bulk-select checkbox anchored to the viewport instead of the card
and were clipped by .card overflow:hidden. They only showed in list view,
where those elements are position:static. Add position:relative to .card so
all three position within each card. Keep the 0.35 baseline opacity on the
redact button for discoverability.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:24:00 +02:00
StyxX65
39500edfbc Changelog: note redact button visibility fix
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:21:37 +02:00
StyxX65
35fd00437f Fix redact button invisible in grid view
.card-redact-btn had opacity:0 at rest (only opacity:1 on .card:hover), so
the ✏ redact button was completely invisible in the default grid/thumbnail
view — it only showed in list view, which forces opacity:1. Give it the same
0.35 baseline opacity as .card-delete-btn so it's discoverable at rest and
brightens on hover. The button was always rendered in the DOM; this is a
pure visibility fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:20:06 +02:00
StyxX65
c39d68ca19 Document XSS escaping + secret-encryption hardening
- CHANGELOG: add Unreleased ### Security section covering the stored XSS
  in the results grid, the reflected XSS in /api/thumb, and the Claude API
  key now being encrypted at rest.
- CLAUDE.md / static/js/CLAUDE.md: add the esc() / _html_esc escaping rule
  for scan-derived strings and the onclick-JSON &quot; pattern.
- CLAUDE.md / routes/CLAUDE.md: note that secret config fields use the
  machine-keyed Fernet and must be read via a decrypting accessor
  (get_claude_api_key()), never config.json directly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:15:39 +02:00
StyxX65
b6d2915d49 Harden XSS escaping and encrypt Claude API key at rest
- results.js: add esc() helper and apply to all scan-derived fields
  (name, account_name, folder, source, modified, label, img alt) across
  card/list/preview/subject-lookup/related views. Scan-derived strings can
  carry attacker-controlled markup (e.g. a OneDrive file named with HTML),
  so they must be escaped before innerHTML/attribute embedding. Also escape
  the related-docs onclick JSON to match the delete/redact &quot; pattern.
- cpr_detector._placeholder_svg: escape label/name before embedding — served
  as image/svg+xml via /api/thumb?name=, so an unescaped value was a
  reflected-XSS vector when the URL is opened directly.
- cpr_detector: remove 44-line unreachable duplicate of the face-detection
  body left inside _extract_audio_metadata after its return.
- app_config: encrypt claude_api_key at rest with the machine-keyed Fernet
  (same as the SMTP password); add get_claude_api_key() for decryption.
  Legacy plaintext keys still read and are re-encrypted on next save.
  Update readers in document_scanner.py and routes/app_routes.py.

201 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:06:36 +02:00
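To illustrate the placeholder-SVG fix in b6d2915d49, here is a minimal sketch using html.escape. The function name placeholder_svg and the SVG markup are invented for the example; the real _placeholder_svg may build a different image.

```python
from html import escape

def placeholder_svg(label: str, name: str) -> str:
    """Build a tiny placeholder thumbnail with both strings escaped."""
    # Scan-derived strings may contain markup, so escape &, <, > and quotes
    # before embedding them in a document served as image/svg+xml.
    safe_label = escape(label, quote=True)
    safe_name = escape(name, quote=True)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="90">'
        f"<title>{safe_name}</title>"
        '<rect width="120" height="90" fill="#ddd"/>'
        f'<text x="60" y="50" text-anchor="middle">{safe_label}</text>'
        "</svg>"
    )
```

With this in place, a file name such as `<script>alert(1)</script>.pdf` is rendered as text instead of being parsed as an element.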
StyxX65
1903115e02 CLAUDE.md restructured 2026-06-08 14:44:37 +02:00
StyxX65
f845a2f686 Fixed: cards not shown after browser refresh
When the browser reconnected to the SSE stream after a completed scan, the scan_phase events in the replay buffer temporarily set S._m365ScanRunning = true (all running flags start at false after a page reload). The watchdog's loadHistorySession call fired in this window and bailed on the stale flag; once scan_done cleared the flag, _initialStatusChecked was already true, so loadHistorySession was never retried. Fixed by having the sse_replay_done handler retry loadHistorySession(null) when no scan is running and S._historyRefScanId is still null after replay.
2026-06-08 14:28:24 +02:00
StyxX65
79e589b525 Bugfix in Scheduler 2026-06-04 14:47:01 +02:00
StyxX65
fa6601ffdd Bugfixes 2026-06-01 15:15:43 +02:00
StyxX65
4e5a8934d7 Fix Google scan not stopping cleanly before a new scan starts 2026-05-29 04:53:42 +02:00
StyxX65
66986a16f9 Extended in-place CPR redaction to Google Drive, SFTP, SMB, and local PDFs, then updated CLAUDE.md and both manuals. Everything is committed and all 201 tests pass. 2026-05-28 17:53:53 +02:00
StyxX65
034ced943e Extended document redaction to Google Drive, SFTP, SMB, and local PDFs Extends the ✂ in-place redaction feature beyond local DOCX/XLSX/CSV/TXT files to cover all remaining file source types and adds PDF support for local files. 2026-05-28 17:47:02 +02:00
StyxX65
6ce7583b26 Added NER/AI integration 2026-05-28 11:50:10 +02:00
StyxX65
6e0dc8ee92 Minor changes to layout in Manuals 2026-05-28 11:23:20 +02:00
StyxX65
26c45165b9 v1.6.28 — Scheduled report-only jobs, compliance audit log, and documentation update
- Scheduled jobs can now run in report-only mode (skip scan, email latest DB results)
- Compliance audit log records all significant admin actions in an immutable DB table
- VERSION bumped to 1.6.28; CHANGELOG [Unreleased] sealed as [1.6.28] — 2026-05-28
- Both manuals updated: CPR-only mode, OCR language, file redaction, related documents,
  date-range token scoping, report-only jobs, audit log tab, two new FAQ entries
- TODO.md updated with all completed tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:08:52 +02:00
StyxX65
744813f4ac Add compliance audit log
Immutable audit_log table in the scanner DB records every significant
admin action (profile save/delete, token create/revoke, PIN changes,
source add/update/delete, scheduler job changes, scan start/stop, SMTP
save, dispositions, item delete/redact). GET /api/audit_log exposes
entries newest-first. New Audit Log tab in the Settings modal renders
the table on demand. Settings modal widened 540→640 px and tab labels
set to white-space:nowrap so the six-tab row fits on one line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:51:23 +02:00
StyxX65
4ef2dfb352 Date-range scoping for viewer tokens 2026-05-28 10:34:55 +02:00
StyxX65
c820d6f6db Two bugs in the abort mechanism
1. POST /api/scan/stop only set state._scan_abort (M365/file abort event) but never touched state._google_scan_abort. Now sets both.
2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort (= state._scan_abort, the M365 event) instead of using the module-level _scan_abort alias (= state._google_scan_abort). This meant the dedicated /api/google/scan/cancel endpoint, which correctly sets _google_scan_abort, was silently ignored by the scan loop. Fixed to use the module-level alias consistently. Also aligned the end-of-scan checkpoint-clear check.
2026-05-28 10:20:22 +02:00
StyxX65
7ffd8370f4 Fix Stop button not halting Google Workspace scan
Two bugs in the abort mechanism:

1. POST /api/scan/stop only set state._scan_abort (M365/file abort event)
   but never touched state._google_scan_abort. Now sets both.

2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort
   (= state._scan_abort, the M365 event) instead of using the module-level
   _scan_abort alias (= state._google_scan_abort). This meant the dedicated
   /api/google/scan/cancel endpoint — which correctly sets _google_scan_abort
   — was silently ignored by the scan loop. Fixed to use the module-level
   alias consistently. Also aligned the end-of-scan checkpoint-clear check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:19:54 +02:00
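The two abort bugs in 7ffd8370f4 come down to which threading.Event each piece of code touches. The sketch below uses hypothetical stand-ins (scan_abort, google_scan_abort, stop_all_scans, run_google_scan) rather than the project's actual module layout.

```python
import threading

# Hypothetical stand-ins for the two module-level abort events.
scan_abort = threading.Event()         # M365 / file scans
google_scan_abort = threading.Event()  # Google Workspace scans

def stop_all_scans() -> None:
    """What a shared Stop endpoint needs to do: signal every engine."""
    scan_abort.set()
    google_scan_abort.set()

def run_google_scan(items, abort: threading.Event = google_scan_abort) -> list:
    """Process items until the Google-specific event is set."""
    done = []
    for item in items:
        # Check the engine's own event, not the M365 one.
        if abort.is_set():
            break
        done.append(item)
    return done
```

Passing the event in as a parameter makes it harder to check the wrong one by accident, which is the mistake the second bug describes.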
StyxX65
2c5f5d3283 Add OCR language override setting
Operators can now choose Tesseract language pack(s) per profile via a
sidebar select (#optOcrLang) and profile editor (#peOptOcrLang). Presets:
dan+eng (default), dan, eng, dan+eng+deu, dan+eng+swe, dan+eng+fra. The
ocr_lang option flows from the UI through all three scan engines (M365
files/attachments, Google Drive, Gmail) down to document_scanner.scan_pdf
and scan_image — including the spawned PDF-OCR subprocess worker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:59:40 +02:00
StyxX65
23b9555dcf Built-in file redaction for local files 2026-05-27 14:49:06 +02:00
StyxX65
c490b3d76a Merge remote CHANGELOG entries and add Preview section to CLAUDE.md
Resolved conflict in CHANGELOG.md: combined the two bug fixes from the
remote branch (stale history results, selected card scroll) with the
local Gmail/Drive preview fix under a single [1.6.26] — 2026-04-29 entry.
Added Preview dispatch documentation to CLAUDE.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:43:59 +02:00
StyxX65
051a53ae85 Update CHANGELOG.md 2026-05-27 13:40:21 +02:00
Henrik Højmark
99157e6fd7
Update CHANGELOG for version 1.6.26
Updated release date for version 1.6.26 and added detailed fixes related to scan history, card visibility, and Google Drive/Gmail previews.
2026-05-27 13:38:40 +02:00
StyxX65
78fb406422 Fixed two bugs: selected cards staying visible after preview opens, and stale history results showing when a new scan starts. 2026-04-29 15:18:58 +02:00
StyxX65
a76df463e8 Changelog updated 2026-04-27 18:47:43 +02:00
StyxX65
ce5a5f1cbb Fixed Gmail and Google Drive preview: items were being sent to the Microsoft Graph API instead of handled correctly. 2026-04-26 11:04:05 +02:00
StyxX65
d84e57239a Add CPR cross-referencing (related documents)
Clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel,
  listing other items from the same scan session that share at least one CPR number. Items are ordered by number of
  shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and scan history mode.

  Implementation
  - GDPRDb.get_related_items() — SQL self-join on the existing cpr_index table using the same symmetric 300 s session
  window as get_session_items. No new data collection needed.
  - GET /api/db/related/<item_id>?ref=N — new endpoint in routes/database.py, consistent with the ?ref convention used
   by /api/db/flagged.
  - #previewRelated div injected between the metadata block and disposition row in the preview panel.
  - _loadRelated(f) in results.js fetches and renders the list; window._openRelated() resolves items from the live
  grid or falls back to the API response for history-mode items.

  Also
  - Added keyword/FTS5 search as a deferred idea in SUGGESTIONS.md
  - Updated CHANGELOG.md, README.md, and CLAUDE.md
2026-04-25 21:15:50 +02:00
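The self-join behind get_related_items() in d84e57239a could look something like this. The cpr_index columns (item_id, cpr) are assumptions, the values are placeholder strings rather than real CPR numbers, and the 300 s session window is left out to keep the sketch short.

```python
import sqlite3

RELATED_SQL = """
SELECT other.item_id, COUNT(DISTINCT other.cpr) AS shared
FROM cpr_index AS mine
JOIN cpr_index AS other
  ON other.cpr = mine.cpr AND other.item_id <> mine.item_id
WHERE mine.item_id = ?
GROUP BY other.item_id
ORDER BY shared DESC, other.item_id
"""

def get_related_items(conn: sqlite3.Connection, item_id: str) -> list:
    """Return (item_id, shared_count) rows, most shared CPRs first."""
    return conn.execute(RELATED_SQL, (item_id,)).fetchall()

# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr TEXT)")
conn.executemany("INSERT INTO cpr_index VALUES (?, ?)", [
    ("a", "h1"), ("a", "h2"),
    ("b", "h1"), ("b", "h2"),
    ("c", "h2"),
    ("d", "h9"),
])
related = get_related_items(conn, "a")  # [("b", 2), ("c", 1)]
```

Item d shares nothing with a and is therefore absent, while b outranks c because it shares two values instead of one.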
StyxX65
8b55e9d933 Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items. 2026-04-25 20:30:59 +02:00
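A checkpoint/resume loop of the kind 8b55e9d933 describes can be sketched as follows. The file layout ({"done": [...]}), the function names, and the atomic write via os.replace are assumptions; only the 25-item interval comes from the commit message.

```python
import json
import os

CHECKPOINT_EVERY = 25  # interval named in the commit message

def save_checkpoint(path: str, done_ids: list) -> None:
    """Write the checkpoint atomically so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"done": done_ids}, fh)
    os.replace(tmp, path)

def scan(items, path: str, process=lambda item: None) -> list:
    """Process items, skipping any recorded in an existing checkpoint."""
    done = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            done = json.load(fh)["done"]
    seen = set(done)
    for item in items:
        if item in seen:
            continue  # already handled before the interruption
        process(item)
        done.append(item)
        if len(done) % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, done)
    return done
```

An interrupted run loses at most the items processed since the last checkpoint; those are simply processed again on resume.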
StyxX65
2254e00481 Added email and phone number detection as opt-in scan options across all three engines, plus translation fixes. Both CHANGELOG and SUGGESTIONS are updated; everything is committed and ready to test. 2026-04-25 19:33:28 +02:00
StyxX65
56a744d896 Fixed missing translation in Sources 2026-04-25 10:57:41 +02:00
StyxX65
9da4403bdf Update VERSION 2026-04-25 08:51:28 +02:00
StyxX65
e35bbe78a5 Added SFTP to sources 2026-04-25 08:48:54 +02:00
StyxX65
360eb1caed Bugfixes in media detection 2026-04-21 21:42:54 +02:00
StyxX65
d42518dc81 Added tests for Video & Audio
feat: video/audio metadata scanning, profile rename fix, route tests

  - Scan .mp4/.mov/.avi/.mkv and .mp3/.flac/.ogg/.m4a/.wma (+ 7 more)
    for GPS coordinates, artist/author, title, comment — metadata only,
    no frame or audio analysis. Uses mutagen (added to requirements.txt).
    GPS-tagged phone recordings now flag with gps_location like photos.

  - Fix _extract_audio_metadata silently returning empty results:
    mutagen.File() first positional arg is `filename`, not `fileobj` —
    was passing BytesIO as the filename. Fixed to keyword args.

  - Fix profile copy rename not reflected in left column until modal
    reopen: _pmgmtSaveFullEdit called loadProfiles() but never
    _renderProfileMgmt(). Added re-render and active-row highlight.

  - Add TestProfileRoutes (10 tests) covering all profile API endpoints
    including a rename regression test. Total: 182 tests.

  - generate_fixtures.py now produces 6 audio/video fixtures (14–19):
    2 MP3, 2 FLAC, 2 MP4 — 4 flagged, 2 negative cases.
2026-04-21 21:26:58 +02:00
StyxX65
2a2d79de90 Added testing of Profile 2026-04-21 20:51:37 +02:00
StyxX65
f7f1194d63 Fix: Profile copy rename not reflected in left column until modal reopen 2026-04-21 20:33:16 +02:00
StyxX65
08d811b329 Update README.md 2026-04-21 18:53:15 +02:00
StyxX65
f3a4c60136 Delete GDPR_ERRORLOG.md 2026-04-21 18:48:02 +02:00
StyxX65
c350014b16 fix: scan button stuck, CPR dedup crash, role scope filter, profile race conditions; add auto-email toggle and route integration tests 2026-04-21 18:43:25 +02:00
StyxX65
7c1afca80b Bugfixes
fix: select mode onclick exports, multi-source progress counter, OCR page-by-page
2026-04-21 13:12:54 +02:00
StyxX65
d8083eb0c0 feat: interface PIN, bulk disposition tagging, Google Drive delta scan, OCR memory fixes
- Interface PIN: optional session-level auth gate for the main scanner UI
  (Settings → Security → Interface PIN). Salted SHA-256 in config.json,
  rate-limited (5 attempts/5 min per IP). /view and viewer auth exempt.
  New /login page, before_request hook, GET/POST/DELETE /api/interface/pin,
  POST /api/interface/pin/verify, POST /api/interface/logout.

- Bulk disposition tagging: Select mode (filter bar "Vælg" button) reveals
  per-card checkboxes. Bulk tag bar at bottom of grid; POST /api/db/disposition/bulk.
  Disposition stats bar (total · unreviewed · retain · delete · % reviewed)
  updates after every save.

- Google Drive delta scan: uses Drive Changes API when delta is enabled.
  Per-user token stored as gdrive:{email} in delta.json. Load-then-merge
  save avoids racing with concurrent M365 token writes.

- PDF OCR OOM fix: render one page at a time with convert_from_path
  (first_page=N, last_page=N). Added _ocr_mem_ok() psutil guard (500 MB
  threshold) before each page render across scan_pdf, redact_fitz_pdf,
  redact_pdf.

- Email test message translation fix: routes/email.py returns structured
  {ok, method, recipients} instead of a hardcoded English string;
  scheduler.js builds the translated message client-side.

- Docs: CHANGELOG, README, TODO, MANUAL-EN, MANUAL-DA all updated.
  Lang files (en/da/de) extended with bulk, interface PIN, and SMTP keys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 18:46:45 +02:00
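
The Interface PIN item in d8083eb0c0 (salted SHA-256, 5 attempts per 5 minutes per IP) can be illustrated with a minimal sketch. This is not the project's code: `hash_pin`, `verify_pin`, and `AttemptLimiter` are hypothetical names, and the way the salt and PIN are combined is an assumption.

```python
import hashlib
import hmac
import secrets
import time
from collections import defaultdict, deque

MAX_ATTEMPTS = 5          # from the commit message: 5 attempts ...
WINDOW_SECONDS = 5 * 60   # ... per 5 minutes, per IP


def hash_pin(pin, salt=None):
    """Return (salt, hex digest). Salt-then-PIN concatenation is an assumption."""
    salt = salt or secrets.token_hex(16)
    digest = hashlib.sha256((salt + pin).encode("utf-8")).hexdigest()
    return salt, digest


def verify_pin(pin, salt, expected_digest):
    """Recompute the digest and compare in constant time."""
    _, digest = hash_pin(pin, salt)
    return hmac.compare_digest(digest, expected_digest)


class AttemptLimiter:
    """Sliding-window limiter keyed by client IP (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._attempts = defaultdict(deque)

    def allow(self, ip):
        now = self._clock()
        window = self._attempts[ip]
        # Forget attempts that have aged out of the window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_ATTEMPTS:
            return False
        window.append(now)
        return True
```

A verify route would call `allow(ip)` before `verify_pin(...)` so that rejected attempts also count toward the limit.
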
StyxX65
b2bfa40f27 v1.6.20 — Scan history, user-scoped sharing, export fixes, email fixes
New features

  Scan history browser
  Results from any past scan session can now be reviewed without running a new scan. On page load the latest
  completed session is loaded automatically. A Sessions button opens a picker listing all past sessions with
  date, sources, item count, and Delta/Latest badges. All filters, exports, and disposition tagging work
  normally in history mode. Starting a new scan exits history mode.

  User-scoped viewer tokens (#34)
  Viewer token links can now be restricted to a specific employee so they only see their own flagged files —
  across both M365 and Google Workspace. The Share modal's scope selector gains a User option with a searchable
  name autocomplete. Selecting a person stores both their M365 and GWS email addresses; the server filters by
  account_id IN (list) so items from either platform are included. The viewer header shows the person's full
  name in a locked identity badge.

  ---
  Bug fixes

  GWS and local/SMB results missing from exports
  Two silent failures caused Google Workspace and file-scan results to disappear from Art.30 and Excel exports
  after a page reload:
  - google_scan.py called _db.end_scan() (method doesn't exist — should be finish_scan), so GWS scan records
  never got finished_at set and were permanently excluded from get_session_items()
  - google_scan.py emitted scan_done instead of google_scan_done, breaking SSE teardown logic
  - File scan called begin_scan() with keyword arguments it doesn't accept, silently leaving _db_scan_id = None
  so local/SMB items were never written to the database

  Graph sendMail reported as failure despite email being delivered
  _post() called r.json() unconditionally. Graph's sendMail returns HTTP 202 with no body on success, causing a
  JSONDecodeError that was caught and reported as a send failure. Fixed with r.json() if r.content else {}.

  Graph error hidden by generic SMTP message
  When Graph failed and no SMTP host was saved, the real Graph error was swallowed by "No SMTP host
  configured". The error is now surfaced directly.

  Gmail vs Google Workspace SMTP errors
  Auth failure messages now distinguish between personal Gmail (@gmail.com) and Google Workspace custom-domain
  accounts. Workspace errors point to the admin console (SMTP relay, 2-Step Verification policy) rather than
  the user's personal security settings.
2026-04-18 13:59:27 +02:00
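
The sendMail fix in b2bfa40f27 comes down to decoding JSON only when a body exists. The sketch below reproduces the failure and the fix with a stand-in response class; `FakeResponse` and `parse_body` are hypothetical names, and only the expression `r.json() if r.content else {}` is taken from the commit message.

```python
import json


class FakeResponse:
    """Stand-in for an HTTP response; only the attributes used below."""

    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content

    def json(self):
        # Raises json.JSONDecodeError on an empty body, like the original bug.
        return json.loads(self.content)


def parse_body(r):
    # sendMail answers 202 with no body on success, so decode only
    # when there is something to decode.
    return r.json() if r.content else {}
```
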
StyxX65
c9aab19a97 feat: scan history browser, user-scoped viewer tokens, export fixes, email fixes (v1.6.20)
- Scan history browser (history.js, GET /api/db/sessions, get_sessions(),
  get_session_items(ref_scan_id)) — review any past session without rescanning
- User-scoped viewer tokens (#34) — scope by individual employee across M365
  and GWS; autocomplete from Accounts list; dual-email support
- Fix: GWS scan never marked finished (end_scan → finish_scan) and emitted
  wrong SSE event (scan_done → google_scan_done), excluding GWS items from all
  exports
- Fix: file scan begin_scan called with wrong keyword args (TypeError swallowed),
  so local/SMB items were never written to DB
- Fix: Graph sendMail reported failure on success — _post() now returns {} on
  empty 202 response instead of raising JSONDecodeError
- Fix: Graph error hidden behind generic "No SMTP host" message when both Graph
  and SMTP were unavailable
- Fix: Gmail vs Google Workspace SMTP error messages distinguished by username
  domain; Workspace errors point to admin console, not personal security settings
- Docs: update README, MANUAL-EN, MANUAL-DA, CLAUDE.md, TODO.md, CHANGELOG.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 13:57:54 +02:00
StyxX65
e64d7eb958 Update DEPENDENCIES.md 2026-04-12 14:53:07 +02:00
StyxX65
9c38188bb4 Update CONTRIBUTING.md 2026-04-12 14:49:28 +02:00
StyxX65
854f862bd1 Update README.md 2026-04-12 14:29:01 +02:00
StyxX65
d542357855 docs: add #34 user-scoped viewer tokens, remove SUGGESTIONS.md
- CLAUDE.md: document planned user-scoped token scope (account_id filter)
- TODO.md: add #34 spec, drop stale SUGGESTIONS.md reference
- SUGGESTIONS.md: deleted — fully superseded by TODO.md + CLAUDE.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:28:32 +02:00
StyxX65
4dfbae49a4 fix: suppress OneDrive 404 errors during delta scans as non-provisioned
Add M365DriveNotFound(M365Error) exception raised by _get() on HTTP 404.
Catch it explicitly in _scan_user_onedrive before the generic handler,
broadcasting a scan_phase ("not provisioned — skipped") instead of a red
scan_error card. Full-scan path is unaffected (bare except Exception: return
in _iter_drive_folder_for already silenced the same 404).

Root cause: _get() fell through to raise_for_status() on 404, caught by
the generic except Exception handler and broadcast as scan_error. The
asymmetry with full scans (which silently skipped 404s) was confusing.

Common causes of OneDrive 404: no licence assigned, service plan disabled,
drive never provisioned (account never signed in), account suspended.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:05:59 +02:00
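
The exception ordering described in 4dfbae49a4 can be sketched as follows. Only the class names `M365Error` and `M365DriveNotFound` and the event names come from the commit message; the function bodies, the integer `status_code` argument, and the shortened message text are simplifications made for illustration.

```python
class M365Error(Exception):
    """Base error for Graph calls."""


class M365DriveNotFound(M365Error):
    """Raised when Graph answers 404 for a user's drive."""


def _get(status_code):
    # Simplified: the real _get() performs an HTTP request; here the
    # status code is passed in directly so the control flow is visible.
    if status_code == 404:
        raise M365DriveNotFound("drive not found")
    if status_code >= 400:
        raise M365Error(f"HTTP {status_code}")
    return {"value": []}


def scan_user_onedrive(status_code, events):
    try:
        _get(status_code)
    except M365DriveNotFound:
        # Must come before the generic handler, otherwise the 404 is
        # reported as a red scan_error card.
        events.append(("scan_phase", "not provisioned"))
    except Exception as exc:
        events.append(("scan_error", str(exc)))
    else:
        events.append(("scan_phase", "scanned"))
    return events
```
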
StyxX65
1aaf400771 feat: role-scoped viewer tokens — restrict shared links to student or staff items
Add a Role scope dropdown to the Share modal (All roles / Ansatte / Elever).
Scope is stored as {"role": "student"|"staff"} in viewer_tokens.json and
enforced server-side in GET /api/db/flagged via session["viewer_scope"].
Client-side, #filterRole is pre-set and hidden for scoped viewers so the
constraint cannot be bypassed. Existing tokens and PIN sessions remain
unrestricted. Role badge shown on each scoped token row in the Active links list.

Files: app_config.py, routes/viewer.py, routes/database.py, gdpr_scanner.py,
templates/index.html, static/js/viewer.js, static/js/auth.js,
lang/en.json, lang/da.json, lang/de.json,
CLAUDE.md, CHANGELOG.md, README.md, MANUAL-EN.md, MANUAL-DA.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 09:30:38 +02:00
StyxX65
0c35a7a83d feat: role filter in results grid + role-scoped Excel and Art.30 exports
- New Role dropdown in filter bar (All / Ansatte / Elever) — filters the
  results grid client-side via applyFilters() and clearFilters().

- Exports respect the active role: exportExcel() and exportArticle30()
  append ?role=student|staff to the fetch URL when a role is selected.

- _build_excel_bytes(role='') and _build_article30_docx(role='') filter
  to a local _items list at the top; all internal sheets (Summary, GPS,
  External transfers, Art.30 staff/student tables) see only the filtered
  subset. Filenames get _elever or _ansatte suffix.

- i18n: m365_filter_all_roles / m365_filter_staff / m365_filter_student
  added to en/da/de.json.

- CLAUDE.md, README.md, CHANGELOG.md, MANUAL-EN.md, MANUAL-DA.md updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 09:02:52 +02:00
StyxX65
28c9effd17 feat: student scan filters — skip GPS images and min CPR threshold
New profile options to reduce noise when scanning student accounts:

- skip_gps_images: images flagged solely by GPS coordinates are suppressed.
  GPS data is still extracted and shown in the detail card when the item
  is flagged by another signal (faces, EXIF author/comment).

- min_cpr_count (default 1): only flag a file if it contains at least N
  distinct CPR numbers. Deduplication is by value. Faces and EXIF PII
  still trigger flags regardless of CPR count.

Both options apply to M365, Google, and file scan paths. Saved in profiles
and editable in the Profile Manager editor. Docs, manuals, i18n (DA/EN/DE),
CHANGELOG, and VERSION (1.6.14 → 1.6.15) updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 08:48:12 +02:00
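
The `min_cpr_count` rule from 28c9effd17 (count distinct values; faces and EXIF PII still flag regardless) can be sketched as a small predicate. `should_flag` and its parameters are hypothetical, and the CPR-like strings in the usage note are made up.

```python
def should_flag(cpr_hits, min_cpr_count=1, has_faces=False, has_exif_pii=False):
    """Decide whether a file is flagged under the min_cpr_count option."""
    # Faces and EXIF PII trigger a flag regardless of the CPR count.
    if has_faces or has_exif_pii:
        return True
    # Deduplicate by value: the same number repeated counts once.
    return len(set(cpr_hits)) >= min_cpr_count
```

With `min_cpr_count=2`, a file that repeats `"010190-1234"` three times is not flagged, while a file containing both `"010190-1234"` and `"020290-5678"` is.
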
StyxX65
dfdc46c812 Update build.yml
Remove crud from builds
2026-04-12 08:01:04 +02:00
StyxX65
6e0aab788a Fix: macOS runner, scan hang, export sources, profile role filter/badge 2026-04-12 07:48:26 +02:00
StyxX65
9e940cd60a Update build.yml 2026-04-11 10:38:20 +02:00
StyxX65
c83d9c8ed5 Docs: update CHANGELOG and README for macOS CI build + Windows artifact fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:34:20 +02:00
StyxX65
1764e784dc CI: fix Windows artifact — zip onedir output instead of globbing dist/*.exe
PyInstaller --onedir puts the exe inside dist/GDPRScanner/, so dist/*.exe
never matched. Add a PowerShell packaging step that zips the directory,
mirroring the Linux step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:22:14 +02:00
88 changed files with 9707 additions and 2180 deletions

@@ -1,10 +1,21 @@
name: Build — Windows & Linux
name: Build — Windows, Linux & macOS
# Trigger on every push to main, on version tags, or manually
on:
push:
branches: [main]
tags: ['v*']
paths-ignore:
- '**.md'
- 'docs/**'
- 'tests/**'
- 'pytest.ini'
- 'run_tests.sh'
- 'build_gdpr.sh'
- 'start_gdpr.sh'
- 'install_macos.sh'
- 'install_windows.ps1'
- '.github/ISSUE_TEMPLATE/**'
workflow_dispatch:
# Only run one build at a time per branch to avoid race conditions
@@ -22,10 +33,10 @@ jobs:
include:
- os: windows-latest
name: windows
artifact_glob: "dist/*.exe"
- os: ubuntu-22.04
name: linux
artifact_glob: "dist/GDPRScanner"
- os: macos-15
name: macos
runs-on: ${{ matrix.os }}
name: GDPRScanner / ${{ matrix.name }}
@@ -58,6 +69,11 @@ jobs:
Xvfb :99 -screen 0 1024x768x24 &
echo "DISPLAY=:99" >> $GITHUB_ENV
- name: Install macOS system dependencies
if: runner.os == 'macOS'
run: |
brew install tesseract tesseract-lang poppler
- name: Install Python dependencies
run: |
python -m pip install --upgrade pip
@@ -78,14 +94,27 @@ jobs:
cd dist
zip -r "GDPRScanner_linux_x86_64.zip" "GDPRScanner"
- name: Package Windows binary
if: runner.os == 'Windows'
shell: pwsh
run: |
Compress-Archive -Path dist\GDPRScanner -DestinationPath dist\GDPRScanner_windows_x64.zip
- name: Package macOS binary
if: runner.os == 'macOS'
run: |
cd dist
zip -r "GDPRScanner_macos_arm64.zip" "GDPRScanner.app"
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: M365Scanner-${{ matrix.name }}
retention-days: 30
path: |
dist/*.exe
dist/GDPRScanner_linux_x86_64.zip
dist/GDPRScanner_windows_x64.zip
dist/GDPRScanner_macos_arm64.zip
# ── Release ───────────────────────────────────────────────────────────────
# • version tag (v*) → proper versioned release with generated notes

@@ -9,18 +9,380 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
## [Unreleased]
---
## [1.7.9] — 2026-06-22
### Added
- **GitHub Actions CI/CD** — automated build workflow (`.github/workflows/build.yml`) builds Windows `.exe` and Linux binary on every push to `main`. Creates a GitHub Release with artifacts when a `v*` tag is pushed.
- **`EFFORT_ESTIMATE.md`** — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
- **Settings → Security tab** — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's **Configure** button navigates directly to the Security tab.
- **Viewer mode layout** — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The `🔍 GDPRScanner` brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
- **"Always send via SMTP" option for email reports** — new toggle in **Settings → E-mailrapport**. When the scanner is signed in to Microsoft 365 it normally sends email through Microsoft Graph; Graph reports "accepted" the instant a message is queued, which hides the case where Exchange Online later silently drops it (e.g. a recipient on a Google-hosted subdomain of your Microsoft 365 domain — the message is treated as internal, finds no mailbox, and is discarded, with no delivery and no bounce). Enabling this option makes the manual report, the test email, and the after-scan auto-email all go straight through your configured SMTP server (e.g. Google Workspace `smtp.gmail.com` / `smtp-relay.gmail.com`), bypassing the Graph routing entirely.
### Changed
- **The results grid now shows every open item by default, not just the last scan** — when you open the app (or refresh after a scheduled or manual scan), the grid loads *all* flagged items that still need action — i.e. those with no disposition — across every scan, instead of only the most recent scan session. Items you have already tagged (kept, redacted, deleted, false positive, …) drop out of the view. Re-scans are de-duplicated so each item appears once, showing its most recent state. The session picker still loads any individual past scan, and the history banner button (formerly "Latest scan") is now **"Open items"** and returns to this default view.
### Fixed
- **Interrupted scans no longer lose their results** — a scan only became visible once it was *finalised*, but the Microsoft 365 and Google scan engines skipped finalisation when a scan was stopped, and any scan cut short by a server restart, crash, or out-of-memory kill never finalised at all. Its already-found items were then stranded in the database and invisible in the grid (this is what caused "scan finished but no results shown", especially after the in-app self-update restarts). Unfinished scans are now finalised automatically on startup (nothing is scanning at boot, so any unfinished scan is known to be dead), and a manually stopped Microsoft 365 scan finalises immediately so its partial results stay visible.
- **User and group badges were missing on result cards loaded from the database** — the reviewer's display name was shown live during a scan but never saved, so cards loaded from a past scan (now the default view) lost both the person badge and the Elev/Ansat group badge. The display name is now stored with each item, and the group badge is shown from the saved role even for older items that predate this fix (where a name can't be recovered, the group badge and a resolved e-mail still appear).
- **Email reports sent via SMTP failed with "authentication failed"** — the **Settings → E-mailrapport** tab saved the SMTP username under the wrong field name, so the username never reached the mail server and sign-in was skipped — the server then rejected the unauthenticated message, which surfaced as a misleading authentication error even with a correct password or app password. The setting is now saved correctly, and configurations saved before the fix are migrated automatically.
---
## [1.7.8] — 2026-06-16
### Fixed
- **Blank results grid after a browser refresh (especially after a server restart)** — restoring the last scan session on page load was one-shot: `_sseWatchdog()` called `loadHistorySession(null)` a single time, guarded by `_initialStatusChecked`. If that attempt was blocked — a completed scan's replayed `scan_phase` event leaves a `_*ScanRunning` flag set, and the `loadHistorySession` guard then bails — nothing retried, because `sse_replay_done` (the other retry path) only fires when the SSE replay buffer is non-empty, and the buffer is empty after a server restart (so refreshing after the in-app self-update reliably showed an empty grid even though the results were in the database). The watchdog now re-attempts the restore on every 4-second poll while nothing is shown and no scan is running, clearing stale running flags first (both scan locks are confirmed free at that point). Additionally, `/api/scan/status` now reports `google_running` separately from `running` (which only ever reflected the M365 + file lock), so a refresh during a live Google scan is detected instead of being treated as idle.
---
## [1.7.7] — 2026-06-15
### Changed
- **Share modal no longer leaves a stale link in the create box** — after clicking "Create", the generated-link preview row ("Copy link:") stayed visible at the top of the modal even though the new link was already listed under Active links with its own Copy button — so it looked like the form hadn't cleared. The redundant preview row is removed; creating a link now resets the form and briefly highlights the new entry in the Active links list, where it can be copied. (The 1.7.4 fix cleared the input fields but not this preview row.)
### Added
- **Reverse-proxy / HTTPS setup guide** — new `docs/setup/ZORAXY_SETUP.md` walks through putting the scanner behind Zoraxy with a Let's Encrypt certificate on a LAN-only deployment: DNS A-record to a private IP, ACME via DNS-01 challenge (HTTP-01 cannot reach a LAN-only host), proxy rule to `127.0.0.1:5100`, binding the app to loopback with `--host 127.0.0.1`, and scanner-specific verification (SSE streaming, HTTPS share links, self-update). Linked from the README (new "HTTPS / reverse proxy" section) and SECURITY.md.
### Fixed
- **SECURITY.md corrections** — the web UI binds to `0.0.0.0` by default, not `127.0.0.1` as claimed; the MSAL token cache path was still the pre-1.x `~/.gdpr_scanner_config.json` (actual: `~/.gdprscanner/token.json`).
---
## [1.7.6] — 2026-06-11
### Fixed
- **Update restart leaked the listening socket and hopped to port 5101** — Werkzeug marks its server socket inheritable (`srv.socket.set_inheritable(True)`, unconditionally, for its debug reloader), so the in-app update's `os.execv` restart carried the old listening socket into the new process as a zombie listener: same PID listening on both 5100 (never accepted — clients hang) and 5101 (the actual server). The 1.7.3 `SO_REUSEADDR`/grace-period fix couldn't help because the port genuinely was occupied — by the restarting process itself. `_restart_self()` now marks every fd above stderr close-on-exec before the exec (`_mark_fds_cloexec()`, enumerating `/proc/self/fd` on Linux), so the old socket dies with the exec and the new server rebinds 5100 immediately.
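
The close-on-exec step described for 1.7.6 can be sketched like this. It is not the project's `_mark_fds_cloexec()`: the function name, the `min_fd` parameter, and the fallback range used when `/proc/self/fd` is unavailable are assumptions.

```python
import os


def mark_fds_cloexec(min_fd=3):
    """Mark every open fd above stderr as close-on-exec."""
    try:
        # Linux: enumerate the fds this process actually has open.
        fds = [int(name) for name in os.listdir("/proc/self/fd")]
    except OSError:
        # Assumed fallback for systems without /proc; the bound is arbitrary.
        fds = list(range(min_fd, 256))
    for fd in fds:
        if fd < min_fd:
            continue  # leave stdin/stdout/stderr inheritable
        try:
            os.set_inheritable(fd, False)
        except OSError:
            pass  # fd was closed in the meantime (e.g. listdir's own handle)
```

Calling this immediately before `os.execv(...)` means a listening socket that Werkzeug marked inheritable is closed by the kernel during the exec instead of surviving into the new process image.
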
---
## [1.7.5] — 2026-06-11
### Fixed
- **Stale UI after updating the server** — Flask served `/static/` files with no `Cache-Control` header, so browsers cached JS/CSS heuristically (often for days). After a server update — including the new in-app self-update, whose post-install reload hit the cache — the backend was new but the frontend stayed old, and fixes appeared "not to work" until a hard refresh. `SEND_FILE_MAX_AGE_DEFAULT = 0` now makes every static file revalidate via ETag: unchanged files answer with a cheap 304, changed files are re-fetched immediately on the next normal page load.
---
## [1.7.4] — 2026-06-10
### Fixed
- **Share modal kept stale input after creating a link** — clicking "Create" only cleared the label field; scope type, user email, date range, and expiry kept their values, so the next link silently inherited the previous link's scope settings. The form-reset logic from `openShareModal()` is now a shared `_resetShareForm()` helper called after every successful create (the generated link row stays visible for copying).
---
## [1.7.3] — 2026-06-10
### Fixed
- **App restart no longer hops to a new port** — the in-app update restart (and any quick stop/start) left connections from the previous instance in TIME_WAIT, and the startup port probe did a plain `bind()` that treats TIME_WAIT as occupied — so the restarted app silently came up on 5101 and the browser's reload poll never found it. The probe now sets `SO_REUSEADDR` (matching how Werkzeug actually binds, so an actively listening port is still detected as occupied), and the requested port gets a 10-second grace period before the auto-increment fallback kicks in, covering the brief window where the old process hasn't fully released the socket.
- **Share links now respect a reverse proxy** — `_getShareBaseUrl()` rewrote every copied share link to `http://<LAN-IP>:5100` (via `/api/local_ip`), which would bypass TLS when the scanner sits behind a reverse proxy (Zoraxy, Caddy, nginx, …): a DPO opening the link would silently fall back to plain HTTP. The LAN-IP rewrite now only applies in the case it was built for — browsing the app at `localhost` over HTTP, where `window.location.origin` would produce links unusable from other machines. Any HTTPS or non-localhost origin is used as-is.
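
The 1.7.3 port probe (bind with `SO_REUSEADDR`, then allow a grace period before falling back to the next port) might look like the sketch below. `port_is_free` and `wait_for_port` are hypothetical names, and the behaviour noted in the comments is that of Linux; Windows treats `SO_REUSEADDR` differently.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1"):
    """Probe a port the way a SO_REUSEADDR server would bind it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # With SO_REUSEADDR, leftover TIME_WAIT connections do not block
        # the bind, but (on Linux) an actively listening socket still does.
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def wait_for_port(port, grace=10.0, interval=0.25):
    """Give the requested port a grace period before giving up on it."""
    deadline = time.monotonic() + grace
    while True:
        if port_is_free(port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Only when `wait_for_port(5100)` returns `False` would the caller move on to the auto-increment fallback.
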
---
## [1.7.2] — 2026-06-10
### Fixed
- **Copy buttons did nothing over plain HTTP** — the share modal's "Copy" buttons (new link + active links) and the log panel's copy button called `navigator.clipboard.writeText()` directly. The Clipboard API only exists in secure contexts (HTTPS or localhost), so when the scanner is reached at `http://<LAN-IP>:5100` the call threw synchronously and the intended `execCommand` fallback never ran — the button silently did nothing. `_copyText()` in `viewer.js` now feature-detects the API, falls back to `document.execCommand('copy')`, and as a last resort shows the link in a `prompt()` for manual copying; `log.js` reuses the same helper via `window._copyText`. `_getShareBaseUrl()` now caches the LAN-IP lookup so the token-list Copy buttons copy synchronously within the click gesture (required for `execCommand`).
---
## [1.7.1] — 2026-06-10
### Added
- **Software update from the GUI** — a new **Settings → General → Software update** group lets the operator check for and install updates without touching the server shell. "Check for updates" fetches origin and shows either "You are running the latest version" or the list of pending commits; "Install update" fast-forwards the git checkout to `origin/<branch>`, reinstalls dependencies only if `requirements.txt` changed, writes an `app_update` audit-log entry, and restarts the app in place by re-exec'ing the process (`os.execv` — same PID, so it works both under systemd and when launched via `start_gdpr.sh`). The page polls until the server is back and reloads itself. Local server-side edits are auto-stashed (kept, never discarded) before the merge. Updating is refused with a clear message while any scan is running. An **"Install updates automatically"** toggle (stored in `config.json` under `auto_update`) enables a background thread that checks once a day and installs unattended, skipping (and retrying hourly) while a scan runs. The group is only shown when the app runs from a git checkout — the frozen desktop build hides it. New blueprint `routes/updates.py` with `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`; 11 new tests in `tests/test_updates.py` with fully mocked git.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the GUI update: fetch + fast-forward-only merge with auto-stash of local hotfixes, dependency reinstall only when `requirements.txt` changed, and a `systemctl restart` if a `gdprscanner.service` unit exists (override with `GDPR_SERVICE`). `./update_gdpr.sh --check` reports pending commits without changing anything; safe to run from cron (quiet no-op when already up to date).
### Fixed
- **Delta token status hid the source count** — the "Tokens saved" line under the Δ Delta scan toggle always showed the bare translation ("Tokens gemt") because the source count only existed in the JS fallback string, which is ignored whenever the lang key exists. The translations now carry a `{n}` placeholder ("Tokens gemt for {n} kilde(r)") substituted in `checkDeltaStatus()`, and the row gained a "?" hint bubble explaining what the saved change-tokens do and that "Clear tokens" forces the next scan to be a full scan.
- **Stale data-file paths in docs and UI text** — README, SECURITY.md, MAINTAINER.md, the `--headless` argparse help (`--settings`, `--reset-db`, epilog), the DB-import replace warning/confirm strings (all three languages), and two code comments still referenced the pre-1.x flat dotfile layout (`~/.gdpr_scanner_delta.json`, `~/.gdpr_scanner_smtp.json`, `~/.gdpr_scanner_machine_id`, `~/.gdpr_scanner.db`). All now point to the actual locations under `~/.gdprscanner/` (`delta.json`, `smtp.json`, `machine_id`, `scanner.db`). The legacy-migration rename tables in `gdpr_scanner.py` intentionally keep the old names.
---
## [1.7.0] — 2026-06-10
### Added
- **PDF redaction for local files** — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (`page.apply_redactions()`), which removes the underlying text data from the PDF stream — not just paints over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.
- **Google Drive file redaction** — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via `files().update()`. Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the `drive` scope (not `drive.readonly`) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`.
- **SFTP file redaction** — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with `sftp.open(path, "wb")`. Source config is matched from `_load_file_sources()` by host + username; credentials are resolved from the keychain via `_resolve_sftp_credentials`. Requires the item to be in the current session's `state.flagged_items` (SFTP host info is not stored in the DB). New method: `SFTPScanner.write_file(remote_path, content)`.
- **SMB file redaction** — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from `full_path` (`//host/share/…`). File is downloaded and re-uploaded using smbprotocol with `CreateDisposition.FILE_SUPERSEDE` so the file is atomically replaced. New function: `file_scanner.write_smb_file(path, content, username, password, domain)`.
- **AI-enhanced NER via Claude** — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in **Settings → AI / NER**: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, `document_scanner.py` calls the Claude API (`claude-haiku-4-5-20251001`) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the `anthropic` package is not installed. API key stored in `config.json` under `claude_api_key`; toggle stored under `claude_ner`. Routes: `GET/POST /api/settings/claude`, `POST /api/settings/claude/test`.
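
The bounded in-memory cache mentioned for the NER results can be sketched as follows. The entry limit of 2 000 is from the entry above; everything else is an assumption, including the class name `BoundedCache`, keying by a SHA-256 of the document text, and evicting the least recently used entry.

```python
import hashlib
from collections import OrderedDict


class BoundedCache:
    """Hypothetical per-document result cache with a fixed entry limit."""

    def __init__(self, max_entries=2000):
        self._max = max_entries
        self._data = OrderedDict()

    def get_or_compute(self, text, compute):
        # Key by content hash so the same document is never analysed twice.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        value = compute(text)
        self._data[key] = value
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry
        return value

    def __len__(self):
        return len(self._data)
```

Here `compute` stands in for whatever performs the actual entity extraction; it runs only on a cache miss.
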
### Changed
- **Redacted and deleted cards stay in the grid until the next scan** — previously redacting (✏) or deleting (🗑) a card — or running a bulk delete — removed the affected cards from the grid and from `S.flaggedData`/`S.filteredData` immediately. Now each item is kept and marked: the card is greyed (`card-resolved` styling), shows a `✏ Redacted` (green) or `🗑 Deleted` (red) badge, and its action buttons are hidden so it can't be re-processed. The operator can see what was handled during the session; the grid is rebuilt on the next scan run, which clears the markers. Implemented with `_redacted` / `_deleted` flags in `results.js` (`appendCard`, `redactItem`, `deleteItem`, `executeBulkDelete`, `deleteSubjectItems`); handled items are also excluded from the bulk-delete match set. `POST /api/delete_bulk` now returns `deleted_ids` so the grid marks exactly the items the server actually deleted (partial failures stay active). Also fixes a latent bug in the data-subject delete flow where `renderGrid()` was called with no argument and threw, falsely reporting "Delete failed" after a successful erasure.
### Fixed
- **Selected card scrolled out of view when opening the preview** — opening the preview panel narrows `.grid-area`, which reflows the `auto-fill` grid to fewer columns and moves every card to a new row. The single-frame `scrollIntoView` ran while the browser's scroll-anchoring re-adjusted `scrollTop` mid-reflow, fighting the scroll so the clicked card ended up off-screen. Fixed by disabling scroll anchoring on `.grid-area` (`overflow-anchor: none`) and deferring the scroll by two animation frames so it runs against the settled layout; the card is now centred (`block: 'center'`) instead of `'nearest'` so it stays clearly visible.
- **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay.
- **Settings modal too narrow for seven tabs** — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.
- **Card action buttons invisible in grid view** — `.card` was missing `position: relative`, so the `position:absolute` delete (🗑), redact (✏), and bulk-select checkbox elements anchored to the viewport instead of the card and were then clipped away by the card's `overflow:hidden`. They only appeared in list view, where those elements are `position:static` and flow inline. Added `position: relative` to `.card` so all three position correctly within each card. Also gave `.card-redact-btn` the same `0.35` baseline opacity as the delete button (it was `opacity:0` at rest) so it's discoverable without hovering.
### Security
- **Stored XSS in the results grid** — scan-derived strings (file name, account/display name, folder, source label, modified date, image `alt`) were interpolated straight into `innerHTML` and `title=` attributes across the card, list, preview, data-subject lookup, and related-documents views. Because these values come from scanned content (e.g. a OneDrive file deliberately named with markup), a crafted filename could execute script in a reviewer's session — including a shared read-only viewer/DPO session. A new `esc()` helper in `static/js/results.js` (escapes `& < > " '`) is now applied to every untrusted field before embedding. The related-documents `onclick` JSON is also escaped with `.replace(/"/g,'&quot;')` to match the delete/redact button pattern, closing an attribute-injection hole where a filename containing `"` could break out of the handler.
- **Reflected XSS in `/api/thumb`** — the `?name=` query parameter was embedded unescaped into the placeholder SVG served as `image/svg+xml`, so opening a crafted `/api/thumb?name=<script>…` URL directly executed script in the app origin. `cpr_detector._placeholder_svg` now HTML-escapes both the type label and the filename before embedding them in the SVG.
- **Claude API key now encrypted at rest** — the Anthropic API key was stored in plaintext in `config.json` while the SMTP password was already Fernet-encrypted. `save_claude_config()` now encrypts the key with the same machine-keyed Fernet (`_encrypt_password`); a new `get_claude_api_key()` decrypts it for use. Legacy plaintext keys are still read transparently and re-encrypted on the next save. Readers in `document_scanner.py` and `routes/app_routes.py` updated accordingly.
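
The `/api/thumb` fix amounts to escaping untrusted strings before they are embedded in SVG markup. A minimal sketch using the standard library's `html.escape` is shown here; the function name `placeholder_svg` and the SVG layout are illustrative and not the actual `cpr_detector._placeholder_svg`.

```python
from html import escape


def placeholder_svg(type_label, filename):
    """Build a placeholder SVG with both text fields escaped."""
    # escape() converts & < > and, with quote=True, " and ' as well.
    safe_label = escape(type_label, quote=True)
    safe_name = escape(filename, quote=True)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">'
        f'<text x="10" y="40">{safe_label}</text>'
        f'<text x="10" y="80">{safe_name}</text>'
        "</svg>"
    )
```

A request such as `?name=<script>alert(1)</script>` then produces inert text content rather than a script element.
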
---
## [1.6.28] — 2026-05-28
### Added
- **Date-range scoping for viewer tokens** — tokens can now carry optional `valid_from` and `valid_to` scope fields (YYYY-MM-DD). When set, `GET /api/db/flagged` filters items whose `modified` date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces `valid_from ≤ valid_to`. All three scope dimensions (role, user, date-range) are independent and combinable.
- **CPR-only mode** — a new `cpr_only` scan option (sidebar toggle `#optCprOnly`, profile editor `#peOptCprOnly`) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. The flag already detected is still shown on cards when `cpr_only=false` (default). Gated in all three engines: file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
- **OCR language override** — a new `ocr_lang` scan option (sidebar select `#optOcrLang`, profile editor `#peOptOcrLang`) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng` (default), `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. The setting flows from the UI through the profile, into all three scan engines (M365 `_scan_bytes_timeout`, M365 attachments `_scan_bytes`, M365 files `_scan_bytes`, Google `_scan_bytes` for both Gmail and Drive). The `lang` parameter is threaded through `cpr_detector._scan_bytes` → `document_scanner.scan_pdf` / `scan_image` and the spawned PDF-OCR subprocess worker. The OCR cache key already included `lang`, so per-language results are cached independently.
- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
- **`DELETE /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.
- **Scheduled report-only email job** — scheduled jobs can now be configured as "report only" (toggle `#schedReportOnly`). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via `get_session_items()`. M365 authentication is not required for report-only jobs — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting `report_only=True` in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).
- **Compliance audit log** — every significant admin action is now written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via `GET /api/audit_log` (returns newest-first, max 1000 entries; filterable by `?action=`). Visible in the Settings modal under a new **Audit Log** tab; the table refreshes whenever the tab is opened. The `log_audit_event()` module-level helper in `gdpr_db.py` silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.
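
The atomic rewrite described in the file-redaction entry above (temp file in the same directory, then a move over the original) can be sketched as follows. This is an illustration under assumptions, not the code in `routes/export.py`; the helper name `atomic_write_text` is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Replace `path` with `text` so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        # On any failure the original is untouched; just remove the temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the temp file into the target's own directory matters: a rename across filesystems is a copy followed by a delete, which would bring back the half-written window.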
### Fixed
- **Stop button had no effect on Google Workspace scans** — `POST /api/scan/stop` only set `state._scan_abort` (the M365/file abort event) and never touched `state._google_scan_abort`. Separately, `_check_abort()` inside `_run_google_scan` was checking `gdpr_scanner._scan_abort` (the M365 event) instead of the module-level `_scan_abort` alias that points to `state._google_scan_abort`. Both bugs combined meant neither the Stop button nor `POST /api/google/scan/cancel` had any effect on a running Google scan. Fixed by having `scan_stop()` set both events and having `_check_abort()` use the correct module-level alias.
- **Settings tab labels wrapping to two lines** — adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry `white-space:nowrap`; `.settings-tabs` retains `flex-wrap:wrap` as a safety net on very small screens.
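
The Stop-button fix above can be pictured with two `threading.Event` objects. This is a simplified sketch: the module-level names are hypothetical stand-ins for `state._scan_abort` and `state._google_scan_abort`, and the real handlers do more than this.

```python
import threading

# Hypothetical stand-ins for state._scan_abort and state._google_scan_abort.
scan_abort = threading.Event()
google_scan_abort = threading.Event()


def scan_stop() -> None:
    """Signal every engine to stop, not only the M365/file one."""
    scan_abort.set()
    google_scan_abort.set()


def check_google_abort() -> bool:
    """The Google engine polls its own event rather than the M365 one."""
    return google_scan_abort.is_set()
```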
---
## [1.6.27] — 2026-05-27
### Added
- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.
- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).
- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.
---
## [1.6.26] — 2026-04-29
### Fixed
- **Previous scan results visible when a new scan starts** — two async functions (`loadHistorySession` and `loadLastScanSummary`) could resolve after `startScan` had already cleared the grid. `loadHistorySession` would re-populate the grid with old history items; `loadLastScanSummary` would re-show the last-scan summary card. Both functions now bail early after each `await` if any of the three scan-running flags (`S._m365ScanRunning`, `S._googleScanRunning`, `S._fileScanRunning`) is set — those flags are written synchronously by `startScan` before any awaits, so the check is race-free.
- **Selected card scrolls out of view when preview panel opens** — clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. `openPreview()` now schedules a `requestAnimationFrame` after removing `.hidden` from the panel so the card is scrolled back into view (`scrollIntoView block: nearest`) once the layout has settled.
- **Gmail and Google Drive preview crashed with a 404 Graph API error** — `_source_type` was never set on Google items in `routes/google_scan.py`, so Gmail and Google Drive cards carried an empty `source_type`. The preview route in `routes/database.py` only checked for `"local"`, `"smb"`, and `"email"` before falling through to the M365 else-branch, which tried to call `https://graph.microsoft.com/.../drive/items/gmail:{id}/preview` — always a 404. Fixed by tagging Gmail items as `_source_type = "gmail"` and Google Drive items as `"gdrive"` at scan time. The preview route now handles both: Google Drive files get an embeddable `https://drive.google.com/file/d/{id}/preview` iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The `state.connector` (M365 auth) guard was also moved inside the `email` and M365 `else` branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.
---
## [1.6.25] — 2026-04-25
### Added
- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
- **CPR cross-referencing (related documents)** — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects `?ref=N`). Powered by a self-join on the existing `cpr_index` table — no new data collection needed. New `GDPRDb.get_related_items(item_id, ref_scan_id)` method and `GET /api/db/related/<item_id>?ref=N` endpoint in `routes/database.py`. Frontend: `#previewRelated` div in the preview panel, `_loadRelated(f)` in `results.js`, `window._openRelated(id, itemData)` helper (looks up live `S.flaggedData` first, falls back to API response for history items).
- **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
- **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
- **`POST /api/file_sources/upload_key`** — new endpoint that validates and stores an SSH private key file, returning a `key_path` for use in the source definition.
- **SFTP entry in export SOURCE_MAP** — Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (`EDE9F7`), consistent with the existing per-source tab and summary table logic.
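
As a rough illustration of the phone-number detection described above (8-digit Danish numbers, optional `+45`/`0045` prefix, common spacing, deduplicated before counting), here is one possible pattern. The scanner's actual regex is not shown in this changelog, so `PHONE_RE` and `distinct_phones()` are assumptions made for the example.

```python
import re

# Illustrative pattern only; not the regex used by the scanner.
PHONE_RE = re.compile(
    r"(?<!\d)(?:(?:\+45|0045) ?)?"   # optional country prefix
    r"(\d{2}(?: ?\d{2}){3})"         # 8 digits, optionally grouped in pairs
    r"(?!\d)"                        # must not run into a longer digit string
)


def distinct_phones(text: str) -> list:
    """Return unique numbers in first-seen order, normalised to digits only."""
    found = (m.group(1).replace(" ", "") for m in PHONE_RE.finditer(text))
    return list(dict.fromkeys(found))
```

Normalising to bare digits before deduplicating means `+45 12 34 56 78` and `12345678` count as one number.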
### Fixed
- **File source form placeholders untranslated** — all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new `data-i18n-placeholder` keys added to `en.json`, `da.json`, and `de.json`; all 12 affected `<input>` elements now carry `data-i18n-placeholder` attributes.
- **"Name" and "Auth" labels untranslated in SFTP form** — the source-name label and the Auth toggle label in the add-source panel had no `data-i18n` attributes. Added keys `m365_fsrc_name` (DA: "Navn") and `m365_fsrc_sftp_auth` (same across languages). The name label used an inner `<span data-i18n>` to preserve the required-field `*` indicator, which would have been clobbered by a `data-i18n` on the outer `<label>` element. The same clobber bug was fixed for the `m365_fsrc_label` usage in the edit form.
- **Password field placeholder showed "Stored in OS keychain" in English** — added translation key `m365_fsrc_pw_keychain_placeholder` (DA: "Gemt i OS-nøglering") and applied `data-i18n-placeholder` to the three password inputs across both forms (SMB add, SFTP add, SMB edit).
---
## [1.6.24] — 2026-04-25
### Fixed
- **Scheduler UI showed untranslated English strings** — frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in `scheduler.js` instead of using `t()`. All six call sites in `schedLoad()`, `schedRenderJobs()`, and `schedLoadHistory()` now call `t()` with the appropriate key. Three new translation keys added to `en.json`, `da.json`, and `de.json`: `m365_sched_no_jobs`, `m365_sched_running`, `m365_sched_disabled`.
---
## [1.6.23] — 2026-04-21
### Added
- **Video file metadata scanning** — `.mp4`, `.mov`, `.m4v`, `.avi`, `.mkv`, `.wmv`, `.flv`, `.webm` files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime `©xyz` atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the `gps_location` special category, exactly like a geotagged photo. AVI metadata (RIFF INFO `INAM`/`IART`/`ICMT`) is parsed without any external library. Requires `mutagen>=1.47` (added to `requirements.txt`).
- **Audio file metadata scanning** — `.mp3`, `.flac`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.wav`, `.opus`, `.aiff` files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses `mutagen.File(easy=True)` which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with `exif_pii`. Fixed a silent bug in `_extract_audio_metadata` where `mutagen.File(io.BytesIO(content), filename)` was passing the BytesIO as the `filename` positional argument; corrected to `mutagen.File(fileobj=..., filename=...)`.
- **Audio and video test fixtures** — `tests/fixtures/local_files/generate_fixtures.py` now generates 6 new fixtures: `14_audio_artist_pii.mp3`, `15_audio_artist_pii.flac` (artist name → flag), `16_audio_no_pii.mp3`, `17_audio_no_pii.flac` (no tags → no flag), `18_video_gps.mp4` (GPS + artist → flag), `19_video_no_pii.mp4` (no tags → no flag). Total fixtures: 19 (14 flagged, 5 negative).
### Fixed
- **Audio and video files not appearing in local/SMB file scan** — `file_scanner.py` maintained its own hardcoded `DEFAULT_EXTENSIONS` set that was never updated when video and audio extensions were added to `cpr_detector.SUPPORTED_EXTS`. Fixed by importing `SUPPORTED_EXTS` from `cpr_detector` directly; `DEFAULT_EXTENSIONS` is now an alias for it. `cpr_detector.SUPPORTED_EXTS` is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
- **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
---
## [1.6.22] — 2026-04-21
### Added
- **Auto-email after manual scan** — a new **Email report after manual scan** toggle in **Settings → Email report** sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as `auto_email_manual` in `smtp.json`. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected).
- **Route integration test suite** — 44 new tests in `tests/test_route_integration.py` covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on `GET /api/db/flagged`, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on `run_scan()` exception, and `GET /api/db/sessions` shape and ordering. Total test count: 172.
### Fixed
- **Role scope filter silently returned nothing** — `GET /api/db/flagged` filtered rows by `row.get("role")` but the column returned from the DB is `user_role`. Role-scoped viewer tokens (`{"role": "student"}` or `{"role": "staff"}`) therefore excluded every item and returned an empty list. Fixed in `routes/database.py`.
- **Historical session query included newer scans** — `gdpr_db.get_session_items(ref_scan_id=N)` used a lower-bounded window (`started_at >= ref.started_at - 300`) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. Fixed by adding an upper bound (`started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`).
- **Scan button stuck disabled after file scan** — `run_file_scan` broadcast a `scan_start` SSE event, which the `scan_start` handler in `_attachSchedulerListeners` intercepted and set `S._m365ScanRunning = true`. When `file_scan_done` fired it checked `!S._m365ScanRunning` before re-enabling the button — finding it still `true`, the button stayed disabled permanently. No `scan_done` (M365) ever arrives to clear the flag. Fixed by removing the `scan_start` broadcast from `run_file_scan`; the `scan_phase "Files — …"` event immediately following already sets `_fileScanRunning` correctly via the phase-source detection in `_attachScanListeners`.
- **`TypeError: unhashable type: 'dict'` during file and M365 scans** — `_distinct_cprs = list(dict.fromkeys(cprs))` in both scan paths treated `cprs` as a list of strings, but `extract_matches` returns a list of dicts (`{"formatted": "…", "page": …, …}`). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both `run_file_scan` (line 251) and `run_scan` (line 1100) by keying on `c["formatted"]`: `list(dict.fromkeys(c["formatted"] for c in cprs))`.
- **Profile applied early lost user selection and source checkboxes** — two startup race conditions: (1) Profiles with `user_ids = "all"` applied before the M365 user list had loaded ran `.forEach()` on an empty array (no-op); when `loadUsers()` completed it defaulted all users to `selected = false` with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a `_pendingProfileAllUsers` deferred flag mirroring the existing `_pendingProfileUserIds` mechanism — `loadUsers()` applies it after populating `S._allUsers`. (2) If the profile was selected in the narrow window before `_loadFileSources()` returned and rendered the sources panel, `_applyProfile()` iterated zero checkboxes and the source selection was silently discarded; a subsequent `renderSourcesPanel()` call then re-rendered all sources as checked (their default). Fixed by calling `renderSourcesPanel()` in `_applyProfile()` when no source checkboxes are present in the DOM yet — same guard already used in `loadUsers()`.
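
The `unhashable type: 'dict'` fix above is easy to reproduce in isolation. A minimal sketch; the sample matches are made-up placeholder values, and only the `formatted` and `page` keys are taken from the entry above.

```python
def distinct_cprs(matches: list) -> list:
    """Return unique formatted CPR strings in first-seen order."""
    # dict.fromkeys() needs hashable keys, so key on the string field
    # rather than on the match dicts themselves.
    return list(dict.fromkeys(m["formatted"] for m in matches))


# Made-up placeholder matches shaped like the dicts described above.
matches = [
    {"formatted": "010190-1234", "page": 1},
    {"formatted": "020280-5678", "page": 1},
    {"formatted": "010190-1234", "page": 3},
]
unique = distinct_cprs(matches)
```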
---
## [1.6.21] — 2026-04-20
### Added
- **Local-file scan test fixtures** — `tests/fixtures/local_files/` contains 13 ready-made files (`.txt`, `.csv`, `.docx`, `.xlsx`) covering every detection scenario: CPR with explicit label, mod-11-valid CPR without label, post-2007 CPR with/without context keyword, protected number (day+40), multiple CPRs in one file, mixed PII (CPR + email + Art. 9 health data), and three true-negative cases (clean content, invoice false-positive, post-2007 serial number without context). All CPR numbers are mathematically valid; false-positive fixtures are verified to produce zero hits. Run `generate_fixtures.py` to regenerate the binary files.
- **Interface PIN** — optional session-level authentication gate for the main scanner interface. Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; anyone reaching `http://host:5100` is redirected to `/login` and must enter the PIN before accessing scan controls, settings, or results. Viewer tokens and the `/view` route are completely unaffected — reviewers continue to use their own auth chain. The PIN is stored as a salted SHA-256 hash in `config.json`. Brute-force protection: 5 failed attempts per IP locks out for 5 minutes. A `POST /api/interface/logout` endpoint clears the session. PIN management via `GET/POST/DELETE /api/interface/pin`.
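
The Interface PIN entry above combines a salted SHA-256 hash with a per-IP lockout. The sketch below shows one way those two pieces can fit together. The function names, the in-memory failure table, and the sliding-window lockout are assumptions for illustration, not the application's actual implementation.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

MAX_ATTEMPTS = 5          # failed attempts per IP before lockout
LOCKOUT_SECONDS = 5 * 60  # lockout duration

_failures: dict = {}      # ip -> timestamps of recent failed attempts (hypothetical store)


def hash_pin(pin: str, salt: Optional[bytes] = None) -> tuple:
    """Return (salt_hex, digest_hex) for a PIN using salted SHA-256."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.sha256(salt + pin.encode("utf-8")).hexdigest()
    return salt.hex(), digest


def verify_pin(pin: str, salt_hex: str, digest_hex: str, ip: str,
               now: Optional[float] = None) -> bool:
    """Check a PIN, refusing outright while the IP is locked out."""
    now = time.time() if now is None else now
    recent = [t for t in _failures.get(ip, []) if now - t < LOCKOUT_SECONDS]
    if len(recent) >= MAX_ATTEMPTS:
        _failures[ip] = recent
        return False
    _, candidate = hash_pin(pin, bytes.fromhex(salt_hex))
    if hmac.compare_digest(candidate, digest_hex):  # constant-time comparison
        _failures.pop(ip, None)
        return True
    _failures[ip] = recent + [now]
    return False
```

The `now` parameter exists only so the lockout window can be exercised in tests without sleeping.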
### Fixed
- **"Vælg" (select mode) button did nothing** — `toggleSelectMode`, `toggleCardSelect`, `selectAllVisible`, and `applyBulkDisposition` were defined inside an ES module but never assigned to `window`, so all `onclick` attributes calling them silently failed. Added the four missing `window.*` exports at the bottom of `results.js`.
- **Progress counter frozen at M365 total during Google/file scan** — the `scan_progress` handler in `scan.js` only updated `progressStats` and `progressEta` for `source === "m365"`. When M365 finished first, the counter stayed at its final value (e.g. "15083 / 15083 ETA 0s") for the entire duration of the Google and file scans. Fixed in two places: `scan_done` now clears the stats/ETA elements immediately when another scan is still running; `scan_progress` for Google/file sources now shows a running `"X scanned"` count (using the `scanned` field those engines already send) and clears ETA, but only while M365 is not running — M365 stats continue to dominate during concurrent scans.
- **PDF OCR kills process on large files** — `document_scanner` previously called `convert_from_path()` once for the entire PDF before the processing loop, allocating all page images in memory simultaneously. A 50-page A4 PDF at 300 DPI required ~1.3 GB in a single allocation, triggering the OS OOM killer. Fixed by rendering one page at a time with `convert_from_path(first_page=N, last_page=N)` inside the loop across `scan_pdf`, `redact_fitz_pdf`, and `redact_pdf`. Peak OCR memory is now bounded to roughly one page (~26 MB at 300 DPI) regardless of document length.
- **No bulk disposition tagging** — each result card had to be opened individually to set a disposition. Added a Select mode (filter bar "Vælg" button) that reveals per-card checkboxes. Selecting one or more items shows a bulk tag bar at the bottom of the grid with a disposition dropdown and Apply button. Calls `POST /api/db/disposition/bulk`; updates all selected items in-memory and clears the selection. "Select all visible" / "Deselect all" toggle available in the bar. Hidden in viewer mode.
- **No disposition progress summary** — added a thin stats bar between the filter bar and the grid showing total · unreviewed · retain · delete · % reviewed. Updates after every single or bulk disposition save and after each grid render. Unreviewed count is highlighted in red until everything is tagged; turns green at 100%.
- **Google Drive always did a full scan** — Drive scanning in `routes/google_scan.py` used `conn.iter_drive_files()` on every run, re-downloading every file regardless of what changed. Added Google Drive delta scan using the Drive Changes API. When `delta` is enabled in scan options, the first run records a Changes API start page token per user (`gdrive:{email}` key in `delta.json`). Subsequent runs call `conn.get_drive_changes(user_email, token)` and only process files that have been added or modified since the last scan. Invalid or expired tokens fall back to a full scan automatically. Token save loads the current `delta.json` fresh before writing to avoid racing with concurrent M365 token saves. `google_scan_done` SSE event now includes `delta` and `delta_sources` fields.
- **No memory guard before OCR page renders** — added `_ocr_mem_ok()` check (`psutil.virtual_memory().available >= 500 MB`) before each page render in all three OCR paths. Pages that would exceed the threshold are skipped and recorded as `"skipped"` in `page_methods` with a printed warning rather than crashing the scan.
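
The memory figures in the PDF OCR entry above can be checked with simple arithmetic. The sketch assumes uncompressed 8-bit RGB page images (3 bytes per pixel), which the changelog does not state explicitly. Under that assumption it gives roughly 26 MB per A4 page at 300 DPI and roughly 1.3 GB for 50 pages.

```python
A4_INCHES = (8.27, 11.69)  # A4 page size in inches
DPI = 300
BYTES_PER_PIXEL = 3        # assumption: 8-bit RGB, uncompressed


def page_bytes(dpi: int = DPI) -> int:
    """Approximate size of one rendered A4 page in bytes."""
    width_px = round(A4_INCHES[0] * dpi)
    height_px = round(A4_INCHES[1] * dpi)
    return width_px * height_px * BYTES_PER_PIXEL


one_page_mb = page_bytes() / 1e6           # peak when rendering page by page
fifty_pages_gb = 50 * page_bytes() / 1e9   # peak when rendering all pages at once
```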
---
## [1.6.20] — 2026-04-18
### Fixed
- **Graph `sendMail` reported as failure despite email being delivered** — `_post()` in `m365_connector.py` called `r.json()` unconditionally after `raise_for_status()`. The Graph `sendMail` endpoint returns HTTP 202 with an empty body on success, causing `json.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. This was caught by the `smtp_test` exception handler and surfaced as an error even though the email had been sent. Fixed by returning `r.json() if r.content else {}` so any Graph endpoint that responds with no body (sendMail, delete operations, etc.) is handled correctly.
- **Graph error hidden when SMTP host not configured** — when Graph failed and no SMTP host was saved, `smtp_test` returned the generic "No SMTP host configured" message, swallowing the actual Graph error. The `if not host` branch now surfaces the Graph exception text alongside the Mail.Send permission guidance so the real cause is visible.
- **Gmail vs Google Workspace SMTP error messages** — the auth failure handler now detects whether the username is a personal Gmail address (`@gmail.com`) or a Google Workspace custom-domain account, and shows a different message for each. Personal Gmail: existing App Password troubleshooting steps. Google Workspace: explains that SMTP access is controlled by the Workspace admin console (2-Step Verification policy, SMTP relay service), not the user's personal security settings.
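
The `sendMail` fix above can be demonstrated without any network access. `FakeResponse` is a hypothetical stand-in for the HTTP response object used by `_post()`; only the `r.json() if r.content else {}` expression comes from the changelog.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response (status code plus raw body)."""
    status_code: int
    content: bytes

    def json(self):
        return json.loads(self.content)


def parse_body(r) -> dict:
    """Return the decoded JSON body, or {} when the response has no body."""
    return r.json() if r.content else {}


accepted = parse_body(FakeResponse(202, b""))          # empty 202 body is not an error
created = parse_body(FakeResponse(200, b'{"id": "abc"}'))
```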
---
## [1.6.19] — 2026-04-18
### Fixed
- **Gmail SMTP error message misleading when App Password already in use** — the auth failure handler in both `smtp_test` and `send_report` unconditionally told the user to "create an App Password", even when they were already using one. Gmail returns the same `535` / `Username and Password not accepted` error for a wrong app password, a revoked app password, spaces left in the 16-character code, or a wrong username — none of which are helped by the old message. The Gmail branch now lists the three most common causes (spaces in the code, revoked password, wrong username) and still links to the App Password page to generate a new one. The Microsoft personal account branch is unchanged.
---
## [1.6.18] — 2026-04-18
### Fixed
- **Art.30 and Excel exports missing GWS and local/SMB sources** — two silent failures caused Google Workspace and file-scan results to be absent from all exports after a page reload.
- `routes/google_scan.py`: called `_db.end_scan()` (method does not exist on `GDPRDb` — the correct name is `finish_scan`). The resulting `AttributeError` was swallowed by the bare `except Exception: pass` guard, so `finished_at` was never written on GWS scan records. Since `get_session_items()` requires `finished_at IS NOT NULL`, every GWS scan was permanently invisible to both export functions.
- `routes/google_scan.py`: emitted `"scan_done"` at completion instead of `"google_scan_done"`, causing the M365 done handler to fire for Google scans and breaking the SSE teardown logic.
- `scan_engine.py` (`run_file_scan`): called `_db.begin_scan(sources=…, user_count=0, options=source)` with keyword arguments, but `begin_scan(self, options: dict)` only accepts a single positional dict. The `TypeError` was caught silently, leaving `_db_scan_id = None`; all subsequent `save_item` calls were skipped, so local and SMB items were never written to the database.
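
The `begin_scan` failure above shows how a broad `except` can hide a signature mismatch. A small reproduction, with `FakeDb` as a hypothetical stand-in for `GDPRDb` that keeps only the single-dict signature quoted in the entry:

```python
from typing import Optional


class FakeDb:
    """Hypothetical stand-in for GDPRDb with the single-dict begin_scan signature."""

    def begin_scan(self, options: dict) -> int:
        return 1


def start_scan_buggy(db) -> Optional[int]:
    scan_id = None
    try:
        # Keyword arguments that begin_scan() does not accept raise TypeError.
        scan_id = db.begin_scan(sources=["local"], user_count=0, options={})
    except Exception:
        pass  # the error vanishes and scan_id silently stays None
    return scan_id


def start_scan_fixed(db) -> int:
    # One positional dict, matching the signature; errors are allowed to surface.
    return db.begin_scan({"sources": ["local"], "user_count": 0})
```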
---
## [1.6.17] — 2026-04-18
### Added
- **Scan history browser** — results from any past scan session can now be reviewed without running a new scan. On page load, when no scan is running, the last completed session is automatically loaded into the results grid. A **History** banner appears above the filter bar showing the session date, scanned sources, and item count. A **Sessions** button in the banner opens a dropdown listing all past sessions newest-first, each showing date, time, source labels, item count, and Delta / Latest badges. Clicking a session loads its items. A **Latest scan** button (shown only when browsing a past session) jumps back to the most recent session. Starting a new scan exits history mode and takes over the grid with live SSE results. Session cache is invalidated on each scan completion so the picker always reflects the true state of the database.
- `gdpr_db.py` — new `get_sessions(limit, window_seconds)` groups all completed scans by the 300-second concurrent-scan window and returns session summaries newest-first. `get_session_items()` gains an optional `ref_scan_id` parameter to anchor the session window to any past scan.
- `routes/database.py` — new `GET /api/db/sessions`; `GET /api/db/flagged` now accepts `?ref=<scan_id>` to serve items for a specific historical session.
- `static/js/history.js` (new) — `loadHistorySession(refScanId)`, `openHistoryPicker()`, `closeHistoryPicker()`, `exitHistoryMode()`, `invalidateHistoryCache()` all exposed on `window`.
- `state.js` — `_historyRefScanId: null` tracks which session is currently displayed (`null` = live/SSE).
- `results.js` — initial status check calls `loadHistorySession(null)` instead of `loadLastScanSummary()`.
- `scan.js` — `startScan()` calls `exitHistoryMode()`; all three `*_done` handlers call `invalidateHistoryCache()`.
- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.
---
## [1.6.16] — 2026-04-18
### Added
- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.
---
## [1.6.15] — 2026-04-12
### Added
- **Role-scoped viewer tokens** — viewer token links can now be restricted to a single role so the recipient can only see student or staff items. A new **Role scope** dropdown (All roles / Ansatte / Elever) in the Share modal is selected when creating a token. The scope is stored as `"scope": {"role": "student"|"staff"}` in `viewer_tokens.json`. Enforcement is two-layered: `GET /api/db/flagged` filters items server-side using `session["viewer_scope"].role` set at token validation time; the `#filterRole` dropdown in the viewer is pre-set and hidden so the constraint cannot be bypassed client-side. Tokens without a scope field (existing tokens, PIN sessions) remain unrestricted. Role badge (Ansatte / Elever) shown on each scoped token row in the Active links list.
- **Role filter in results + role-scoped exports** — a new **Role** dropdown in the filter bar (All roles / Ansatte / Elever) narrows the results grid to staff or student items. Clicking **Excel** or **Art.30** while a role is selected exports only that group — the `?role=student|staff` param is forwarded to both export endpoints. `_build_excel_bytes()` and `_build_article30_docx()` now accept a `role` param; all internal sheets (GPS, External transfers, Art.30 staff/student tables) respect the filter. Filenames get an `_elever` or `_ansatte` suffix.
- **Scan filter options for student environments** — two new profile options reduce noise when scanning student accounts:
- **Ignore GPS in images** (`skip_gps_images`) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
- **Min. CPR count per file** (`min_cpr_count`, default 1) — a file is only flagged if it contains at least this many *distinct* CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
- Both options are saved in profiles and editable in the Profile Manager editor.
- **GitHub Actions CI/CD — macOS build** — `.github/workflows/build.yml` now also builds a macOS `.app` bundle (`macos-15`, Apple Silicon ARM64) on every push to `main` and on `v*` tags. Released as `GDPRScanner_macos_arm64.zip`. (Originally `macos-13` / Intel, changed when GitHub retired that runner.)
### Fixed
- **OneDrive 404 errors during delta scans** — `GET /users/{id}/drive/root/delta` returns 404 for users with no OneDrive licence, a disabled service plan, a drive that was never provisioned (account never signed in), or a suspended account. Previously these 404s fell through to `requests.raise_for_status()` and were caught by the generic `except Exception` handler in `_scan_user_onedrive`, broadcasting a red `scan_error` card. Full scans never showed the error because `_iter_drive_folder_for` has a bare `except Exception: return`. Fixed by adding `M365DriveNotFound(M365Error)` to `m365_connector.py`, raising it from `_get()` on HTTP 404, and handling it explicitly in `_scan_user_onedrive` with a `scan_phase` broadcast ("OneDrive (user): not provisioned — skipped") before the generic exception handler.
- **CI — Windows artifact never uploaded** — PyInstaller `--onedir` puts the exe inside `dist/GDPRScanner/`, not at `dist/*.exe`. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips `dist\GDPRScanner\` into `GDPRScanner_windows_x64.zip` (mirroring the existing Linux step).
- **`EFFORT_ESTIMATE.md`** — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
- **Settings → Security tab** — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's **Configure** button navigates directly to the Security tab.
- **Viewer mode layout** — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The `🔍 GDPRScanner` brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
- **Share modal — Revoke / Copy buttons broken** — `JSON.stringify(token)` produced a double-quoted string that terminated the surrounding `onclick="…"` HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format.
- **Viewer PIN — Clear PIN rejected with "current PIN is incorrect"** — clicking **Clear PIN** without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty.
- **Share modal — all UI strings now translated** — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (`share_*`, `viewer_pin_*`) in `en.json`, `da.json`, and `de.json`.
- **Excel / ART.30 export — Gmail and Google Drive missing from summary** — `by_source` was built from flagged items only, so sources that produced zero hits were silently skipped. Both the Excel Summary sheet and the ART.30 "Breakdown by source" table now include every source that was actually scanned, showing `0` items and `0` CPR hits where nothing was found. New `GDPRDb.get_session_sources()` method reads the `sources` JSON column from all scans in the current session window to determine which sources ran.
- **Scan never finishes when M365 + Google run concurrently** — `scan_done` (M365 finished) was closing the SSE connection immediately via `S.es.close()`, even when `S._googleScanRunning` or `S._fileScanRunning` was still true. The `google_scan_done` / `file_scan_done` events therefore never arrived, leaving the progress bar stuck at 100% indefinitely. SSE teardown is now deferred until the last concurrent scan completes: `scan_done` only closes the connection if neither Google nor File is still running; `google_scan_done` and `file_scan_done` close it when they are the final scan to finish.
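
The OneDrive 404 fix in this list comes down to giving the 404 its own exception type so it can be caught before the generic handler. The sketch below is an assumption-heavy illustration: `get`, `scan_user_onedrive`, the `fetch` callable, and the `broadcast` signature are hypothetical simplifications, while the class names and the broadcast message come from the entry.

```python
class M365Error(Exception):
    """Base connector error (name taken from the changelog)."""


class M365DriveNotFound(M365Error):
    """Raised when Graph answers 404 for a user's drive."""


def get(fetch, url):
    # fetch is a hypothetical callable returning (status_code, payload).
    status, payload = fetch(url)
    if status == 404:
        raise M365DriveNotFound(url)
    if status >= 400:
        raise M365Error(f"HTTP {status} for {url}")
    return payload


def scan_user_onedrive(fetch, user_id, broadcast):
    url = f"/users/{user_id}/drive/root/delta"
    try:
        return get(fetch, url)
    except M365DriveNotFound:
        # Handled first: an unprovisioned drive is a skip, not an error.
        broadcast("scan_phase", f"OneDrive ({user_id}): not provisioned — skipped")
        return None
    except Exception as exc:
        # Everything else still surfaces as a scan_error card.
        broadcast("scan_error", str(exc))
        return None
```

Ordering matters here: because `M365DriveNotFound` is also an `Exception`, the specific handler has to appear before the generic one.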
---

@@ -16,11 +16,27 @@ python -m pytest tests/ -q
**Split modules:** `scan_engine.py` (M365 + file scan), `sse.py` (SSE broadcast), `checkpoint.py`, `app_config.py` (all persistence), `cpr_detector.py`
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` and update the token. Invalid/expired tokens fall back to full scan automatically.
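
The delta flow described above can be sketched roughly as follows. Only `get_drive_start_token`, `get_drive_changes`, and the `gdrive:{user_email}` key come from the note. `InvalidTokenError`, `list_all_files`, the `(files, new_token)` return shape, and the plain dict standing in for `delta.json` are assumptions made for illustration.

```python
class InvalidTokenError(Exception):
    """Hypothetical: stands in for whatever the connector raises on a stale token."""


def scan_drive_for_user(conn, user_email, delta_store, delta=False):
    key = f"gdrive:{user_email}"
    token = delta_store.get(key) if delta else None

    if token is not None:
        try:
            # Assumed return shape: (changed_files, next_page_token).
            files, new_token = conn.get_drive_changes(user_email, token)
            delta_store[key] = new_token
            return files
        except InvalidTokenError:
            pass  # invalid or expired token: fall back to a full scan

    files = conn.list_all_files(user_email)  # hypothetical full-listing call
    if delta:
        # First delta-enabled scan (or fallback): record a fresh start token.
        delta_store[key] = conn.get_drive_start_token(user_email)
    return files
```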
**Google connector write-back** — `google_connector.py` exposes `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both connectors for in-place Drive redaction. These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; the rest of the pipeline is source-agnostic. Auth: `"password"` via OS keychain; `"key"` from `~/.gdprscanner/sftp_keys/<uuid>`. `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Single-file I/O: `_ssh_connect()`, `read_file(remote_path)`, `write_file(remote_path, content)` — do not duplicate SSH setup outside these methods.
**Shared content processing** — all three scan engines funnel downloaded bytes through `cpr_detector._scan_bytes(content, filename)`. `scan_engine.py` uses `_scan_bytes_timeout` for PDFs (subprocess + hard timeout). Do not duplicate file-type handling in per-source code.
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS`. Do not maintain a separate extension list anywhere else.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines no-op stubs at module level (avoids circular import). `gdpr_scanner.py` overwrites them at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. Do not import them directly in those modules.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for SSE constraints, export, preview, scheduler, NER, audit log, viewer, software update, and other route-specific rules.
**Self-update (server only)** — `routes/updates.py` powers **Settings → General → Software update**: git fetch → ff-only merge → conditional `pip install` → `os.execv` restart (same PID; marks fds close-on-exec first so Werkzeug's inheritable listening socket doesn't leak and squat the port). Only enabled for git checkouts (`_supported()` is false for frozen desktop builds). `update_gdpr.sh` is the CLI/cron equivalent. Refused while a scan runs; optional daily auto-update thread (`config.json["auto_update"]`). Restart keeps port 5100 (the port probe uses `SO_REUSEADDR` + a 10s grace). See `routes/CLAUDE.md` → "Software update".
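
The close-on-exec step before the restart could look something like the sketch below. It is not the project's `_mark_fds_cloexec`: the function name, the `max_fd` bound, and the choice to start at descriptor 3 are assumptions. `os.set_inheritable` is the standard-library call used to clear the inheritable flag.

```python
import os


def mark_fds_cloexec(max_fd=256):
    """Mark every open descriptor above stderr as non-inheritable.

    Returns the list of descriptors that were open and got marked.
    """
    marked = []
    for fd in range(3, max_fd):
        try:
            os.set_inheritable(fd, False)
        except OSError:
            continue  # descriptor is not open
        marked.append(fd)
    return marked
```

A restart would then call this helper immediately before something like `os.execv(sys.executable, [sys.executable] + sys.argv)`, so the re-executed process does not inherit the old listening socket.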
**Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. Use `_cp_path(prefix)` — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
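
A rough sketch of the checkpoint naming and cadence described above. `cp_path`, `scan_with_checkpoints`, `clear_checkpoints`, and the JSON payload shape are hypothetical; only the `checkpoint_{prefix}.json` naming, the 25-item interval, and the glob-and-delete behaviour come from the note.

```python
import json
from pathlib import Path

DATA_DIR = Path.home() / ".gdprscanner"
CHECKPOINT_EVERY = 25


def cp_path(prefix, data_dir=DATA_DIR):
    # Single place that knows the filename pattern.
    return Path(data_dir) / f"checkpoint_{prefix}.json"


def scan_with_checkpoints(items, prefix, data_dir=DATA_DIR):
    done = []
    for n, item in enumerate(items, start=1):
        done.append(item)
        if n % CHECKPOINT_EVERY == 0:
            # Payload shape is an assumption for this sketch.
            cp_path(prefix, data_dir).write_text(json.dumps({"done": done}))
    return done


def clear_checkpoints(data_dir=DATA_DIR):
    removed = 0
    for path in Path(data_dir).glob("checkpoint_*.json"):
        path.unlink()
        removed += 1
    return removed
```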
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json` (also holds `claude_api_key`/`claude_ner` and the `auto_update` flag), `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`. Static files are served with `SEND_FILE_MAX_AGE_DEFAULT=0` (ETag revalidation) so the UI is fresh after a self-update — do not re-add long static caching.
## Non-obvious files
@@ -30,57 +46,70 @@ python -m pytest tests/ -q
| `routes/state.py` | Shared mutable state + scan locks (not a typical Flask state file) |
| `routes/google_scan.py` | Google scan execution lives here, not in `google_connector.py` |
| `routes/viewer.py` | Viewer token + PIN API; also owns brute-force rate-limit state |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI. Also defines `window._copyText` (HTTP-safe clipboard helper reused by `log.js`) |
| `lang/da.json` | Primary language — source of truth is `en.json` |
| `build_gdpr.py` | Desktop app builder; contains embedded `LAUNCHER_CODE` for PyInstaller |
| `routes/updates.py` | Self-update routes + `os.execv` restart with fd-cleanup; git-checkout only |
| `update_gdpr.sh` | CLI/cron self-update (fetch, ff-merge, deps, service restart) |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy (LAN-only, Let's Encrypt DNS-01) |
## Tests
128 tests in `tests/`. No integration tests for Flask routes or live M365/Google connections.
215 tests in `tests/`. No integration tests for live M365/Google connections.
## Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
**`tests/test_updates.py`** — 12 tests for the software-update routes (`routes/updates.py`). All git interaction goes through a mocked `_git()`; `_schedule_restart` is patched so no test re-execs the process, and `gdpr_db.log_audit_event` is patched so no test writes the real database. Includes `_mark_fds_cloexec` (the socket-leak guard for the restart).
Read-only access for DPOs and reviewers. Key invariants:
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- **`window.VIEWER_MODE`** — injected by Jinja2 in `index.html`. `auth.js` reads it at startup; adds `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks — except `delBtn` in the card builder which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- **`viewer_tokens.json` format** — stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. The old bare-list format is migrated transparently on first write. Do not write the file as a bare list.
- **`app.secret_key`** — derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- **`GET /api/db/flagged`** — returns `get_session_items()` (last completed scan session, joined with dispositions). Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- **Rate-limit state** (`_pin_attempts` dict in `routes/viewer.py`) — in-memory only, resets on server restart. Intentional — a restart clears lockouts without a persistent store.
- **Token onclick attributes** — Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- **`stClearViewerPin` guard** — validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- **Share link base URL** — `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin` — that produces `127.0.0.1` links useless to remote viewers.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1` — do not change those to `0.0.0.0`.
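
The `viewer_tokens.json` migration mentioned in the list above (bare list to keyed object) might be sketched like this. `normalise_token_store` is a hypothetical helper, not the function in `routes/viewer.py`; only the two on-disk shapes are taken from the note.

```python
def normalise_token_store(raw):
    """Return the store in the {"tokens": [...]} shape, whatever was loaded."""
    if isinstance(raw, list):
        # Legacy bare-list format: wrap it, no PIN entry yet.
        return {"tokens": list(raw)}
    store = dict(raw)
    store.setdefault("tokens", [])
    return store
```

Writing back the normalised dict on the next save is what makes the migration transparent to callers.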
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename. Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database.
## Sources panel resize — static/js/log.js + sources.js
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files (14 flagged, 5 true negatives). `generate_fixtures.py` regenerates the binary files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block.
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height is controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered** — `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
**`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` concatenates all run texts with no separators. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries. Do not remove this trailing space — the detection will silently regress.
## Scan filter options — scan_engine.py
All options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan).
- **`skip_gps_images` (bool, default `false`)** — images whose only PII is GPS coordinates are not flagged. GPS data still stored in `exif` field if flagged by another signal.
- **`min_cpr_count` (int, default `1`)** — minimum distinct CPR numbers before flagging. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — do not revert to `dict.fromkeys(cprs)` (raises `TypeError: unhashable type: 'dict'`). Files with faces or EXIF PII are still flagged regardless.
- **`cpr_only` (bool, default `false`)** — skip items whose only hits are email addresses, phone numbers, faces, or EXIF/GPS metadata.
- **`ocr_lang` (str, default `"dan+eng"`)** — Tesseract language packs. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner` and the PDF-OCR subprocess worker. Cache key already includes `lang`.
- **File scan** reads options from `source` dict keys directly. **M365 scan** reads from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic.
- **UI:** sidebar `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All saved/loaded by `profiles.js`.
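
The `min_cpr_count` deduplication noted above can be illustrated in a few lines. `distinct_cprs` and `cpr_qualifies` are hypothetical names (the project's `_cpr_qualifies` may take different arguments), and the CPR values are placeholders. The `dict.fromkeys` expression is the one quoted in the list.

```python
def distinct_cprs(cprs):
    # Order-preserving dedupe on the formatted value; the hit dicts
    # themselves are unhashable, so they cannot be used as keys.
    return list(dict.fromkeys(c["formatted"] for c in cprs))


def cpr_qualifies(cprs, min_cpr_count=1):
    return len(distinct_cprs(cprs)) >= min_cpr_count
```

A CPR repeated ten times therefore counts once, and passing the hit dicts straight to `dict.fromkeys` raises the `TypeError` the note warns about.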
## Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Email body stripped at collection time** — `_scan_user_email` stores body as `msg["_precomputed_body"]`, deletes `msg["body"]` and `msg["bodyPreview"]`. Processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to `$select` without also stripping it.
- **`body_excerpt`** — 500-char plain-text preview stored per flagged email; flows into `flagged_items`, checkpoint JSON, and DB. Do not remove before broadcasting — needed for preview on checkpoint resume.
- **`work_items` → `deque` before processing** — drained via `popleft()` so each item's memory is released immediately. Do not convert back to a list.
- **`del content` / `del body_text`** — raw bytes and body text deleted immediately after use. Both hit and no-hit paths have explicit deletes.
- **PDF OCR rendered page-by-page** — `convert_from_path(first_page=N, last_page=N)` inside the loop; only one page image in memory at a time. Do NOT revert to a bulk call — triggers OOM on large PDFs.
- **OCR memory guard** — `_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; skips if < 300 MB free.
- **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
- **PDF OCR images freed page-by-page** — in `document_scanner.scan_pdf`, `images[page_num-1] = None` immediately after OCR. Do not cache or accumulate page images.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; scan skips the file if < 300 MB free.
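
The deque-draining rule in this list can be sketched as below. `drain` and `process` are hypothetical names, and clearing the caller's list is an assumption added so that no second reference keeps the items alive.

```python
from collections import deque


def drain(work_items, process):
    """Process items one at a time, dropping each reference as soon as it is done."""
    queue = deque(work_items)
    work_items.clear()  # assumption: the caller no longer needs the list
    results = []
    while queue:
        item = queue.popleft()  # the queue no longer holds this item
        results.append(process(item))
        del item  # last local reference gone before the next iteration
    return results
```

Iterating a list with `enumerate()` would instead keep every item referenced until the whole loop finishes, which is the pattern the note rules out.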
## Scan history browser — gdpr_db.py
- **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
- **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
- **`account_name` (display name) is persisted** (migration 11) so DB-loaded cards show the user badge. Legacy rows predating it have `account_name=''` — the frontend `_accountPill` resolves a fallback and still shows the group badge from `user_role`. `save_item` must keep writing `card["account_name"]` (both M365 and Google cards carry it).
- **Scans must be finalised or their items are invisible** — `get_session_items`, `get_open_items`, and `latest_scan_id` all filter on `finished_at IS NOT NULL`. The file scan finalises in a `finally`; M365 (`run_scan`) and Google (`_run_google_scan`) `return` early on abort, so each now calls `finish_scan` before that abort-return. A process kill (deploy/OOM/crash) mid-scan still strands a scan → **`finalize_orphan_scans()`** runs once at server startup (`gdpr_scanner.py` `__main__`, before the scheduler) and finalises every `finished_at IS NULL` scan (safe because nothing is scanning at boot). Do not add a scan-results query that ignores `finished_at` instead of fixing finalisation.
- **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
- **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
- See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
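
The symmetric session window and the `finished_at` filter described above can be expressed in plain Python for clarity. This is a sketch over in-memory dicts, not the SQL in `gdpr_db.py`. `scans_in_session` is a hypothetical helper, and timestamps are assumed to be epoch seconds.

```python
def scans_in_session(scans, ref_scan_id, window_seconds=300):
    """Return ids of finished scans started within +/- window_seconds of the reference."""
    ref = next(s for s in scans if s["scan_id"] == ref_scan_id)
    lo = ref["started_at"] - window_seconds
    hi = ref["started_at"] + window_seconds
    return [
        s["scan_id"]
        for s in scans
        if s["finished_at"] is not None and lo <= s["started_at"] <= hi
    ]
```

Because the window extends in both directions, anchoring on either the M365 scan or a Google scan started two minutes later yields the same session, whereas a one-sided lower bound would drop the earlier scan.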
## Global gotchas
- **Pattern matching in Python** — when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Use `in` check first and print if not found.
- **`__getattr__` on modules** — only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- **`JSON.stringify` inside `onclick="…"` attributes** — produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler.
- **`JSON.stringify` inside `onclick="…"` attributes** — produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler. When the object is embedded as an `onclick` payload, also `.replace(/"/g,'&quot;')` it (matches the delete/redact button pattern) so a `"` in a filename can't break out.
- **Escape scan-derived strings before `innerHTML`** — file names, account/display names, folders, and source labels come from scanned content and may contain markup. Pass them through `esc()` (in `results.js`) before embedding in `innerHTML` or `title=`/`alt=` attributes. Server-side SVG/HTML built from request params (e.g. `_placeholder_svg` for `/api/thumb`) must use `_html_esc`. Skipping either re-introduces stored/reflected XSS.
- **Secrets at rest use the machine-keyed Fernet** — the SMTP password and Claude API key are encrypted via `app_config._encrypt_password` / `_decrypt_password`. New secret-bearing config fields must follow the same pattern; read them through a decrypting accessor (e.g. `get_claude_api_key()`), never `_load_config().get(...)` directly.
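
The module-level `__getattr__` gotcha in this list is easy to confirm with a throwaway module. The snippet builds one with `types.ModuleType`; the module name `demo` and the `scan_bytes` stand-in are illustrative only.

```python
import types

SOURCE = '''
def __getattr__(name):
    # Consulted only for attribute access on the module object (PEP 562).
    if name == "scan_bytes":
        return lambda content: len(content)
    raise AttributeError(name)

def uses_bare_name():
    # Bare global lookup: checks module globals, then builtins.
    # The module __getattr__ hook is not consulted here.
    return scan_bytes(b"abc")
'''

demo = types.ModuleType("demo")
exec(SOURCE, demo.__dict__)

# Attribute access from outside goes through __getattr__ and works.
via_attribute = demo.scan_bytes(b"abc")

# The same name used bare inside the module fails with NameError.
try:
    demo.uses_bare_name()
    bare_lookup_failed = False
except NameError:
    bare_lookup_failed = True
```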
## Directory-scoped rules
- `routes/CLAUDE.md` — SSE constraints, scan_progress source field, file_sources, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar phase parsing, JS gotchas
- `routes/CLAUDE.md` — SSE constraints, M365 exceptions, export, preview, audit log, email, scheduler, Claude NER, viewer route, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar, SSE teardown, history browser, CPR cross-referencing, sources panel resize, viewer JS, JS gotchas
- `templates/CLAUDE.md` — CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md` — i18n conventions

@@ -1,15 +1,16 @@
# Contributing to GDPR Scanner
Thank you for considering a contribution. This project helps organisations find
and manage personal data in Microsoft 365 tenants. Contributions that improve
compliance coverage, reliability, and usability are very welcome.
and manage personal data across Microsoft 365 (Exchange, OneDrive, SharePoint,
Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems.
Contributions that improve compliance coverage, reliability, and usability are
very welcome.
---
## Before You Start
- Check the [open issues](../../issues) and [SUGGESTIONS.md](SUGGESTIONS.md) to
see if your idea is already tracked
- Check the [open issues](../../issues) to see if your idea is already tracked
- For large features, open an issue first to discuss the approach — this avoids
wasted effort if the direction doesn't fit
- Security vulnerabilities: see [SECURITY.md](SECURITY.md) — do not file public issues
@@ -31,16 +32,16 @@ pip install -r requirements.txt
# Danish NER model (optional — needed for name/address detection)
python -m spacy download da_core_news_lg
# Run the Document Scanner
python server.py
# Run the GDPRScanner
# Start the scanner (serves on http://0.0.0.0:5100)
python gdpr_scanner.py
# Run the test suite
python -m pytest tests/ -q
```
You will need a Microsoft Azure app registration with the permissions described
in the README to test GDPRScanner against a real tenant. A developer tenant
is available for free via the [Microsoft 365 Developer Program](https://developer.microsoft.com/microsoft-365/dev-program).
To test against a real M365 tenant you will need a Microsoft Azure app
registration with the permissions described in the README. A free developer
tenant is available via the [Microsoft 365 Developer Program](https://developer.microsoft.com/microsoft-365/dev-program).
---
@ -48,8 +49,7 @@ is available for free via the [Microsoft 365 Developer Program](https://develope
- Bug fixes
- Improved CPR false-positive reduction
- New language files (see `lang/en.lang` for the key list)
- Items from [SUGGESTIONS.md](SUGGESTIONS.md) — check the status column first
- New language files (see `lang/en.json` for the key list)
- Performance improvements for large tenants
- Docker / deployment improvements
- Documentation fixes
@ -65,7 +65,7 @@ is available for free via the [Microsoft 365 Developer Program](https://develope
- All personal data (CPR numbers) must be SHA-256 hashed before storage — never store or log raw CPR values
- Wrap Graph API calls in try/except and handle `M365PermissionError` gracefully
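A minimal sketch of the hashing rule above. This is not the project's actual helper: the name `hash_cpr` and the normalisation step (dropping the hyphen before hashing) are assumptions made for illustration.

```python
import hashlib


def hash_cpr(raw_cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number (hypothetical helper)."""
    # Normalise so "010190-1234" and "0101901234" hash identically
    # (assumed convention, not confirmed by the project).
    digits = "".join(ch for ch in raw_cpr if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected a 10-digit CPR number")
    return hashlib.sha256(digits.encode("utf-8")).hexdigest()


# Store or log only the digest, never the raw value.
digest = hash_cpr("010190-1234")  # synthetic example value
```

Hashing the normalised digits means a lookup by CPR can hash the query value the same way and compare digests, without the raw number ever reaching the database or the log.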
**JavaScript (embedded in the Flask templates)**
**JavaScript (`static/js/*.js` — ES modules)**
- `const` / `let` — no `var`
- `async/await` over `.then()` chains
- All user-visible strings must have a `data-i18n` key so translations work
@ -78,9 +78,9 @@ is available for free via the [Microsoft 365 Developer Program](https://develope
## Adding a Language
1. Copy `lang/en.lang` to `lang/xx.lang` (ISO 639-1 code)
1. Copy `lang/en.json` to `lang/xx.json` (ISO 639-1 code)
2. Translate all values — keys must stay identical
3. Test by setting `~/.m365_scanner_lang` to `xx` and restarting
3. Test by writing `xx` to `~/.gdprscanner/lang` and restarting
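Step 2 can be checked mechanically. The sketch below assumes each language file is a flat JSON object of key/value pairs; the function names are hypothetical and not part of the project.

```python
import json
from pathlib import Path


def missing_and_extra_keys(reference: dict, translation: dict) -> tuple:
    """Compare the top-level keys of two language dictionaries."""
    ref_keys, new_keys = set(reference), set(translation)
    # (keys the translation lacks, keys the translation added)
    return ref_keys - new_keys, new_keys - ref_keys


def check_language_file(lang_dir: Path, code: str) -> tuple:
    """Load en.json and <code>.json from lang_dir and report key differences."""
    reference = json.loads((lang_dir / "en.json").read_text(encoding="utf-8"))
    translation = json.loads((lang_dir / f"{code}.json").read_text(encoding="utf-8"))
    return missing_and_extra_keys(reference, translation)
```

Calling `check_language_file(Path("lang"), "xx")` should return two empty sets when the translation is complete.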
---
@ -88,10 +88,12 @@ is available for free via the [Microsoft 365 Developer Program](https://develope
1. Fork the repository and create a branch: `git checkout -b feature/my-feature`
2. Make your changes and test them
3. Run a syntax check: `python -m py_compile gdpr_scanner.py m365_connector.py gdpr_db.py`
4. Update `README.md` if your change adds or changes user-visible behaviour
5. Open a pull request with a clear description of what it does and why
6. Link to the relevant issue or SUGGESTIONS.md item if applicable
3. Run the test suite: `python -m pytest tests/ -q`
4. Run a syntax check on the modules you touched, e.g.:
`python -m py_compile gdpr_scanner.py scan_engine.py app_config.py gdpr_db.py`
5. Update `README.md` if your change adds or changes user-visible behaviour
6. Open a pull request with a clear description of what it does and why
7. Link to the relevant issue if applicable
We aim to review pull requests within one week.


@ -7,7 +7,7 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Web server
| Module | Purpose |
|---|---|
| `flask` | Web server and API routing for both the GDPRScanner UI |
| `flask` | Web server and API routing for the GDPRScanner UI |
### Microsoft 365 authentication and API
| Module | Purpose |
@ -15,39 +15,64 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
| `msal` | Microsoft Authentication Library — handles OAuth2 device code flow (delegated) and client credentials (application) for Microsoft Graph API access |
| `requests` | HTTP client used for all Microsoft Graph API calls |
### Google Workspace scanning
| Module | Purpose |
|---|---|
| `google-auth` | Service account authentication and domain-wide delegation for Google APIs |
| `google-auth-httplib2` | HTTP transport adapter for `google-auth` |
| `google-api-python-client` | Gmail API, Google Drive API, and Admin Directory API client |
### SMB / file system scanning
| Module | Purpose |
|---|---|
| `smbprotocol` | SMB2/3 network share scanning without requiring a mounted drive — used for Windows file server sources |
| `keyring` | OS keychain credential storage for SMB passwords |
| `python-dotenv` | `.env` file fallback for headless SMB credentials when no keychain is available |
### PDF handling
| Module | Purpose |
|---|---|
| `pdfplumber` | Text extraction from PDFs with a selectable text layer — fast and accurate for native PDFs |
| `pdf2image` | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| `pytesseract` | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
| `pypdf` | PDF metadata reading and low-level page manipulation |
| `reportlab` | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
| `pymupdf` (fitz) | Physically removes the text layer from PDFs — preferred GDPR-compliant redaction method |
| `pdf2image` *(optional)* | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| `pytesseract` *(optional)* | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
| `pypdf` *(optional)* | PDF metadata reading and low-level page manipulation — used in the `document_scanner.py` redaction path |
| `reportlab` *(optional)* | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
> Optional packages are not in `requirements.txt`. Install them manually if you need OCR or the standalone `document_scanner.py` CLI.
### Document formats
| Module | Purpose |
|---|---|
| `python-docx` | Read and write `.docx` Word documents; also used to generate the Article 30 Register of Processing Activities report |
| `openpyxl` | Read and write `.xlsx` Excel files — used for the scan result export workbook |
| `img2pdf` | Converts images to PDF for archiving redacted output |
### Image processing and face detection
| Module | Purpose |
|---|---|
| `opencv-python` (cv2) | Face detection in images via Haar cascade classifiers; also used for face blurring during anonymisation |
| `numpy` | Array operations required internally by OpenCV |
| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, image resizing |
| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, EXIF metadata extraction |
### NLP / Named Entity Recognition
| Module | Purpose |
|---|---|
| `spacy` | NLP engine for Danish Named Entity Recognition — detects person names, addresses, and organisations in text. Requires the `da_core_news_lg` model (~500 MB) |
### Archive scanning
### Encryption
| Module | Purpose |
|---|---|
| `py7zr` | 7-Zip archive support — allows the scanner to inspect `.7z` compressed files |
| `cryptography` | Fernet symmetric encryption — encrypts SMTP passwords at rest in `~/.gdprscanner/smtp.json`; the Fernet key is derived from `~/.gdprscanner/machine_id` |
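
One way a Fernet key could be derived from a machine identifier is sketched below. The diff does not show the project's actual derivation, so the use of SHA-256 here is an assumption; what is certain is that a Fernet key is the url-safe base64 encoding of 32 bytes.

```python
import base64
import hashlib


def derive_fernet_key(machine_id: str) -> bytes:
    """Derive a Fernet-compatible key from a machine identifier (assumed scheme)."""
    # SHA-256 yields exactly 32 bytes, the raw key length Fernet expects.
    raw = hashlib.sha256(machine_id.encode("utf-8")).digest()
    # Fernet keys are the url-safe base64 encoding of those 32 bytes.
    return base64.urlsafe_b64encode(raw)


key = derive_fernet_key("example-machine-id")  # placeholder value
# With the `cryptography` package installed, the key would be used as:
#   from cryptography.fernet import Fernet
#   token = Fernet(key).encrypt(b"smtp-password")
```

Because the key depends only on the machine identifier, anyone who can read that file can decrypt the stored password, which is why the file must stay private.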
### Scheduling
| Module | Purpose |
|---|---|
| `APScheduler` | In-process background scheduler — drives the scheduled scan feature (`schedule.json`). Uses `BackgroundScheduler` with `CronTrigger` |
### System monitoring
| Module | Purpose |
|---|---|
| `psutil` | Available-memory probe in `scan_engine.py` — skips file downloads when free RAM drops below 300 MB to prevent OOM crashes on large tenants |
### Desktop app packaging
| Module | Purpose |
@ -64,16 +89,17 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Data storage
| Module | Purpose |
|---|---|
| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdpr_scanner.db` |
| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdprscanner/scanner.db` |
| `json` | Config files, checkpoint files, language files, API request/response serialisation |
| `zipfile` | Database export/import archive creation and reading; also used in the PyInstaller build process |
| `csv` | CSV file scanning support in the Document Scanner |
| `csv` | CSV file scanning support |
### Security and hashing
| Module | Purpose |
|---|---|
| `hashlib` | SHA-256 hashing of CPR numbers before storage — raw CPR values are never written to the database |
| `secrets` | Cryptographically secure random values (used in auth state parameters) |
| `secrets` | Cryptographically secure random values — used for viewer token generation and auth state parameters |
| `uuid` | UUID generation for viewer tokens and scan session identifiers |
### File system and paths
| Module | Purpose |
@ -85,8 +111,9 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Networking and email
| Module | Purpose |
|---|---|
| `smtplib` | SMTP email delivery for the headless report feature — supports STARTTLS and SMTPS/SSL |
| `smtplib` | SMTP email delivery for the scheduled report feature — supports STARTTLS and SMTPS/SSL |
| `email` | Email message construction (MIME) for the SMTP report feature |
| `socket` | UDP probe to determine the machine's LAN IP address — used to build routable share links for viewer tokens |
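
The UDP probe mentioned for `socket` is a common standard-library idiom. A minimal sketch follows; the function name, the probe address (a reserved documentation address), and the loopback fallback are assumptions rather than the project's code.

```python
import socket


def lan_ip(probe_host: str = "192.0.2.1", probe_port: int = 80) -> str:
    """Return the local address the OS would use for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (e.g. an offline machine): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()
```

The returned address can then be substituted for `localhost` when building a share link that other machines on the network need to reach.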
### Text and pattern matching
| Module | Purpose |
@ -99,12 +126,13 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
| `threading` | Background scan thread so the Flask web UI stays responsive during long scans |
| `queue` | Server-Sent Events message queue — passes scan results from the background thread to the browser |
| `concurrent.futures` | `ProcessPoolExecutor` for parallel OCR processing of multi-page PDFs |
| `gc` | Explicit garbage collection after large scan batches to release memory promptly |
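
How `threading` and `queue` combine to feed Server-Sent Events can be sketched with the standard library alone. The sentinel object and the payload shape are assumptions; only the `data: …` frame format followed by a blank line is fixed by the SSE protocol.

```python
import json
import queue
import threading

_DONE = object()  # sentinel marking the end of a scan (assumed convention)


def run_scan(items, out: queue.Queue) -> None:
    """Stand-in for the background scan thread: push one result per item."""
    for item in items:
        out.put({"type": "scan_file_flagged", "name": item})
    out.put(_DONE)


def sse_stream(out: queue.Queue):
    """Yield Server-Sent Events frames until the sentinel arrives."""
    while True:
        msg = out.get()  # blocks until the scan thread produces something
        if msg is _DONE:
            break
        yield f"data: {json.dumps(msg)}\n\n"


results = queue.Queue()
worker = threading.Thread(target=run_scan, args=(["a.pdf", "b.docx"], results), daemon=True)
worker.start()
frames = list(sse_stream(results))
worker.join()
```

In a Flask route, a generator like this would typically be returned as `Response(sse_stream(results), mimetype="text/event-stream")`.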
### I/O and streams
| Module | Purpose |
|---|---|
| `io` | In-memory byte streams for generating Excel and Word documents without writing to disk |
| `struct` | Binary data unpacking (used in some PDF processing paths) |
| `struct` | Binary data unpacking used in some PDF processing paths |
### Date and time
| Module | Purpose |
@ -117,15 +145,15 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
|---|---|
| `platform` | Detects the operating system for macOS/Windows-specific code paths |
| `subprocess` | Launches Tesseract and Poppler as external processes for OCR and PDF rendering |
| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db` etc. |
| `sys` | Python runtime access — sys.exit(), sys.path, sys.version |
| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db`, etc. |
| `sys` | Python runtime access — `sys.exit()`, `sys.path`, `sys.version` |
| `os` | Environment variables and low-level file operations |
| `logging` | Application-level logging — routes warnings and errors to stderr and rotating file handlers |
### Encoding and serialisation
| Module | Purpose |
|---|---|
| `base64` | Encodes thumbnail images as base64 strings for embedding in JSON API responses |
| `struct` | Binary format parsing used in some document processing paths |
---


@ -0,0 +1,10 @@
{
"folders": [
{
"path": "."
},
{
"path": "."
}
]
}


@ -102,7 +102,7 @@ tests/ pytest test suite — 112 tests, all should pass.
**Settings stats show 0 (Scanned / Flagged / Scans)**
`routes/database.py` → `db_stats()` — queries `flagged_items` and `scans` directly
→ Stats populate from existing DB on app start — no re-scan needed
→ If still 0 after a completed scan: check `~/.gdpr_scanner.db` exists and is not empty
→ If still 0 after a completed scan: check `~/.gdprscanner/scanner.db` exists and is not empty
**File scan results not persisting to DB**
`scan_engine.py` → `run_file_scan()` — must call `_db.begin_scan()` not `start_scan()`

OSS_LANDSCAPE.md Normal file

@ -0,0 +1,67 @@
# Open Source Landscape — GDPR / PII Document Scanners
An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
---
## Summary
No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
---
## Existing open source tools
### [Microsoft Presidio](https://github.com/microsoft/presidio)
A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
### [Octopii](https://github.com/redhuntlabs/Octopii)
Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
### "GDPR scanners" on GitHub
Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
### CPR libraries
Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
---
## Commercial products that do cover it
| Product | M365 | GWS | CPR | Article 30 | Open source |
|---|---|---|---|---|---|
| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
---
## Capability comparison
| Capability | GDPRScanner | Presidio | Octopii | Commercial |
|---|---|---|---|---|
| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ | ✅ | ✅ | ❌ |
---
## What makes GDPRScanner unique
The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
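For readers unfamiliar with the check, the modulus-11 rule and a basic date sanity test can be sketched as below. This is an illustration of the classic algorithm, not GDPRScanner's own validator; the function name is hypothetical and the sample number is a synthetic test value.

```python
# Weights applied to the ten CPR digits, left to right.
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def cpr_passes_modulus11(cpr: str) -> bool:
    """Return True if the value has a plausible date and a valid mod-11 checksum."""
    digits = [int(ch) for ch in cpr if ch.isdigit()]
    if len(digits) != 10:
        return False
    day = digits[0] * 10 + digits[1]
    month = digits[2] * 10 + digits[3]
    # Cheap date sanity check: rejects most random 10-digit numbers early.
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return False
    # The weighted digit sum must be divisible by 11.
    return sum(d * w for d, w in zip(digits, CPR_WEIGHTS)) % 11 == 0


print(cpr_passes_modulus11("070761-4285"))  # weighted sum is 154 = 11 * 14
```

Note that CPR numbers without a valid check digit have been issued since 2007, so a strict modulus-11 test trades some recall for fewer false positives.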

README.md

@ -1,8 +1,13 @@
# GDPRScanner
Scans Microsoft 365, Google Workspace, and local/network file systems for Danish
CPR numbers and personal data (PII). Produces GDPR compliance reports and supports
Article 30 record-keeping obligations.
Scans Microsoft 365, Google Workspace, local/network file systems, and SFTP servers
for Danish CPR numbers and personal data (PII). Produces GDPR compliance reports and
supports Article 30 record-keeping obligations.
---
> **Work in progress — not ready for production use.**
> This project is under active development and has not been formally tested or audited for production deployment. It is shared publicly for transparency and collaboration. Use at your own risk.
---
@ -27,7 +32,7 @@ an IDE with intelligent completion. The result is the author's work.
- **Folder path in results** — each email result shows its full folder path (e.g. `Inbox / Ansøgninger pædagog SFO`) in the card and in Excel export
- **Delete items** — flagged results can be deleted directly from the UI, individually or in bulk
- **CPR false-positive reduction** — strict CPR validation
- **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, and SMB/network shares. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
- **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, SMB/network shares, and SFTP. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
- **Progressive streaming** — results stream card-by-card via Server-Sent Events as the scan runs
- **Token auto-refresh** — expired tokens are detected and silently refreshed mid-scan without interrupting the UI
- **Incremental / resumable scans** — interrupted scans save a checkpoint; the next run resumes from where it stopped rather than starting over
@ -41,10 +46,13 @@ an IDE with intelligent completion. The result is the author's work.
- **Account name on cards** — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable
- **Retention policy enforcement** — flag items older than a configurable retention period with an Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via `--retention-years`
- **Data subject lookup** — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext
- **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel
- **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings
- **CPR cross-referencing** — clicking any flagged card with CPR hits shows a "Related documents" section listing other items from the same scan session that share at least one CPR number, ordered by number of shared CPRs. Clicking any entry opens it in the preview panel. Works in live mode and history mode. Powered by a SQL self-join on the `cpr_index` table — no new data collection required
- **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel; **bulk disposition tagging** lets you select multiple cards with checkboxes and apply a disposition to all of them at once. A stats bar above the grid shows total · unreviewed · retain · delete counts and the percentage reviewed
- **Interface PIN** — optional session-level PIN that gates the main scanner interface (`/`). Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; unauthenticated visitors are redirected to `/login`. The `/view` viewer route and all viewer API endpoints are exempt — reviewers are unaffected. Salted SHA-256 hash; brute-force protection (5 attempts / 5 min per IP)
- **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings. Tokens can be **role-scoped** (Ansatte / Elever) so a recipient only sees items for their group, or **user-scoped** so an individual employee only sees their own flagged files (supports dual M365 + Google Workspace identity)
- **Article 30 report** — one-click export of a structured Word document (`.docx`) satisfying the GDPR Article 30 register of processing activities obligation
- **SQLite results database** — scan results, CPR index, PII breakdown, disposition decisions, and scan history are persisted to `~/.gdprscanner/scanner.db` alongside the JSON cache, enabling cross-scan queries and trend tracking
- **Software updates from the UI** — check for and install new versions from **Settings → General → Software update**, or enable automatic daily updates; the app restarts itself in place (see [Software updates](#software-updates) below)
- **Built-in user manual** — click the **?** button in the top bar to open the manual in a dedicated window. Available in Danish and English. Printable via the browser's print function. Served from `MANUAL-DA.md` / `MANUAL-EN.md` at `/manual?lang=da|en` — always in sync with the installed version, no internet required. In the packaged desktop app the manual opens as a native pywebview window; in the browser it opens as a popup.
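
The SQL self-join behind CPR cross-referencing can be illustrated with an in-memory database. The two-column schema (`item_id`, `cpr_hash`) is an assumption made for this sketch; the real `cpr_index` table probably carries more columns, such as a scan identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: one row per (item, hashed CPR) pair.
conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT)")
conn.executemany(
    "INSERT INTO cpr_index VALUES (?, ?)",
    [
        ("doc-A", "h1"), ("doc-A", "h2"),
        ("doc-B", "h1"), ("doc-B", "h2"),
        ("doc-C", "h2"),
        ("doc-D", "h3"),
    ],
)

# Join the table to itself on the hash, excluding the item being viewed,
# and rank the other items by how many hashes they share with it.
RELATED_SQL = """
SELECT other.item_id, COUNT(DISTINCT other.cpr_hash) AS shared
FROM cpr_index AS mine
JOIN cpr_index AS other
  ON other.cpr_hash = mine.cpr_hash AND other.item_id != mine.item_id
WHERE mine.item_id = ?
GROUP BY other.item_id
ORDER BY shared DESC, other.item_id
"""

related = conn.execute(RELATED_SQL, ("doc-A",)).fetchall()
print(related)  # [('doc-B', 2), ('doc-C', 1)]
```

Because the join runs on hashes that are already stored, the lookup needs no extra data collection and never touches a raw CPR number.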
---
@ -73,7 +81,7 @@ The sidebar sources panel lists all configured scan sources. Click **Sources** t
**Google Workspace tab** — Two authentication modes: **Workspace** (service account with domain-wide delegation — scans all users) and **Personal account** (OAuth 2.0 device-code flow — scans the signed-in account only). Once connected, per-source toggles control whether Gmail and/or Google Drive appear in the sidebar panel and are included in scans. See [GOOGLE_SETUP.md](docs/setup/GOOGLE_SETUP.md) for setup instructions.
**File sources tab** — Add local folder paths or SMB/CIFS network shares with a name, path, and optional SMB credentials. Each saved source appears as a checkbox in the sidebar panel (local, SMB/network). Use the **Edit** button on each row to update credentials or rename a source without deleting it.
**File sources tab** — Add local folder paths, SMB/CIFS network shares, or SFTP servers. A pill selector (Local / Network / SFTP) switches the form fields. SFTP sources require host, port, username, remote path, and auth type (password or private key). SSH private keys are uploaded via the UI, validated with paramiko, and stored in `~/.gdprscanner/sftp_keys/` with `600` permissions; passwords and passphrases are stored in the OS keychain. Each saved source appears as a checkbox in the sidebar panel. Use the **Edit** button on each row to update credentials or rename a source without deleting it.
**Skipped automatically:** `.recycle`, `.sync`, `.btsync`, `.trash`, `.git`, `node_modules`, `System Volume Information`, and other system/sync folders. Hidden directories (`.` prefix) are skipped too.
@ -123,9 +131,10 @@ A date-from picker limits the scan to items modified after the selected date. Qu
| Scan attachments | On | Scan PDF/Word/Excel attachments inside emails |
| Max attachment size | **20 MB** | Skip attachments larger than this threshold |
| Max emails per user | **2000** | Cap per mailbox to avoid very long scans |
| **Δ Delta scan** | Off | Fetch only changed items since the last scan (see [Delta scan](#delta-scan) below) |
| **Δ Delta scan** | Off | Fetch only changed items since the last scan — hover the **?** for details (see [Delta scan](#delta-scan) below) |
| ** Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
| **Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
| **Ignore GPS in images** | Off | Skip images whose only PII signal is an embedded GPS coordinate. Useful for student scans where smartphones embed location in every camera photo. GPS is still shown in the detail card if the image is flagged for another reason (faces, EXIF author). |
| **Min. CPR count per file** | **1** | Only flag a file if it contains at least this many *distinct* CPR numbers. Set to 2 to suppress false positives in student scans (e.g. a student's own consent form with a single CPR) while still reporting class lists and grade sheets with multiple CPRs. |
| **Retention policy** | Off | Flag items older than N years — hover the **?** for details (see [Retention policy](#retention-policy-enforcement)) |
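
The **Min. CPR count per file** option counts *distinct* numbers, which the sketch below makes concrete. The regular expression and function names are assumptions for illustration; the scanner's real matcher is stricter (see the CPR false-positive reduction feature).

```python
import re

# Loose candidate pattern: six digits, optional hyphen, four digits.
CPR_CANDIDATE = re.compile(r"\b(\d{6})-?(\d{4})\b")


def distinct_cpr_count(text: str) -> int:
    """Count distinct CPR candidates, treating hyphenated and plain forms as equal."""
    return len({date + serial for date, serial in CPR_CANDIDATE.findall(text)})


def should_flag(text: str, min_cpr_count: int = 1) -> bool:
    """Flag a file only if it holds at least min_cpr_count distinct candidates."""
    return distinct_cpr_count(text) >= min_cpr_count


consent_form = "CPR: 010190-1234 (repeated below as 0101901234)"
class_list = "010190-1234, 020290-5678, 030391-9012"
print(should_flag(consent_form, 2), should_flag(class_list, 2))  # False True
```

With the threshold at 2, a form that repeats one person's number stays unflagged, while a list covering several people is still reported.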
#### Results grid
@ -144,7 +153,11 @@ Each flagged item appears as a card showing:
- **Ext.** / **** badge — external email recipient or externally shared file (Art. 44–46 transfer risk)
- **delete button** — appears on hover (grid view) or always visible (list view)
**Filter bar** — always visible above both the results grid and the preview panel. Narrow results by source, disposition, transfer risk, and risk level:
**Disposition stats bar** — always visible above the results grid when items are loaded. Shows: Total · Unreviewed · Retain · Delete · percentage reviewed. Updates live after every disposition save.
**Select mode** — click **Vælg** in the filter bar to enter bulk-selection mode. Per-card checkboxes appear; a bulk tag bar at the bottom of the grid shows the count of selected items, a **Select all visible** button, a disposition dropdown, and an **Apply** button. Click **Done** to exit select mode.
**Filter bar** — always visible above both the results grid and the preview panel. Narrow results by source, disposition, transfer risk, risk level, and role:
| Filter | Options |
|---|---|
@ -152,6 +165,20 @@ Each flagged item appears as a card showing:
| Disposition | All / Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted |
| Transfer risk | All / External recipient / External share / Shared |
| Risk level | All risk levels / Art. 9 special category / Photos / biometric |
| **Role** | **All roles / Ansatte (staff) / Elever (students)** |
The Role filter also scopes exports — selecting **Elever** before clicking **Excel** or **Art.30** produces a report containing only student items. The exported filename gets an `_elever` or `_ansatte` suffix so recipients can distinguish the files.
#### Scan history browser
Review results from any past scan session without running a new scan. A **Sessions** button appears in the banner above the results grid once a scan has completed.
- Click **Sessions** to open the session picker — lists all past scans with date, sources, and item count. Each entry shows a **Δ** badge for delta scans and a **Latest** badge for the most recent session.
- Click any session row to load its results into the grid. A history banner replaces the progress bar, showing the session date, sources scanned, and item count.
- **Latest scan** button in the banner jumps back to the most recent session.
- Starting a new scan automatically exits history mode and switches to live SSE results.
- All filters, dispositions, and exports work normally while browsing history — the Role filter and viewer-scope enforcement still apply.
- Viewer tokens work with history mode: `GET /api/db/flagged?ref=N` applies scope filtering the same way as the live endpoint.
#### Delete items
@ -182,6 +209,11 @@ The **⬇ Excel** button exports all current results to a `.xlsx` file (`m365_sc
| OneDrive | Flagged OneDrive files |
| SharePoint | Flagged SharePoint files |
| Teams | Flagged Teams files |
| Gmail | Flagged Gmail messages |
| Google Drive | Flagged Google Drive files |
| Local | Flagged local-folder files |
| Network | Flagged SMB/NAS files |
| SFTP | Flagged SFTP server files |
In macOS app builds, the export opens a native Save dialog instead of a browser download.
@ -196,7 +228,7 @@ Configure email delivery in **Settings → Email report**. Click **Save** to sto
| SMTP host | e.g. `smtp.office365.com`, `smtp.gmail.com` |
| Port | `587` for STARTTLS (default), `465` for SMTPS/SSL |
| Username | SMTP login — usually your sender email address |
| Password | Saved to `~/.gdpr_scanner_smtp.json` (permissions 600). Encrypted at rest using Fernet — key in `~/.gdpr_scanner_machine_id` (chmod 0o600, never share) |
| Password | Saved to `~/.gdprscanner/smtp.json` (permissions 600). Encrypted at rest using Fernet — key in `~/.gdprscanner/machine_id` (chmod 0o600, never share) |
| Graph API | When connected to M365, email is sent via `/me/sendMail` (delegated) or `/users/{sender}/sendMail` (app mode) — no SMTP password needed. Requires `Mail.Send` Graph permission with admin consent. |
| From address | Sender address (defaults to username if blank) |
| STARTTLS | Enable STARTTLS on port 587 (recommended) |
@ -236,13 +268,13 @@ The checkpoint is keyed by a hash of the scan configuration (sources + users + d
### Delta scan
Delta scan uses the Microsoft Graph `/delta` API to fetch only items that have **changed since the last scan**, dramatically reducing Graph API quota usage and scan time on large tenants.
Delta scan uses the Microsoft Graph `/delta` API (M365) and the Google Drive **Changes API** (Google Workspace) to fetch only items that have **changed since the last scan**, dramatically reducing API quota usage and scan time on large tenants.
#### How it works
1. Run one **full scan** first (Delta checkbox off) — this establishes baseline delta tokens
2. Tick **Δ Delta scan** and run again — only items added, modified, or deleted since the previous scan are fetched and CPR-scanned
3. Delta tokens are saved automatically to `~/.gdpr_scanner_delta.json` after each successful scan
3. Delta tokens are saved automatically to `~/.gdprscanner/delta.json` after each successful scan
4. To force a full rescan, click **Clear tokens** under the checkbox (or delete the file)
Delta tokens are stored **per-source**:
@ -253,9 +285,12 @@ Delta tokens are stored **per-source**:
| `sharepoint:{drive_id}` | One SharePoint document library |
| `teams:{drive_id}` | One Teams channel file store |
| `email:{user_id}:{folder_id}` | One mail folder for one user |
| `gdrive:{email}` | One Google Workspace user's Google Drive |
If a token expires (Graph returns HTTP 410 Gone), that source falls back to a full collection automatically and a fresh token is saved. Other sources are unaffected.
If a user's OneDrive returns HTTP 404 during a delta scan (no licence assigned, service plan disabled, or drive never provisioned because the account has never signed in), the user is silently skipped with a grey log entry — no red error card is shown. Full scans already skipped these users silently; delta scans now behave the same way.
Deleted items returned by delta (items with a `deleted` or `@removed` marker) are skipped during CPR scanning.
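
The per-source fallback described above can be sketched as follows. The `fetch_delta` and `fetch_full` callables, the `TokenExpired` exception, and the function name are all hypothetical stand-ins; the real connector talks to the Graph API and persists the token dictionary to `delta.json`.

```python
class TokenExpired(Exception):
    """Stand-in for an HTTP 410 Gone response from a delta endpoint."""


def collect(source_key, tokens, fetch_delta, fetch_full):
    """Return scannable items for one source, falling back to a full collection.

    fetch_delta(token) and fetch_full() are hypothetical callables that each
    return (items, new_token); tokens is the per-source token dictionary.
    """
    token = tokens.get(source_key)
    try:
        if token is None:
            items, new_token = fetch_full()  # no baseline yet: full scan
        else:
            items, new_token = fetch_delta(token)
    except TokenExpired:
        # Expired token: only this source is re-collected in full.
        items, new_token = fetch_full()
    tokens[source_key] = new_token  # a fresh token is saved either way
    # Entries carrying a deletion marker are not scanned.
    return [i for i in items if "deleted" not in i and "@removed" not in i]
```

Keeping the token lookup and the fallback inside one per-source function is what lets the other sources continue with their own tokens when a single one expires.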
After each delta scan, the log panel shows:
@ -305,7 +340,7 @@ Scan results are persisted to `~/.gdprscanner/scanner.db` (SQLite) automatically
| `dispositions` | Compliance officer decisions per item |
| `scan_history` | Aggregated stats per scan for trend tracking |
**API endpoints:** `GET /api/db/stats`, `GET /api/db/trend`, `GET /api/db/scans`, `POST /api/db/subject`, `GET /api/db/overdue`, `POST /api/db/disposition`, `GET /api/db/disposition/<id>`
**API endpoints:** `GET /api/db/stats`, `GET /api/db/trend`, `GET /api/db/scans`, `POST /api/db/subject`, `GET /api/db/overdue`, `POST /api/db/disposition`, `GET /api/db/disposition/<id>`, `GET /api/db/sessions`, `GET /api/db/flagged`
If `gdpr_db.py` is not present, the scanner falls back to JSON-only mode silently.
@ -339,6 +374,12 @@ Every flagged item can be tagged with a compliance decision from the preview pan
Dispositions are saved to the `dispositions` table in the SQLite database and included in the Article 30 report.
#### Bulk disposition tagging
Click **Vælg** in the filter bar to enter select mode. Per-card checkboxes appear. Select individual cards or use **Select all visible** to select every card matching the current filters. Choose a disposition from the bulk tag bar at the bottom of the grid and click **Apply** — the selected items are updated in a single request to `POST /api/db/disposition/bulk`. Click **Done** to exit select mode.
A **disposition stats bar** above the results grid shows totals at a glance and updates after every save.
---
### Retention policy enforcement
@ -458,6 +499,49 @@ python gdpr_scanner.py --import-db ~/compliance/gdpr_export_2026.zip --import-mo
---
### Software updates
When the app runs from a git checkout (the normal server install), it can update itself. The **Settings → General → Software update** group offers:
- **Check for updates** — fetches the upstream repository and shows either "You are running the latest version" or the list of pending commits
- **Install update** — fast-forwards the checkout, reinstalls dependencies if `requirements.txt` changed, and restarts the app in place; the browser waits for the server to come back and reloads automatically
- **Install updates automatically** — optional toggle; a background thread checks once a day and installs unattended
Safety guarantees:
- Updating is **refused while any scan is running** — manual attempts get a clear message, and the auto-updater simply retries on its next hourly tick, so a scheduled scan is never killed mid-run
- Local edits on the server are **auto-stashed** (kept, never discarded) before the merge; the merge is fast-forward-only, so a diverged checkout stops the update instead of creating a merge mess
- Every applied update is recorded in the **compliance audit log** (`app_update`, old → new commit)
- The restart re-execs the process with the same PID, so it works identically under systemd and when launched via `start_gdpr.sh`
The Settings group is hidden in the packaged desktop app (no git checkout to update) — desktop users update by installing a new build.
**CLI / cron equivalent** — `update_gdpr.sh` performs the same update from a shell:
```bash
./update_gdpr.sh # update if upstream has new commits, restart service
./update_gdpr.sh --check # report pending commits, change nothing
```
It restarts a `gdprscanner.service` systemd unit if one exists (override the name with `GDPR_SERVICE=…`) and is quiet when already up to date, so it is safe to run from cron:
```bash
# /etc/cron.d/gdprscanner-update — nightly at 04:00
0 4 * * * root /opt/gdprscanner/update_gdpr.sh >> /var/log/gdpr_update.log 2>&1
```
API endpoints: `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`.
---
### HTTPS / reverse proxy
The scanner itself serves plain HTTP. For encrypted transport on a LAN — recommended, since scan results contain CPR numbers — put it behind a TLS-terminating reverse proxy and bind the app to loopback (`--host 127.0.0.1`) so the proxy is the only way in. Share links automatically follow the HTTPS hostname, and the browser Clipboard API (Copy buttons) works natively in a secure context.
See [ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md) for a complete walkthrough: Zoraxy, Let's Encrypt via DNS-01 challenge (required when the hostname resolves to a private IP), proxy rule, and the scanner-specific verification steps.
---
### Article 30 report
The **Art.30** button in the filter bar generates a GDPR **Article 30 Register of Processing Activities** as a Word document (`.docx`).
@ -482,16 +566,32 @@ The document is dated and can be stored as evidence of ongoing compliance activi
---
### Building the M365 app
### Building the desktop app
`build_gdpr.py` packages `gdpr_scanner.py` + `m365_connector.py` + `lang/` into a standalone native app — same PyInstaller / pywebview approach as `build.py`.
`build_gdpr.py` packages `gdpr_scanner.py` + `m365_connector.py` + `lang/` into a standalone native app using PyInstaller + pywebview.
```bash
python build_gdpr.py # build for the current platform
python build_gdpr.py --icons-only # regenerate icon_m365.icns / icon_m365.ico
python build_gdpr.py --icons-only # regenerate icon_gdpr.icns / icon_gdpr.ico
```
> **Note:** Same cross-compilation restriction applies — must build on the target platform.
| Platform | Output | Native window |
|---|---|---|
| macOS | `dist/GDPRScanner.app` | WKWebView |
| Windows | `dist/GDPRScanner/GDPRScanner.exe` | WebView2 (Edge) |
| Linux | `dist/GDPRScanner/GDPRScanner` | GTK WebKit |
> **Cross-compilation is not supported** — build on the target platform, or use the pre-built binaries from the [GitHub Releases](../../releases) page.
**GitHub Actions** builds all three platforms automatically on every push to `main` and on `v*` tags. Pre-built zips are attached to each release:
| File | Platform |
|---|---|
| `GDPRScanner_windows_x64.zip` | Windows 10/11 x64 |
| `GDPRScanner_linux_x86_64.zip` | Ubuntu 22.04+ / Debian |
| `GDPRScanner_macos_x86_64.zip` | macOS 12+ Intel / Apple Silicon (Rosetta) |
> **macOS Gatekeeper:** the app is unsigned. On first launch right-click → **Open** to bypass the security warning.
---
@ -544,26 +644,58 @@ python gdpr_scanner.py # GDPRScanner on port 5100 (auto-increments if in use)
### Test suite
GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, and the SQLite database.
GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, the SQLite database, and security-sensitive Flask routes.
```bash
pip install pytest
pytest tests/
```
**112 tests across 4 modules — all expected to pass.**
**212 tests across 8 modules — all expected to pass.**
| Module | Tests | Covers |
|---|---|---|
| `tests/test_document_scanner.py` | 36 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary crash safety |
| `tests/test_document_scanner.py` | 37 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary crash safety |
| `tests/test_app_config.py` | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption |
| `tests/test_checkpoint.py` | 18 | Checkpoint key stability, save/load/clear, wrong-key isolation, delta token round-trip |
| `tests/test_db.py` | 24 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle |
| `tests/test_db.py` | 23 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle |
| `tests/test_routes.py` | 16 | Core route behaviour — scan status/start/stop, DB stats, dispositions, Excel and Article 30 export |
| `tests/test_route_integration.py` | 54 | Viewer token CRUD, role/user scope enforcement, bulk disposition isolation, viewer PIN, interface PIN gate, scan lock release on failure, session history ordering, profile routes CRUD and rename |
| `tests/test_google_scan.py` | 19 | Google scan routes (users/start/cancel) and `_run_google_scan` engine with mocked connector, checkpoints, and DB |
| `tests/test_updates.py` | 11 | Software-update routes — check/apply with mocked git, scan-running refusal, dirty-tree auto-stash, requirements reinstall, settings round-trip |
Each new module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.
Each unit-test module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.
The test suite should be run before every release and after any change to `document_scanner.py`, `cpr_detector.py`, or `gdpr_db.py`. CPR detection is the legal core of the tool — a false negative means a real GDPR violation goes undetected.
#### Local-file scan fixtures
`tests/fixtures/local_files/` provides 19 files for end-to-end testing of the file scanner via the UI or `file_scanner.py`. Drop the folder as a local source and run a scan: all 13 PII-bearing files should be flagged and all 6 negative-case files should produce zero hits.
| File | Format | Expected | Scenario |
|---|---|---|---|
| `01_cpr_with_context_label.txt` | TXT | Flag | CPR with explicit `CPR-nummer:` label |
| `02_cpr_mod11_valid_bare.txt` | TXT | Flag | mod-11-valid CPR without any context keyword |
| `03_cpr_post2007_with_context.txt` | TXT | Flag | Post-2007 birth (fails mod-11), detected via `Personnummer:` keyword |
| `04_multiple_cprs.txt` | TXT | Flag | 3 distinct CPR numbers in one staff-records file |
| `05_student_register.csv` | CSV | Flag | 8 students incl. one protected-address (day+40) CPR |
| `06_employee_list.csv` | CSV | Flag | 5 employees with CPRs |
| `07_protected_number.txt` | TXT | Flag | Protected CPR (`410172-1200`, day+40 encoding) |
| `08_mixed_pii.txt` | TXT | Flag | CPR + email + phone + GDPR Art. 9 health category |
| `09_cpr_in_docx.docx` | DOCX | Flag | 2 CPRs in a Word document (paragraph format) |
| `10_clean_no_pii.txt` | TXT | **No flag** | Meeting minutes — no personal data |
| `11_false_positive_invoice.txt` | TXT | **No flag** | Invoice: CPR-shaped numbers suppressed by `faktura`/`varenr` context |
| `12_post2007_no_context.txt` | TXT | **No flag** | Equipment serial that looks like a post-2007 CPR but has no context keyword |
| `13_cpr_in_xlsx.xlsx` | XLSX | Flag | Excel workbook with two sheets: students + employees |
| `14_audio_artist_pii.mp3` | MP3 | Flag | ID3 artist/title tags with a personal name → `exif_pii` |
| `15_audio_artist_pii.flac` | FLAC | Flag | Vorbis comment artist/title tags with a personal name → `exif_pii` |
| `16_audio_no_pii.mp3` | MP3 | **No flag** | Empty ID3 header — no metadata tags |
| `17_audio_no_pii.flac` | FLAC | **No flag** | FLAC with no Vorbis comment block |
| `18_video_gps.mp4` | MP4 | Flag | QuickTime GPS coordinates (Copenhagen) + artist tag → `gps_location` + `exif_pii` |
| `19_video_no_pii.mp4` | MP4 | **No flag** | Minimal MP4 container with no metadata |
All CPR numbers are mathematically valid (verified against `is_valid_cpr`). Run `generate_fixtures.py` inside the venv to regenerate all binary files after any changes. Requires `python-docx`, `openpyxl`, and `mutagen` (all included in `requirements.txt`).
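For readers unfamiliar with the mod-11 rule mentioned in the fixture table, here is a small sketch of the classic checksum. It is not the project's `is_valid_cpr` (which, per the table, also accepts post-2007 numbers via context keywords); `passes_mod11` is a hypothetical helper that only applies the standard weight vector.

```python
# Standard CPR mod-11 weights, applied to the ten digits left to right.
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def passes_mod11(cpr: str) -> bool:
    """Return True if the ten digits of `cpr` satisfy the mod-11 checksum."""
    digits = [int(c) for c in cpr if c.isdigit()]
    if len(digits) != 10:
        return False
    weighted_sum = sum(d * w for d, w in zip(digits, CPR_WEIGHTS))
    return weighted_sum % 11 == 0
```

The protected-number fixture value `410172-1200` yields a weighted sum of 88, which is divisible by 11, so it passes this check.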
### Roadmap
See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implementation status.
@ -575,21 +707,22 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| File | Description |
|---|---|
| `gdpr_scanner.py` | Flask entry point — scan orchestration, SSE route (`/api/scan/stream`), root route |
| `scan_engine.py` | M365 and local/SMB scan logic — `run_scan()`, `run_file_scan()` |
| `scan_engine.py` | M365 and local/SMB/SFTP scan logic — `run_scan()`, `run_file_scan()` |
| `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
| `sse.py` | SSE broadcast queue and `_current_scan_id` |
| `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
| `cpr_detector.py` | CPR pattern matching and validation |
| `cpr_detector.py` | CPR pattern matching and validation. Defines `SUPPORTED_EXTS` — the single source of truth for which file extensions are scanned across all sources (M365, Google Drive, local/SMB). Also contains `VIDEO_EXTS` and `AUDIO_EXTS` subsets and the metadata extractors `_extract_video_metadata` / `_extract_audio_metadata`. |
| `document_scanner.py` | Core scanning, redaction, OCR, NER, and PII detection engine |
| `gdpr_db.py` | SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history |
| `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
| `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. |
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. |
| `sftp_connector.py` | SFTP file iterator — `SFTPScanner.iter_files()` yields the same `(path, bytes, metadata)` tuple as `FileScanner`. Uses paramiko (`AutoAddPolicy`); supports password auth and private-key auth (RSA / Ed25519 / ECDSA / DSS). Passwords and key passphrases are stored in the OS keychain; key files live in `~/.gdprscanner/sftp_keys/`. Gracefully degrades when paramiko is not installed (`SFTP_OK` flag). |
| `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
| `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
| `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
| `static/js/state.js` | Shared mutable state module (`export const S`) — imported by all 11 feature modules |
| `static/js/*.js` | 11 ES modules: `ui`, `log`, `users`, `auth`, `profiles`, `scan`, `results`, `sources`, `scheduler`, `connector`, `viewer` |
| `static/js/state.js` | Shared mutable state module (`export const S`) — imported by all 12 feature modules |
| `static/js/*.js` | 12 ES modules: `ui`, `log`, `users`, `auth`, `profiles`, `scan`, `results`, `sources`, `scheduler`, `connector`, `viewer`, `history` |
| `static/app.js` | Archived JS monolith — no longer loaded |
| `routes/__init__.py` | Blueprint package marker |
| `routes/state.py` | Shared mutable state (`connector`, `flagged_items`, `LANG`, scan locks) — imported by all blueprints |
@ -604,12 +737,15 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| `routes/email.py` | `/api/smtp/*` and `/api/send_report` |
| `routes/database.py` | `/api/db/*`, `/api/admin/*`, `/api/preview`, `/api/thumb` |
| `routes/export.py` | `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk` |
| `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management |
| `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management, role-scoped and user-scoped tokens |
| `routes/app_routes.py` | `/api/about`, `/api/langs`, `/api/lang`, `/manual` |
| `routes/updates.py` | `/api/update/*` — software update check/apply, auto-update background thread |
| `update_gdpr.sh` | CLI/cron self-update script — fetch, fast-forward merge, dependency reinstall, service restart |
| `docs/manuals/MANUAL-EN.md` | End-user manual in English (15 sections) — served at `/manual?lang=en` |
| `docs/manuals/MANUAL-DA.md` | End-user manual in Danish (15 sections) — served at `/manual?lang=da` |
| `docs/setup/M365_SETUP.md` | Step-by-step Microsoft 365 setup guide |
| `docs/setup/GOOGLE_SETUP.md` | Step-by-step Google Workspace setup guide |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy — LAN-only deployment with Let's Encrypt DNS-01 |
| `build_gdpr.py` | PyInstaller build script — generates `m365_launcher.py`, packages desktop app |
| `lang/en.json` | English translations (source of truth) |
| `lang/da.json` | Danish translations (primary language) |


@ -54,10 +54,10 @@ Out of scope:
## Data Handling Notes for Security Researchers
- CPR numbers are stored in the SQLite database as **SHA-256 hashes only** — never in plaintext
- SMTP passwords are stored in `~/.gdpr_scanner_smtp.json` with chmod 600
- Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdpr_scanner_config.json`
- Scan results are stored locally in `~/.gdpr_scanner.db` — never transmitted externally
- The web UI binds to `127.0.0.1` by default — it is not designed to be exposed to the internet
- SMTP passwords are stored in `~/.gdprscanner/smtp.json` with chmod 600
- Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdprscanner/token.json`
- Scan results are stored locally in `~/.gdprscanner/scanner.db` — never transmitted externally
- The web UI binds to `0.0.0.0` by default so reviewers on the LAN can reach it — it is not designed to be exposed to the internet. For encrypted transport, put it behind a TLS-terminating reverse proxy and bind the app to loopback with `--host 127.0.0.1` — see [docs/setup/ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md)
---

File diff suppressed because it is too large

161
TODO.md

@ -1,11 +1,35 @@
# TODO — Pending features and sustainability
Quick overview of what's still to be done. Full details in [SUGGESTIONS.md](SUGGESTIONS.md).
Quick overview of what's still to be done.
---
## Recently completed
### Bulk disposition tagging + disposition stats ✅
Select mode (filter bar "Vælg" button) reveals per-card checkboxes. Bulk tag bar appears at bottom of grid when items are selected; a single disposition dropdown + Apply sends `POST /api/db/disposition/bulk`. Stats bar shows total · unreviewed · retain · delete · % reviewed and updates after every save.
---
### Google Drive delta scan ✅
Drive scanning now uses the Google Drive Changes API when `delta` is enabled in scan options. First run records a start page token per user (`gdrive:{email}` in `delta.json`). Subsequent runs fetch only changed/new files. Invalid tokens fall back to a full scan automatically. Token save is load-then-merge to avoid overwriting concurrent M365 delta token writes.
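A load-then-merge save, as mentioned above, might look roughly like this. The function name `save_tokens_merged` and the atomic-rename detail are assumptions for illustration; only the idea of re-reading the file before writing comes from the note.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def save_tokens_merged(path: Path, updates: Dict[str, str]) -> Dict[str, str]:
    """Merge `updates` into the token file instead of overwriting it."""
    try:
        current = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        current = {}  # first run, or an unreadable file: start fresh
    current.update(updates)  # keys written by the other engine survive

    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(current, fh, indent=2)
    os.replace(tmp_name, path)
    return current
```

Re-reading immediately before the write narrows the window in which a concurrent writer's keys could be lost; it does not remove it entirely, which would need a file lock.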
---
### Auto-email after scheduled scan ✅ (already existed)
The scheduler already has an "Email report automatically" checkbox (`auto_email` flag in job config). `_send_email_report()` in `scan_scheduler.py` handles it after each scheduled scan completes — tries Microsoft Graph first, falls back to SMTP. Enable it in the scheduler settings panel.
---
### PDF OCR OOM kills on large documents ✅
`document_scanner` called `convert_from_path()` for the whole PDF before the processing loop, allocating all page images at once. A 50-page A4 at 300 DPI required ~1.3 GB in a single shot — enough to trigger the OS OOM killer.
Fixed in `scan_pdf`, `redact_fitz_pdf`, and `redact_pdf`:
- Replaced bulk pre-render with `convert_from_path(first_page=N, last_page=N)` inside the loop — one page in memory at a time
- Added `_ocr_mem_ok()` guard (checks `psutil.virtual_memory().available >= 500 MB`) before each render; pages that fail the check are skipped and recorded as `"skipped"` in `page_methods` with a printed warning
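The ~1.3 GB figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes uncompressed 8-bit RGB page images (3 bytes per pixel); `rendered_page_bytes` is a hypothetical helper, not part of `document_scanner`.

```python
def rendered_page_bytes(dpi: int, width_mm: float = 210.0,
                        height_mm: float = 297.0, channels: int = 3) -> int:
    """Approximate in-memory size of one rendered page (A4 by default)."""
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    return width_px * height_px * channels


per_page = rendered_page_bytes(300)   # roughly 26 MB for one A4 page
bulk = 50 * per_page                  # all 50 pages rendered up front
print(f"one page: {per_page / 1e6:.1f} MB, 50 pages: {bulk / 1e9:.2f} GB")
```

Rendering page by page keeps the peak near the single-page figure instead of the 50-page total.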
---
### Memory exhaustion during large M365 scans ✅
Six root causes fixed in `scan_engine.py` and `document_scanner.py`:
- Email body HTML stripped at collection time (`body` key deleted from each message dict before it enters `work_items`; plain text stored as `_precomputed_body` instead)
@ -41,6 +65,141 @@ Full spec in SUGGESTIONS.md §29.
A shareable URL (token-protected) or numeric PIN that gives a DPO, school principal, or compliance coordinator read-only access to the results grid — with disposition tagging but without scan controls, credentials, or delete access. Full spec in SUGGESTIONS.md §33.
**Size:** Medium · **Priority:** Medium
### OneDrive 404 errors — investigate and handle appropriately ✅
404 on `drive/root/delta` during delta scans was being broadcast as a red `scan_error`. Root cause: `_get()` hit `raise_for_status()` for 404s, which fell through to the generic `except Exception` handler in `_scan_user_onedrive`. The full-scan path silently swallowed the same 404 via `except Exception: return` in `_iter_drive_folder_for`.
Fixed by adding `M365DriveNotFound(M365Error)` exception, raising it from `_get()` on 404, and catching it explicitly in `_scan_user_onedrive` with a lower-severity `scan_phase` broadcast ("OneDrive (user): not provisioned — skipped") instead of a red error card.
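The shape of that fix can be sketched as below. The two exception names come from the note above; `check_status`, `scan_user_drive`, and the `broadcast` callback signature are simplified, hypothetical stand-ins for `_get()`, `_scan_user_onedrive`, and the SSE broadcaster.

```python
from typing import Callable


class M365Error(Exception):
    """Base class for connector errors."""


class M365DriveNotFound(M365Error):
    """Raised when a drive endpoint answers HTTP 404."""


def check_status(status: int, url: str) -> None:
    """Hypothetical stand-in for the status handling inside `_get()`."""
    if status == 404:
        raise M365DriveNotFound(url)
    if status >= 400:
        raise M365Error(f"HTTP {status} for {url}")


def scan_user_drive(user: str, status: int,
                    broadcast: Callable[[str, str], None]) -> None:
    """Catch the specific 404 case before the generic handler sees it."""
    try:
        check_status(status, f"/users/{user}/drive/root/delta")
    except M365DriveNotFound:
        # Low-severity log line instead of a red error card.
        broadcast("scan_phase", f"OneDrive ({user}): not provisioned, skipped")
    except Exception as exc:
        broadcast("scan_error", str(exc))
```

The order of the `except` clauses matters: the subclass must be listed first, otherwise the generic handler would swallow the 404 again.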
---
### #34 — User-scoped viewer tokens ✅
Viewer token scope extended to `{"user": ["m365@…", "gws@…"], "display_name": "Alice Smith"}`, filtering `flagged_items` by `account_id IN (list)`. Lets a single employee see only their own flagged files across both M365 and Google Workspace.
**Implemented:**
1. Scope format — `user` is a list of email strings (one per platform); `display_name` stored for UI display. Legacy single-string format coerced to list automatically.
2. Token creation UI — scope-type selector (`All` / `Role` / `User`) reveals either the role select or a searchable name autocomplete. Autocomplete filters `S._allUsers` by display name or email; rows show name + both emails for dual-platform users. Selected user's full name fills the input; both emails stored in the scope.
3. `GET /api/db/flagged` — filters `WHERE account_id IN (scope.user set)`, covering items from both platforms.
4. Viewer header — `#viewerIdentityBadge` shows `scope.display_name` (full name); `#filterRole` hidden.
5. `POST /api/viewer/tokens` — validates all entries in `scope.user` contain `@`; rejects combined `role`+`user` scope.
6. Token list — shows display name badge; falls back to emails joined with `, `.
**Size:** Small · **Priority:** Medium
---
### Scan history browser ✅
Review results from any past scan session without running a new scan.
**Implemented:**
1. `gdpr_db.py` → `get_sessions(limit=50, window_seconds=300)`: groups `scans` rows into 300 s windows (same logic as `get_session_items`), returns newest-first list with `ref_scan_id` (highest scan_id in group), timestamps, sources set, flagged count, total scanned, and a delta flag.
2. `gdpr_db.py` → `get_session_items(ref_scan_id=N)`: when `ref_scan_id` given, anchors the 300 s window to that scan's `started_at` instead of the latest scan.
3. `GET /api/db/sessions` (new endpoint in `routes/database.py`) — returns the sessions list; viewer-mode sessions share the same `GET /api/db/flagged?ref=N` endpoint with scope enforcement intact.
4. `static/js/history.js` (new module) — `loadHistorySession(refScanId)`, `openHistoryPicker()`, `closeHistoryPicker()`, `exitHistoryMode()`, `invalidateHistoryCache()` all exposed on `window.*`. Session cache (`_sessions`) invalidated by all `*_done` SSE handlers so the picker stays fresh after a new scan.
5. History banner (`#historyBanner`) — shows session date/time, sources, item count; "Sessions" button opens picker dropdown; "Latest scan" button appears only when not already viewing the latest.
6. Auto-load on page load — `results.js` calls `window.loadHistorySession?.(null)` when the SSE watchdog detects `!status.running`; `null` resolves to the latest completed session.
7. Live→history transition: clicking a session in the picker sets `S._historyRefScanId` and shows the banner. History→live transition: `startScan()` calls `window.exitHistoryMode?.()`.
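One plausible reading of the 300 s session grouping is sketched here. It assumes each window is anchored at the newest scan not yet assigned to a session, and that `scans` rows reduce to `(scan_id, started_at)` pairs with Unix timestamps; the real `get_sessions` may differ in both respects.

```python
from typing import Dict, List, Tuple


def group_sessions(scans: List[Tuple[int, float]],
                   window_seconds: int = 300) -> List[Dict]:
    """Group (scan_id, started_at) rows into sessions, newest first."""
    sessions: List[Dict] = []
    for scan_id, started_at in sorted(scans, key=lambda s: s[1], reverse=True):
        last = sessions[-1] if sessions else None
        if last and last["anchor"] - started_at <= window_seconds:
            # Started within the window of the session's newest scan.
            last["scan_ids"].append(scan_id)
            last["ref_scan_id"] = max(last["ref_scan_id"], scan_id)
        else:
            sessions.append({"anchor": started_at,
                             "ref_scan_id": scan_id,
                             "scan_ids": [scan_id]})
    return sessions
```

Under this reading, an M365 scan and a Google scan started a few seconds apart land in the same session, while a scan started an hour later opens a new one.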
---
### Gmail SMTP error message when App Password already in use ✅
The `535` auth error from Gmail fires for wrong app password, revoked app password, spaces in the 16-char code, and wrong username — all indistinguishable at the SMTP level. The old message unconditionally told users to "create an App Password", which is unhelpful when they already have one. Both the `smtp_test` and `send_report` error handlers now emit a Gmail-specific message that lists the three common causes and links to the App Password page for regeneration.
---
### Interface PIN ✅
Optional session-level authentication gate for the main scanner interface. Set in **Settings → Security → Interface PIN**. When set, any request to the main UI or API redirects to `/login` until the correct PIN is entered. `/view` and all viewer auth routes are exempt. Salted SHA-256 hash stored in `config.json`. Rate-limited: 5 failures per IP per 5 minutes.
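A rough sketch of the two mechanisms named above, salted SHA-256 hashing and a sliding-window failure limit. All names here (`hash_pin`, `verify_pin`, `FailureLimiter`) are hypothetical, and the storage layout in `config.json` is not modelled.

```python
import hashlib
import hmac
import os
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


def hash_pin(pin: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (salt_hex, sha256_hex) for a PIN, generating a salt if needed."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + pin.encode("utf-8")).hexdigest()
    return salt.hex(), digest


def verify_pin(pin: str, salt_hex: str, digest: str) -> bool:
    """Constant-time comparison of a candidate PIN against the stored hash."""
    _, candidate = hash_pin(pin, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, digest)


class FailureLimiter:
    """Block an IP after `max_failures` failed attempts within `window_seconds`."""

    def __init__(self, max_failures: int = 5, window_seconds: int = 300) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, ip: str, now: Optional[float] = None) -> None:
        self._failures[ip].append(time.time() if now is None else now)

    def blocked(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        attempts = self._failures[ip]
        # Drop failures that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        return len(attempts) >= self.max_failures
```

A login handler would call `blocked(ip)` before checking the PIN and `record_failure(ip)` after each wrong attempt.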
---
### OCR language override ✅
Tesseract language pack(s) used for scanned PDFs and images are now configurable per profile. Option `ocr_lang` (default `dan+eng`). Presets: `dan+eng`, `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner.scan_pdf`/`scan_image` and the spawned PDF-OCR subprocess. OCR result cache keys include `lang` so per-language results are cached independently. Sidebar select `#optOcrLang`; profile editor `#peOptOcrLang`.
---
### CPR-only mode ✅
New scan option `cpr_only` (default `false`). When enabled, items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped — only items with at least one qualifying CPR number are flagged. Implemented as a compact short-circuit at each engine's flagging gate. Sidebar toggle `#optCprOnly`; profile editor `#peOptCprOnly`.
Also added `min_cpr_count` (default `1`) — minimum number of **distinct** CPR numbers required before a file is flagged. Files with faces or EXIF PII are still flagged regardless of this threshold.
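How the two options might combine at a flagging gate is sketched below. `should_flag` is a hypothetical function; it assumes the `min_cpr_count` threshold also applies in CPR-only mode, and it deliberately leaves email and phone hits outside CPR-only mode unmodelled, since the note does not say how they interact with the threshold.

```python
from typing import Iterable


def should_flag(cpr_numbers: Iterable[str], has_face_or_exif: bool,
                cpr_only: bool = False, min_cpr_count: int = 1) -> bool:
    """Decide whether an item is flagged, given its CPR hits and media signals."""
    distinct = len(set(cpr_numbers))          # duplicates count once
    cpr_ok = distinct >= max(1, min_cpr_count)
    if cpr_only:
        # Faces and EXIF/GPS metadata alone never flag an item in this mode.
        return cpr_ok
    # Otherwise faces / EXIF PII flag the item regardless of the threshold.
    return cpr_ok or has_face_or_exif
```

For example, a file repeating the same CPR number twice has one distinct hit, so it is not flagged when `min_cpr_count` is 2 unless another signal applies.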
---
### Skip GPS images ✅
Scan option `skip_gps_images` (default `false`). When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still stored in the card `exif` field if the item is flagged by another signal. Sidebar toggle `#optSkipGps`; profile editor `#peOptSkipGps`.
---
### CPR cross-referencing (related documents) ✅
The preview panel now shows a "Related documents" section listing other items in the same scan session that share ≥1 CPR number. Clicking any related item opens its preview. Implemented as a query-time self-join on the existing `cpr_index` table — no new data collection needed. `GET /api/db/related/<item_id>?ref=N` returns rows ordered by shared CPR count descending.
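The query-time self-join could look roughly like this. The column layout `cpr_index(scan_id, item_id, cpr_hash)` is an assumption made for the sketch, and a single `scan_id` stands in for the scan session; the real schema and session scoping may differ.

```python
import sqlite3
from typing import List, Tuple


def related_items(conn: sqlite3.Connection, item_id: str,
                  scan_id: int) -> List[Tuple[str, int]]:
    """Items sharing at least one CPR hash with `item_id`, most shared first."""
    return conn.execute(
        """
        SELECT b.item_id, COUNT(DISTINCT b.cpr_hash) AS shared
        FROM cpr_index AS a
        JOIN cpr_index AS b
          ON b.cpr_hash = a.cpr_hash
         AND b.scan_id  = a.scan_id
         AND b.item_id != a.item_id
        WHERE a.item_id = ? AND a.scan_id = ?
        GROUP BY b.item_id
        ORDER BY shared DESC, b.item_id
        """,
        (item_id, scan_id),
    ).fetchall()
```

Since only hashes are joined, the lookup never needs a plaintext CPR number.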
---
### Email preview on checkpoint resume ✅
A 500-character plain-text body excerpt (`body_excerpt`) is now stored per flagged email at broadcast time and persisted in the DB. When the preview modal opens for an email item, this excerpt is shown immediately without requiring a live Graph/Gmail connection. Enables email preview to work correctly after a server restart and checkpoint resume.
---
### Built-in file redaction ✅
Local files (`.docx`, `.xlsx`, `.csv`, `.txt`) can be redacted in-place: CPR numbers are replaced by `██████-████` / `█` blocks, the card is removed from the grid, and a `"redacted"` disposition is logged. The ✂ button appears on redactable local file cards (hidden in viewer mode and for resolved items). File is written to a temp path in the same directory before `shutil.move` to avoid cross-device rename failures.
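For plain-text files, the write-to-temp-then-move pattern described above can be sketched as follows. The regex and the `redact_text_file` name are illustrative assumptions; the scanner's own detector is stricter than a bare `DDMMYY-XXXX` pattern.

```python
import os
import re
import shutil
import tempfile
from pathlib import Path

# Simplified pattern for the sketch: six digits, hyphen, four digits.
CPR_RE = re.compile(r"\b\d{6}-\d{4}\b")
MASK = "██████-████"


def redact_text_file(path: Path) -> int:
    """Replace CPR-shaped numbers in a UTF-8 text file; return the hit count."""
    text = path.read_text(encoding="utf-8")
    redacted, count = CPR_RE.subn(MASK, text)
    if count == 0:
        return 0
    # Temp file in the SAME directory, so the final move stays on one device.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(redacted)
        shutil.move(tmp_name, str(path))
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # never leave a half-written temp file behind
        raise
    return count
```

Writing next to the target first means a crash mid-write leaves the original file intact rather than truncated.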
---
### Date-range scoping for viewer tokens ✅
Viewer tokens can now carry `valid_from` and/or `valid_to` fields (YYYY-MM-DD). `GET /api/db/flagged` filters out items whose `modified` date falls outside the range. All three scope dimensions (role, user, date-range) are independent and combinable. The share modal exposes `#shareValidFrom` / `#shareValidTo` date inputs. Token list shows a green date-range badge when a range is present.
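The date-range check amounts to a simple inclusive comparison. The sketch below assumes `modified` is an ISO 8601 string whose first ten characters are the date, and that both bounds are inclusive; `in_token_range` is a hypothetical helper, not the code behind `GET /api/db/flagged`.

```python
from datetime import date
from typing import Optional


def in_token_range(modified: str, valid_from: Optional[str] = None,
                   valid_to: Optional[str] = None) -> bool:
    """True if the item's modified date lies within the token's date range."""
    day = date.fromisoformat(modified[:10])  # ignore any time-of-day part
    if valid_from and day < date.fromisoformat(valid_from):
        return False
    if valid_to and day > date.fromisoformat(valid_to):
        return False
    return True  # either bound may be omitted
```

Because each bound is optional, a token with only `valid_from` acts as an "everything since" filter.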
---
### Re-scan diff ✅
When viewing a history session, items present in the immediately preceding session but absent from the current one are shown below a `.resolved-divider` separator with a green ✓ Resolved badge (opacity dimmed). These resolved items are grid-only — they are not added to `S.flaggedData` and cannot be bulk-selected or exported. The history banner shows a resolved count when applicable.
---
### Tests for Google Workspace scan engine ✅
19 tests added in `tests/test_google_scan.py` covering: `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`, and `_run_google_scan` engine internals. Uses synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
---
### Compliance audit log ✅
Every significant admin action is written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, action key, human-readable detail, and client IP. `GET /api/audit_log` returns newest-first (max 1000; filterable by `?action=`). Visible in Settings → **Audit Log** tab; refreshes when the tab is opened. `log_audit_event()` helper in `gdpr_db.py` silently no-ops if the DB is unavailable.
---
### Scheduled report-only email job ✅
Scheduler jobs can now be configured as "report only" (toggle `#schedReportOnly`). The job skips the scan entirely and emails the latest results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from DB via `get_session_items()`. M365 auth is not required — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Enabling report-only automatically checks "Email report automatically" and dims the Profile field (unused for report-only runs).
---
### SFTP as a 4th file connector ✅
Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()` and everything downstream (SSE, DB, export, scheduling) is unchanged. Auth supports password and SSH private key (+ optional passphrase). Key files stored in `~/.gdprscanner/sftp_keys/`. SFTP sources appear in the file sources panel with a 🔒 icon, are profile-aware, and are included in scheduled scans automatically.
**Files changed:** `sftp_connector.py` (new), `scan_engine.py`, `routes/sources.py`, `app_config.py`, `static/js/sources.js`, `templates/index.html`, `lang/en|da|de.json`, `routes/export.py`, `requirements.txt`
---
### Checkpoint / resume for Google and File scans ✅
Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
---
### Extended document anonymisation (redaction beyond local DOCX/XLSX/CSV/TXT)
Currently the ✂ redact button only works for local files with extensions `.docx`, `.xlsx`, `.csv`, `.txt`. Several valuable cases are not yet covered:
**1. PDF redaction for local files** ✅ — `redact_pdf_secure` (PyMuPDF physical redaction) wired to `_REDACT_EXTS` and the ✂ button. Falls back to reportlab overlay if PyMuPDF is absent.
**2. OneDrive / SharePoint / Teams file redaction** ✅ — `put_drive_item_content()` added to `m365_connector.py`; `redact_item()` in `routes/export.py` extended with a cloud branch: download via Graph, redact to a local temp file, re-upload via PUT. Supports DOCX, XLSX, PDF. ✂ button shown on cloud cards with supported extensions.
**3. Google Drive file redaction** ✅ — `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` added to both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`. `redact_item()` extended with a `gdrive` branch: check MIME type (rejects Google Docs/Sheets), download bytes, redact locally, upload back via `files().update()`. Requires `drive` scope (not `drive.readonly`) on the service-account delegation. ✂ button shown on Drive cards with DOCX/XLSX/PDF extension.
**4. SMB / SFTP file redaction** ✅ — `write_file(remote_path, content)` added to `SFTPScanner`; `write_smb_file(path, content, user, password, domain)` added to `file_scanner.py`. `redact_item()` extended with `sftp` and `smb` branches: download via native protocol, redact locally, write back. Source config matched from `_load_file_sources()`. SFTP requires the item to still be in `state.flagged_items` (in-session only). ✂ button shown on SMB/SFTP cards with DOCX/XLSX/CSV/TXT/PDF extension.
**5. Email body redaction (Exchange / Gmail)** — overwrite the message body via Graph `PATCH /messages/{id}` or Gmail API. High effort and high risk: HTML formatting must be preserved, inline images handled, and a mistake permanently corrupts the email. **Recommendation: skip** — deleting the email is a safer and simpler GDPR response for emails containing CPR numbers.
**Priority order:** PDF (1) first since it reuses existing code. Cloud files (2-4) on demand.
**Size:** Small (PDF) · Medium (cloud/SMB/SFTP) · **Priority:** Medium
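Items 2 to 4 share one shape: download via the native protocol, redact a local temp copy, write the result back. A hedged sketch of that flow follows; `redact_remote` and its three callables are hypothetical, standing in for the connector methods named above (for example `put_drive_item_content()` or `write_file()`).

```python
import tempfile
from pathlib import Path
from typing import Callable


def redact_remote(
    download: Callable[[], bytes],         # hypothetical: fetch the remote file
    redact_file: Callable[[Path], None],   # hypothetical: redact a local file in place
    upload: Callable[[bytes], None],       # hypothetical: write the bytes back
    suffix: str,
) -> None:
    """Download, redact in a local temp file, then write back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local = Path(tmp_dir) / f"redact{suffix}"
        local.write_bytes(download())
        redact_file(local)          # raises on failure, so nothing is uploaded
        upload(local.read_bytes())  # only reached if redaction succeeded
```

Uploading only after the local redaction has returned means a failed redaction leaves the remote file untouched.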
---
### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.


@ -1 +1 @@
1.6.14
1.7.9


@ -276,6 +276,44 @@ def _admin_pin_is_set() -> bool:
return bool(_get_admin_pin_hash())
# ── Interface PIN ─────────────────────────────────────────────────────────────
# Salted SHA-256, stored in config.json under "interface_pin".
# When set, the main web interface requires PIN authentication before the
# index page or any /api/* route is accessible (viewer routes are exempt).
_INTERFACE_PIN_KEY = "interface_pin"
def get_interface_pin_hash() -> "dict | None":
"""Return the stored interface PIN hash dict, or None if not set."""
return _load_config().get(_INTERFACE_PIN_KEY)
def set_interface_pin(pin: str) -> None:
import secrets as _sec
if not pin:
raise ValueError("PIN must not be empty")
salt = _sec.token_hex(16)
h = _hashlib.sha256((salt + pin).encode()).hexdigest()
cfg = _load_config()
cfg[_INTERFACE_PIN_KEY] = {"hash": h, "salt": salt}
_save_config(cfg)
def verify_interface_pin(pin: str) -> bool:
"""Return True if *pin* matches the stored hash."""
meta = get_interface_pin_hash()
if not meta:
return False
return _hashlib.sha256((meta["salt"] + pin).encode()).hexdigest() == meta["hash"]
def clear_interface_pin() -> None:
cfg = _load_config()
cfg.pop(_INTERFACE_PIN_KEY, None)
_save_config(cfg)
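A standalone sketch of the salted SHA-256 scheme used by the interface PIN helpers above. It mirrors the `{"hash", "salt"}` layout but swaps the `==` comparison for `hmac.compare_digest`; that substitution is a suggestion for illustration, not what the diff does. `hash_pin` and `verify_pin` are hypothetical names.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def hash_pin(pin: str) -> dict:
    """Return {"hash", "salt"} for a new PIN, mirroring the stored layout."""
    if not pin:
        raise ValueError("PIN must not be empty")
    salt = secrets.token_hex(16)  # 32 hex characters
    digest = hashlib.sha256((salt + pin).encode()).hexdigest()
    return {"hash": digest, "salt": salt}


def verify_pin(pin: str, meta: Optional[dict]) -> bool:
    """Return True if pin matches the stored salted digest."""
    if not meta:
        return False
    candidate = hashlib.sha256((meta["salt"] + pin).encode()).hexdigest()
    # compare_digest is designed to resist timing analysis of the comparison
    return hmac.compare_digest(candidate, meta["hash"])
```

A short numeric PIN has little entropy, so a single SHA-256 round may be cheap to brute-force if the config file leaks; a key-derivation function such as `hashlib.pbkdf2_hmac` would be one possible hardening.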
def _load_config() -> dict:
if _CONFIG_FILE.exists():
try:
@ -291,6 +329,43 @@ def _save_config(cfg: dict):
pass
# ── Claude NER config ─────────────────────────────────────────────────────────
def get_claude_config() -> dict:
cfg = _load_config()
return {
"enabled": bool(cfg.get("claude_ner", False)),
"api_key_set": bool(cfg.get("claude_api_key", "")),
}
def save_claude_config(enabled: bool, api_key: "str | None" = None) -> None:
cfg = _load_config()
cfg["claude_ner"] = bool(enabled)
if api_key is not None:
# Encrypt at rest with the machine-keyed Fernet (same as the SMTP
# password). Falls back to plaintext only if cryptography is missing.
cfg["claude_api_key"] = _encrypt_password(api_key) if api_key else ""
_save_config(cfg)
def get_claude_api_key() -> str:
"""Return the decrypted Claude API key (handles legacy plaintext)."""
return _decrypt_password(_load_config().get("claude_api_key", ""))
# ── Software update config ────────────────────────────────────────────────────
def get_update_config() -> dict:
return {"auto_update": bool(_load_config().get("auto_update", False))}
def save_update_config(auto_update: bool) -> None:
cfg = _load_config()
cfg["auto_update"] = bool(auto_update)
_save_config(cfg)
# ── Profile storage (15a) ─────────────────────────────────────────────────────
_SETTINGS_PATH = _DATA_DIR / "settings.json"
_SRC_TOGGLES_PATH = _DATA_DIR / "src_toggles.json"
@ -506,6 +581,8 @@ def _save_role_overrides(overrides: dict) -> None:
# ── File source settings (#8) ─────────────────────────────────────────────────
_FILE_SOURCES_PATH = _DATA_DIR / "file_sources.json"
_SFTP_KEYS_DIR = _DATA_DIR / "sftp_keys"
_SFTP_KEYS_DIR.mkdir(exist_ok=True)
def _load_file_sources() -> list:
@ -530,6 +607,32 @@ def _save_file_sources(sources: list) -> None:
except Exception as e:
logger.error("[file_sources] write failed: %s", e)
def _resolve_sftp_credentials(source: dict) -> dict:
"""Return a copy of source with password/passphrase resolved from keychain.
Callers (run_file_scan, upload_key endpoint) should use this rather than
reading keychain credentials themselves, so the lookup logic stays in one place.
"""
try:
from sftp_connector import get_sftp_password
except ImportError:
return source
resolved = dict(source)
keychain_key = source.get("keychain_key") or None
host = source.get("sftp_host", "")
user = source.get("sftp_user", "")
if not resolved.get("sftp_password"):
resolved["sftp_password"] = get_sftp_password(host, user, keychain_key)
if not resolved.get("sftp_passphrase"):
# Passphrase stored under a distinct account name
passphrase_key = (keychain_key + ":passphrase") if keychain_key else None
resolved["sftp_passphrase"] = get_sftp_password(host, user, passphrase_key)
return resolved
# ── Viewer tokens ────────────────────────────────────────────────────────────
# Read-only viewer tokens allow sharing scan results with a DPO or compliance
# officer without exposing scan controls or credentials. Each token is a
@ -558,12 +661,14 @@ def _save_viewer_tokens(tokens: list) -> None:
logger.error("[viewer_tokens] write failed: %s", e)
def create_viewer_token(label: str = "", expires_days: int | None = None) -> dict:
def create_viewer_token(label: str = "", expires_days: int | None = None, scope: dict | None = None) -> dict:
"""Generate a new viewer token, persist it, and return the token dict.
Args:
label: Human-readable description (e.g. "DPO review April 2026").
expires_days: Days until expiry. None = no expiry.
scope: Optional access scope, e.g. {"role": "student"} or {"role": "staff"}.
Empty dict / None means unrestricted.
"""
import secrets as _secrets
token = _secrets.token_hex(32) # 64-char URL-safe hex string
@ -571,6 +676,7 @@ def create_viewer_token(label: str = "", expires_days: int | None = None) -> dic
entry: dict = {
"token": token,
"label": label or "",
"scope": scope or {},
"created_at": now,
"expires_at": now + expires_days * 86400 if expires_days else None,
"last_used_at": None,
@ -707,7 +813,7 @@ def clear_viewer_pin() -> None:
# ── SMTP password encryption ─────────────────────────────────────────────────
# The SMTP password is encrypted at rest using Fernet symmetric encryption.
# The encryption key is derived from a stable machine-specific UUID stored in
# ~/.gdpr_scanner_machine_id. This key is only usable on the same machine —
# ~/.gdprscanner/machine_id. This key is only usable on the same machine —
# the encrypted password cannot be decrypted if the config file is copied to
# another host.
@ -772,6 +878,13 @@ def _load_smtp_config() -> dict:
cfg = json.loads(_SMTP_CONFIG_PATH.read_text(encoding="utf-8"))
if cfg.get("password"):
cfg["password"] = _decrypt_password(cfg["password"])
# Normalise legacy key names written by an older settings-tab UI
# (`user`/`starttls`) to the canonical keys every reader expects
# (`username`/`use_tls`), so configs saved before the fix still work.
if "username" not in cfg and "user" in cfg:
cfg["username"] = cfg["user"]
if "use_tls" not in cfg and "starttls" in cfg:
cfg["use_tls"] = cfg["starttls"]
return cfg
except Exception:
pass
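The legacy-key normalisation in this hunk can be exercised in isolation. The sketch below assumes the same two key pairs (`user`/`username`, `starttls`/`use_tls`); `normalise_smtp_keys` is a hypothetical helper, since the diff performs this inline in `_load_smtp_config()`.

```python
def normalise_smtp_keys(cfg: dict) -> dict:
    """Copy legacy SMTP keys to their canonical names without overwriting."""
    out = dict(cfg)
    for legacy, canonical in (("user", "username"), ("starttls", "use_tls")):
        # Only fill the canonical key when it is missing, so newer configs win.
        if canonical not in out and legacy in out:
            out[canonical] = out[legacy]
    return out
```

The legacy keys are left in place, so a config normalised this way still loads in an older build.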


@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)
_DATA_DIR = Path.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True)
_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
def _cp_path(prefix: str) -> Path:
return _DATA_DIR / f"checkpoint_{prefix}.json"
def _checkpoint_key(options: dict) -> str:
"""Stable hash of the scan options — used to detect when a checkpoint
@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
}, sort_keys=True)
return hashlib.sha256(sig.encode()).hexdigest()[:16]
def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
"""Write checkpoint to disk. Called periodically during scanning."""
try:
payload = {
@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
"flagged": flagged,
"meta": {k: v for k, v in meta.items() if k != "options"},
}
tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
path = _cp_path(prefix)
tmp = path.with_suffix(".tmp")
tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
tmp.replace(_CHECKPOINT_PATH)
tmp.replace(path)
except Exception as e:
logger.error("[checkpoint] save failed: %s", e)
def _load_checkpoint(key: str) -> dict | None:
def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
"""Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
try:
if not _CHECKPOINT_PATH.exists():
path = _cp_path(prefix)
if not path.exists():
return None
payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
payload = json.loads(path.read_text(encoding="utf-8"))
if payload.get("key") != key:
return None
return payload
except Exception:
return None
def _clear_checkpoint() -> None:
def _clear_checkpoint(*, prefix: str = "m365") -> None:
try:
if _CHECKPOINT_PATH.exists():
_CHECKPOINT_PATH.unlink()
path = _cp_path(prefix)
if path.exists():
path.unlink()
except Exception:
pass


@ -5,12 +5,14 @@ Provides:
_scan_bytes(content, filename) dispatch to correct scanner by file type
_scan_text_direct(text) scan a plain text string
_extract_exif(content, filename) extract PII-bearing EXIF tags from images
_extract_video_metadata(content, fn) extract PII-bearing metadata from video files
_extract_audio_metadata(content, fn) extract PII-bearing tags from audio files
_detect_photo_faces(content, fn) count faces in an image (OpenCV)
_get_pii_counts(text) NER-based PII type counts
_make_thumb(content, filename) JPEG thumbnail as base64 string
_placeholder_svg(ext, name) SVG file-type icon
Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
and _check_special_category are injected at startup by gdpr_scanner.py via
`from cpr_detector import *` AFTER those names are defined. This keeps the
module cleanly importable in isolation for unit tests (#26) while preserving
@ -20,6 +22,7 @@ from __future__ import annotations
import base64
import hashlib
import io
import re
import tempfile
import threading
from pathlib import Path
@ -47,11 +50,17 @@ except ImportError:
PILImage = None # type: ignore[assignment]
PIL_OK = False
VIDEO_EXTS = {
".mp4", ".mov", ".m4v", ".avi", ".mkv", ".wmv", ".flv", ".webm",
}
AUDIO_EXTS = {
".mp3", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".wav", ".opus", ".aiff", ".aif",
}
SUPPORTED_EXTS = {
".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
".txt", ".eml", ".msg",
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
}
} | VIDEO_EXTS | AUDIO_EXTS
PHOTO_EXTS = {
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp", ".heic", ".heif",
}
@ -190,49 +199,226 @@ def _extract_exif(content: bytes, filename: str) -> dict:
return result
def _extract_video_metadata(content: bytes, filename: str) -> dict:
"""Extract PII-bearing metadata from a video file.
"""Detect faces in an image file using OpenCV Haar cascades.
Returns the same structure as _extract_exif so callers can treat both
identically:
gps {lat, lon, lat_ref, lon_ref, maps_url} or None
pii_fields {label: value} for title/artist/comment/description
author str or None
datetime str or None
device str or None
has_pii bool
Returns the number of faces detected, or 0 if cv2 is unavailable,
the file is not a supported image format, or decoding fails.
Face detection is intentionally strict (minNeighbors=8, min_size=80px) to
reduce false positives on background textures, labels, and artwork.
Haar cascades are tuned for compliance flagging, not exhaustive detection. (#9)
MP4/MOV/M4V: reads QuickTime/MPEG-4 tags via mutagen (no system deps).
GPS is extracted from the ©xyz QuickTime atom (ISO 6709 string written by
iPhones and Android devices: "+55.6763+012.5681+005.000/").
AVI: parses the RIFF INFO list chunk without any external library.
All other extensions: returns empty result immediately.
"""
if not SCANNER_OK:
return 0
try:
cv2_mod = getattr(ds, "_get_cv2", None)
if cv2_mod is None:
return 0
cv2, np = ds._get_cv2()
if cv2 is None or np is None:
return 0
except Exception:
return 0
result: dict = {"gps": None, "pii_fields": {}, "author": None,
"datetime": None, "device": None, "has_pii": False}
ext = Path(filename).suffix.lower()
try:
# Decode image bytes → cv2 BGR array
arr = np.frombuffer(content, dtype=np.uint8)
img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
if img is None:
# imdecode failed (e.g. HEIC without codec) — try PIL fallback
if PIL_OK:
try:
from PIL import Image as _PILImg
import io as _io
pil_img = _PILImg.open(_io.BytesIO(content)).convert("RGB")
pil_arr = np.array(pil_img)
img = cv2.cvtColor(pil_arr, cv2.COLOR_RGB2BGR)
except Exception:
return 0
else:
return 0
if ext in {".mp4", ".mov", ".m4v"}:
_extract_mp4_tags(content, result)
elif ext == ".avi":
_extract_avi_info(content, result)
faces = ds.detect_faces_cv2(img, min_size=80, neighbors=8)
return len(faces)
return result
def _extract_mp4_tags(content: bytes, result: dict) -> None:
"""Populate result dict from MPEG-4/QuickTime container tags via mutagen."""
try:
import mutagen.mp4
tags = mutagen.mp4.MP4(io.BytesIO(content)).tags
if not tags:
return
# Text fields that may contain personal data
_tag_label = {
"©nam": "Title",
"©cmt": "Comment",
"©des": "Description",
"desc": "Description",
"©lyr": "Lyrics",
}
for tag, label in _tag_label.items():
val = tags.get(tag)
if val:
text = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
if len(text) >= _EXIF_PII_MIN_LEN:
result["pii_fields"][label] = text
result["has_pii"] = True
# Author — prefer ©ART (artist), fall back to album artist
for tag in ("©ART", "aART"):
val = tags.get(tag)
if val:
author = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
if len(author) >= _EXIF_PII_MIN_LEN:
result["author"] = author
result["pii_fields"]["Artist"] = author
result["has_pii"] = True
break
# Recording date
val = tags.get("©day")
if val:
result["datetime"] = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
# Device (QuickTime-specific tags written by iPhones)
make = tags.get("©mak")
model = tags.get("©mod")
if make or model:
result["device"] = " ".join(
str(v[0] if isinstance(v, list) else v).strip()
for v in (make, model) if v
)
# GPS — QuickTime ©xyz atom: "+55.6763+012.5681+005.000/" (ISO 6709)
import re as _re
for gps_tag in ("©xyz", "com.apple.quicktime.location.ISO6709"):
val = tags.get(gps_tag)
if val:
gps_str = str(val[0] if isinstance(val, list) else val).strip()
m = _re.match(r'([+-]\d+\.?\d*)([+-]\d+\.?\d*)', gps_str)
if m:
lat = round(float(m.group(1)), 7)
lon = round(float(m.group(2)), 7)
result["gps"] = {
"lat": lat,
"lon": lon,
"lat_ref": "N" if lat >= 0 else "S",
"lon_ref": "E" if lon >= 0 else "W",
"maps_url": f"https://www.google.com/maps?q={lat},{lon}",
}
result["has_pii"] = True
break
except Exception:
return 0
pass
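The `©xyz` handling in `_extract_mp4_tags` parses an ISO 6709 string. A minimal standalone sketch of that step follows, using the same regular expression as the hunk; `parse_iso6709` is a hypothetical helper name and the `maps_url` field is omitted for brevity.

```python
import re
from typing import Optional

# Latitude then longitude, each a signed decimal; any altitude part is ignored.
_ISO6709 = re.compile(r"([+-]\d+\.?\d*)([+-]\d+\.?\d*)")


def parse_iso6709(value: str) -> Optional[dict]:
    """Parse e.g. '+55.6763+012.5681+005.000/' into a lat/lon dict."""
    m = _ISO6709.match(value.strip())
    if not m:
        return None
    lat = round(float(m.group(1)), 7)
    lon = round(float(m.group(2)), 7)
    return {
        "lat": lat,
        "lon": lon,
        "lat_ref": "N" if lat >= 0 else "S",
        "lon_ref": "E" if lon >= 0 else "W",
    }
```

Note that this pattern only covers the decimal-degrees form of ISO 6709; the degrees-minutes-seconds forms permitted by the standard would be misread as large decimal values.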
def _extract_avi_info(content: bytes, result: dict) -> None:
"""Populate result dict from RIFF INFO list chunk in an AVI file."""
try:
import struct
if len(content) < 12 or content[:4] != b"RIFF":
return
# Walk top-level RIFF chunks looking for the INFO LIST
i = 12
while i + 8 <= len(content):
chunk_id = content[i:i+4]
chunk_size = struct.unpack_from("<I", content, i + 4)[0]
if chunk_id == b"LIST" and content[i+8:i+12] == b"INFO":
_parse_riff_info(content, i + 12, i + 8 + chunk_size, result)
break
i += 8 + chunk_size + (chunk_size & 1) # RIFF chunks are word-aligned
except Exception:
pass
def _parse_riff_info(content: bytes, start: int, end: int, result: dict) -> None:
import struct
_info_labels = {
b"INAM": "Title",
b"IART": "Artist",
b"ICMT": "Comment",
b"ISBJ": "Subject",
b"ICRD": "Date",
}
i = start
while i + 8 <= end and i + 8 <= len(content):
sub_id = content[i:i+4]
sub_size = struct.unpack_from("<I", content, i + 4)[0]
label = _info_labels.get(sub_id)
if label:
raw = content[i+8 : i+8+sub_size]
val = raw.decode("utf-8", errors="replace").strip("\x00 ")
if val and len(val) >= _EXIF_PII_MIN_LEN:
result["pii_fields"][label] = val
result["has_pii"] = True
if label == "Artist" and not result["author"]:
result["author"] = val
if label == "Date" and not result["datetime"]:
result["datetime"] = val
i += 8 + sub_size + (sub_size & 1)
def _extract_audio_metadata(content: bytes, filename: str) -> dict:
"""Extract PII-bearing tags from an audio file.
Returns the same structure as _extract_exif / _extract_video_metadata.
No GPS extraction: GPS is not embedded in audio containers in practice.
Uses mutagen.File(easy=True) which normalises tags to lowercase keys for
MP3 (ID3), M4A/AAC (MPEG-4), FLAC, OGG Vorbis, and AIFF. WMA/ASF tags
use mixed-case keys (e.g. "Title", "Author"); these are lowercased during
normalisation so the same extraction logic covers all formats.
"""
result: dict = {"gps": None, "pii_fields": {}, "author": None,
"datetime": None, "device": None, "has_pii": False}
try:
import mutagen
f = mutagen.File(fileobj=io.BytesIO(content), filename=filename, easy=True)
if not f or not f.tags:
return result
# Normalise all tags to {lowercase_key: str_value} regardless of format
def _strval(v):
return str(v[0] if isinstance(v, list) and v else v).strip()
tags: dict[str, str] = {
k.lower(): _strval(v) for k, v in f.tags.items()
}
# Fields that may contain personal names or descriptions
_pii_keys = {
"title": "Title",
"artist": "Artist",
"albumartist": "Album Artist",
"composer": "Composer",
"lyricist": "Lyricist",
"conductor": "Conductor",
"author": "Author",
"copyright": "Copyright",
"comment": "Comment",
"description": "Description",
# WMA/ASF mixed-case keys survive as lowercase after normalisation
"wm/albumartist": "Album Artist",
"wm/composer": "Composer",
"wm/conductor": "Conductor",
"wm/lyrics": "Lyrics",
}
seen: set[str] = set() # avoid duplicate label entries
for key, label in _pii_keys.items():
val = tags.get(key, "")
if val and len(val) >= _EXIF_PII_MIN_LEN and label not in seen:
result["pii_fields"][label] = val
result["has_pii"] = True
seen.add(label)
# Author — most specific personal name field wins
for key in ("artist", "author", "albumartist", "wm/albumartist", "composer"):
val = tags.get(key, "")
if val and len(val) >= _EXIF_PII_MIN_LEN:
result["author"] = val
break
# Recording / release date
for key in ("date", "year", "wm/year"):
val = tags.get(key, "")
if val:
result["datetime"] = val
break
except Exception:
pass
return result
def _detect_photo_faces(content: bytes, filename: str) -> int:
"""Detect faces in an image file using OpenCV Haar cascades.
@ -277,67 +463,151 @@ def _detect_photo_faces(content: bytes, filename: str) -> int:
return 0
def _scan_bytes(content: bytes, filename: str, poppler_path=None) -> dict:
"""Scan raw bytes for CPRs. Returns scanner result dict."""
_EMAIL_RE = re.compile(
r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
_PHONE_RE = re.compile(
r'(?:'
r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}' # +45/0045 DDDD DDDD
r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}' # +45/0045 DD DD DD DD
r'|\b[2-9]\d{7}\b' # 8 consecutive digits
r'|\b[2-9]\d{3}[\s\-]\d{4}\b' # DDDD DDDD
r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b' # DD DD DD DD
r')'
)
def _extract_text_from_bytes(content: bytes, filename: str) -> str:
"""Extract plain text from file bytes for email/phone pattern matching.
Returns empty string for binary media files (photos, video, audio) and
on any parse error, so callers never see an exception from this function.
"""
ext = Path(filename).suffix.lower()
try:
if ext in {".txt", ".csv", ".eml", ".msg"}:
return content.decode("utf-8", errors="replace")
if ext in {".docx", ".doc"}:
from docx import Document as _Doc
doc = _Doc(io.BytesIO(content))
parts = [p.text for p in doc.paragraphs]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.append(cell.text)
return "\n".join(parts)
if ext in {".xlsx", ".xlsm"}:
import openpyxl as _xl
wb = _xl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
parts = [
str(cell.value)
for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row
if cell.value is not None
]
wb.close()
return " ".join(parts)
if ext == ".pdf":
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as pdf:
parts = [p.extract_text() or "" for p in pdf.pages]
return "\n".join(parts)
except Exception:
pass
if ext not in PHOTO_EXTS | VIDEO_EXTS | AUDIO_EXTS:
try:
return content.decode("utf-8", errors="replace")
except Exception:
pass
return ""
def _find_emails_phones(text: str) -> dict:
"""Extract unique email addresses and Danish phone numbers from text.
Returns {"emails": [{"formatted": str}, ...], "phones": [{"formatted": str}, ...]}.
Phones are normalised to digit-only strings (preserving a leading '+').
"""
if not text:
return {"emails": [], "phones": []}
emails = list(dict.fromkeys(m.group(0).lower() for m in _EMAIL_RE.finditer(text)))
phones = list(dict.fromkeys(
('+' + re.sub(r'[\s\-]', '', m.group(0)[1:]) if m.group(0).lstrip().startswith('+')
else re.sub(r'[\s\-]', '', m.group(0)))
for m in _PHONE_RE.finditer(text)
))
return {
"emails": [{"formatted": e} for e in emails],
"phones": [{"formatted": p} for p in phones],
}
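To show how the two patterns and the normalisation behave on sample text, here is a self-contained sketch. The regular expressions are copied from the hunk above; `normalise_phone` and `find_contacts` are hypothetical names, and the sketch returns plain string lists rather than the `{"formatted": ...}` dicts the real function produces.

```python
import re

EMAIL_RE = re.compile(
    r"\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b"
)
PHONE_RE = re.compile(
    r"(?:"
    r"(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}"       # +45/0045 DDDD DDDD
    r"|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}"   # +45/0045 DD DD DD DD
    r"|\b[2-9]\d{7}\b"                                  # 8 consecutive digits
    r"|\b[2-9]\d{3}[\s\-]\d{4}\b"                       # DDDD DDDD
    r"|\b[2-9]\d(?:[\s\-]\d{2}){3}\b"                   # DD DD DD DD
    r")"
)


def normalise_phone(raw: str) -> str:
    """Strip spaces and hyphens; a leading '+' is left untouched."""
    return re.sub(r"[\s\-]", "", raw)


def find_contacts(text: str) -> dict:
    """Return unique lowercased emails and normalised phones, in first-seen order."""
    emails = list(dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(text)))
    phones = list(dict.fromkeys(normalise_phone(m.group(0)) for m in PHONE_RE.finditer(text)))
    return {"emails": emails, "phones": phones}
```

`dict.fromkeys` deduplicates while preserving order, so the same number written with spaces and with hyphens collapses to one entry. The bare 8-digit alternative can also match unrelated numbers such as order IDs, so some false positives are to be expected.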
def _scan_bytes(content: bytes, filename: str, poppler_path=None, lang: str = "dan+eng") -> dict:
"""Scan raw bytes for CPRs, emails, and phone numbers. Returns result dict."""
if not SCANNER_OK:
return {"cprs": [], "dates": [], "error": "scanner not available"}
return {"cprs": [], "dates": [], "emails": [], "phones": [], "error": "scanner not available"}
ext = Path(filename).suffix.lower()
with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(content)
tmp_path = Path(tmp.name)
result: dict = {"cprs": [], "dates": []}
try:
if ext == ".pdf":
# Check if the PDF has a text layer before running full scan_pdf.
# Image-only PDFs (scanned documents) have no text and would trigger
# Tesseract OCR subprocesses that hang indefinitely on some files.
try:
import pdfplumber as _pp, io as _io
with _pp.open(_io.BytesIO(content)) as _pdf:
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as _pdf:
has_text = any(ds.is_text_page(p) for p in _pdf.pages)
if not has_text:
return {"cprs": [], "dates": []} # image-only PDF — no CPRs possible
return {"cprs": [], "dates": [], "emails": [], "phones": []}
except Exception:
pass # if pdfplumber fails, fall through to full scan_pdf
return ds.scan_pdf(tmp_path, poppler_path=poppler_path)
result = ds.scan_pdf(tmp_path, poppler_path=poppler_path, lang=lang)
elif ext in {".docx", ".doc"}:
return ds.scan_docx(tmp_path)
result = ds.scan_docx(tmp_path)
elif ext in {".xlsx", ".xlsm"}:
return ds.scan_xlsx(tmp_path)
result = ds.scan_xlsx(tmp_path)
elif ext == ".csv":
return ds.scan_csv(tmp_path)
result = ds.scan_csv(tmp_path)
elif ext == ".txt":
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
result = {"cprs": cprs, "dates": dates}
elif ext in {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}:
return ds.scan_image(tmp_path)
result = ds.scan_image(tmp_path, lang=lang)
else:
# Try plain text
try:
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
result = {"cprs": cprs, "dates": dates}
except Exception:
return {"cprs": [], "dates": []}
pass
except Exception as e:
return {"cprs": [], "dates": [], "error": str(e)}
result = {"cprs": [], "dates": [], "error": str(e)}
finally:
try:
tmp_path.unlink()
except Exception:
pass
ep = _find_emails_phones(_extract_text_from_bytes(content, filename))
result["emails"] = ep["emails"]
result["phones"] = ep["phones"]
return result
def _worker_scan_pdf(pdf_path_str: str, result_q) -> None:
def _worker_scan_pdf(pdf_path_str: str, result_q, lang: str = "dan+eng") -> None:
"""Worker executed in a spawned subprocess — must be a module-level function."""
try:
import document_scanner as _ds
from pathlib import Path as _Path
result_q.put(_ds.scan_pdf(_Path(pdf_path_str)))
result_q.put(_ds.scan_pdf(_Path(pdf_path_str), lang=lang))
except Exception as e:
result_q.put({"cprs": [], "dates": [], "error": str(e)})
def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dict:
def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang: str = "dan+eng") -> dict:
"""Like _scan_bytes but runs PDF scanning in a spawned subprocess with a hard timeout.
For non-PDF files delegates straight to _scan_bytes. For PDFs it writes the
@ -347,7 +617,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
"""
ext = Path(filename).suffix.lower()
if ext != ".pdf":
return _scan_bytes(content, filename)
return _scan_bytes(content, filename, lang=lang)
import multiprocessing
ctx = multiprocessing.get_context("spawn")
@ -360,7 +630,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
try:
with _pdf_subprocess_sem:
q = ctx.Queue()
p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q))
p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q, lang))
p.start()
p.join(timeout)
if p.is_alive():
@ -379,19 +649,22 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
def _scan_text_direct(text: str) -> dict:
"""Scan a plain text string for CPRs using extract_matches.
"""Scan a plain text string for CPRs, emails, and phone numbers.
Uses ds.extract_matches() directly rather than ds.scan_text() because
scan_text() calls extract_cpr_and_dates() which is not defined in
document_scanner.py (pre-existing bug).
"""
if not SCANNER_OK or not text:
return {"cprs": [], "dates": []}
if not text:
return {"cprs": [], "dates": [], "emails": [], "phones": []}
ep = _find_emails_phones(text)
if not SCANNER_OK:
return {"cprs": [], "dates": [], **ep}
try:
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
return {"cprs": cprs, "dates": dates, **ep}
except Exception:
return {"cprs": [], "dates": []}
return {"cprs": [], "dates": [], **ep}
def _html_esc(s: str) -> str:
"""HTML-escape a string for safe inline embedding."""
@ -433,6 +706,11 @@ def _placeholder_svg(ext: str, name: str) -> str:
}
bg, label = colors.get(ext, ("#9CA3AF", ext.upper().lstrip(".")))
short = name[:22] + "…" if len(name) > 22 else name
# Escape label/name before embedding — served as image/svg+xml, so an
# unescaped value (from the ?name= query param via /api/thumb) would be a
# reflected-XSS vector when the URL is opened directly.
label = _html_esc(label)
short = _html_esc(short)
svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="280" height="360">
<rect width="280" height="360" fill="{bg}"/>
<rect x="20" y="20" width="240" height="280" rx="8" fill="rgba(255,255,255,0.12)"/>


@ -1,6 +1,6 @@
# GDPR Scanner — Brugermanual
Version 1.6.14
Version 1.7.9
---
@ -33,7 +33,7 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
**Hvad scanneren gennemgår:**
- Microsoft 365: Exchange e-mail, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drev
- Lokale og netværksbaserede filmapper (herunder SMB/NAS-drev)
- Lokale og netværksbaserede filmapper (herunder SMB/NAS-drev og SFTP-servere)
**Hvad den finder:**
- CPR-numre
@ -50,16 +50,16 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
Når du åbner scanneren, er skærmen inddelt i tre områder:
```
┌─────────────────┬──────────────────────────────────────────┐
│ │ Topbjælke: Scan-knap, profiler, handlinger│
│ Venstre panel ├──────────────────────────────────────────┤
┌───────────────────────────────────────────────────────────────┐
│ Topbjælke: Scan-knap, profiler, handlinger
│ Venstre panel ──────────────────────────────────────────────┤
│ │ │
│ - Kilder │ Resultater / scanningsforløb │
│ - Indstillinger│ │
│ - Indstillinger
│ - Konti │ │
│ - Statistik ├──────────────────────────────────────────┤
│ - Statistik ──────────────────────────────────────────────┤
│ │ Aktivitetslog │
└─────────────────┴──────────────────────────────────────────┘
└───────────────────────────────────────────────────────────────┘
```
**Venstre panel** — vælg hvad der skal scannes og hvordan.
@ -104,17 +104,33 @@ Fanen Google Workspace lader dig forbinde en Google Workspace-konto (tidligere G
| Gmail | Alle e-mails i den enkelte brugers indbakke og labels |
| Google Drev | Alle filer ejet af eller delt med den enkelte bruger |
### 3.3 Lokale og netværksbaserede filer
### 3.3 Lokale, netværksbaserede og SFTP-filkilder
Fanen **Filkilder** viser de lokale mapper og netværksdrev, du har konfigureret.
Fanen **Filkilder** viser de lokale mapper, netværksdrev og SFTP-servere, du har konfigureret.
**Sådan tilføjer du en ny filkilde:**
1. Indtast en **Betegnelse** — et navn du kan genkende (f.eks. "Skolens Fællesmappe").
2. Indtast **Stien**:
- Lokal mappe: `~/Dokumenter` eller `/Volumes/Drev`
- Netværksdrev: `//nas-server/delt` eller `\\server\delt`
3. Hvis det er et netværksdrev, udfyldes felterne **SMB-vært**, **Brugernavn** og **Adgangskode** automatisk. Adgangskoden gemmes sikkert i systemets nøglering.
4. Klik på **Tilføj**.
2. Vælg **kildetype** med pillerne øverst i formularen:
**Lokal**
- Indtast **Stien** til mappen: `~/Dokumenter` eller `/Volumes/Drev`.
- Klik på **Tilføj**.
**Netværk (SMB)**
- Indtast **Stien** i UNC-format: `//nas-server/delt` eller `\\server\delt`.
- Udfyld **SMB-vært**, **Brugernavn** og **Adgangskode**. Adgangskoden gemmes sikkert i systemets nøglering.
- Klik på **Tilføj**.
**SFTP**
- Indtast **Vært** (værtsnavn eller IP-adresse på SSH/SFTP-serveren).
- Indtast **Port** (standard 22).
- Indtast **Brugernavn**.
- Indtast **Fjernsti**, der skal scannes (f.eks. `/home/delt` eller `/`).
- Vælg **Godkendelsestype**:
- **Adgangskode** — indtast adgangskoden. Den gemmes sikkert i systemets nøglering.
- **Privat nøgle** — klik på **Upload nøglefil** og vælg din SSH-privatnøgle (OpenSSH- eller PEM-format). Hvis nøglen er beskyttet med en adgangssætning, skal du indtaste den. Nøglefilen gemmes i scannerens datamappe med `600`-rettigheder.
- Klik på **Tilføj**.
Du kan tilføje så mange filkilder, du har brug for. De vil fremgå som valgbare kilder i venstre panel, når du er klar til at scanne.
@ -154,6 +170,10 @@ Scan kun elementer ændret efter en bestemt dato. Hurtige forudindstillinger —
**Maks. e-mails pr. bruger** — stop efter at have scannet dette antal e-mails per person (standard 2.000). Øg det, hvis du har brug for fuld dækning.
**Kun CPR-tilstand** — når aktiveret, flagges kun elementer, der indeholder mindst ét kvalificerende CPR-nummer. Elementer, hvis eneste fund er e-mailadresser, telefonnumre, ansigter eller GPS/EXIF-metadata, springes over. Nyttigt, når du ønsker en fokuseret rapport udelukkende om CPR-eksponering.
**OCR-sprog** — vælg den sprogpakke, Tesseract bruger, når der læses tekst fra scannede PDF-filer og billeder. Standard er `Dansk + Engelsk`, som dækker langt de fleste dokumenter. Skift til en anden forudindstilling, hvis dine dokumenter overvejende er på et andet sprog.
### 4.4 Start scanningen
Klik på den blå **Scan**-knap i topbjælken.
@ -180,6 +200,8 @@ Klik på **▶ Genoptag** for at fortsætte fra det sted, scanningen slap. Klik
## 5. Forstå resultaterne
Når du åbner appen, viser gitteret **alle åbne fund** — alle markerede elementer, der stadig kræver handling (dvs. uden disposition), på tværs af alle dine scanninger og ikke kun den seneste. Efterhånden som du mærker elementer (behold, anonymisér, slet, falsk positiv …), forsvinder de fra denne visning, så det, der står tilbage, er dit udestående arbejde. Hvert element vises én gang med sin nyeste tilstand. Vil du i stedet se en enkelt tidligere scanning, så brug sessionsvælgeren (se *Gennemse tidligere scanningssessioner* nedenfor).
Hvert fundet element vises som et kort. Her er forklaringen på mærker og labels:
### Kildemærker
@ -192,7 +214,8 @@ Hvert fundet element vises som et kort. Her er forklaringen på mærker og label
| Teams | Fundet i en Teams-kanal |
| Gmail | Fundet i en Gmail-postkasse |
| Google Drev | Fundet i Google Drev |
| Lokal / Netværk | Fundet på et filshare |
| Lokal / Netværk | Fundet på et lokalt eller SMB-filshare |
| 🔒 SFTP | Fundet på en SFTP-server |
### Risikoniveau
@ -226,6 +249,19 @@ Brug filterbjælken over resultaterne til at indsnævre visningen:
- **Disposition** — vis elementer efter gennemgangsstatus.
- **Deling** — filtrer på delt / ekstern / alle.
- **Risiko** — vis kun Art. 9, fotos, GPS eller høj-risiko-elementer.
- **Rolle** — vis kun **Ansatte** eller **Elever**. Påvirker også eksporten: klikker du på **Excel** eller **Art.30**, mens en rolle er valgt, indeholder rapporten kun den pågældende gruppe, og filnavnet får suffikset `_elever` eller `_ansatte`.
### Gennemse tidligere scanningssessioner
Når en scanning er afsluttet, kan du gennemse resultaterne fra en tidligere scanningssession uden at køre en ny scanning.
- Klik på **Sessioner**-knappen i historikbanneret (der vises over resultatgitteret, når en scanning er afsluttet) for at åbne sessionsvælgeren.
- Hver række viser dato og tidspunkt, hvilke kilder der blev scannet, og hvor mange elementer der blev fundet. Et **Δ**-mærkat angiver delta-scanninger; **Seneste** markerer den nyeste session.
- Klik på en række for at indlæse den pågældende sessions resultater i gitteret. Et historikbanner erstatter statuslinjen med sessionens oplysninger.
- Klik på **Åbne fund** i banneret for at forlade den tidligere session og vende tilbage til standardvisningen med alle elementer, der stadig kræver handling.
- Start af en ny scanning afslutter automatisk historiktilstanden og skifter til live-resultater.
Alle filtre, eksporter og dispositionsmærkning fungerer normalt, mens du gennemser tidligere sessioner.
---
@ -240,6 +276,7 @@ Forhåndsvisningen viser:
- Alle fundne CPR-numre og deres kontekst
- Øvrige personoplysninger registreret (telefon, e-mailadresse, IBAN mv.)
- Deling og ekstern adgangsinformation
- **Relaterede dokumenter** — hvis andre elementer i samme scanningssession indeholder ét eller flere af de samme CPR-numre, vises de i et "Relaterede dokumenter"-afsnit. Klik på et element for at åbne dets forhåndsvisning. Det gør det nemmere at spore en persons data på tværs af flere filer eller e-mails.
### Angiv en disposition
@ -257,6 +294,46 @@ Hvert element har en **Disposition**-rullemenu i forhåndsvisningspanelet. Vælg
Klik på **Gem** efter valget. En lille **✓ Gemt**-bekræftelse vises.
### Redigér en fil på stedet
En **✂**-knap vises på resultatkort, hvor scanneren kan overskrive filen direkte. Klikker du på den, erstattes alle CPR-numre med `██████-████`-blokke, og handlingen registreres som en `"redacted"`-disposition. Kortet **bevares i gitteret indtil din næste scanning** — det vises nedtonet med et grønt **✏ Redigeret**-mærke, og dets handlingsknapper skjules, så det ikke kan behandles igen. På den måde kan du let se, hvad du har håndteret i sessionen; gitteret genopbygges, næste gang du scanner. Brug denne mulighed, når du ønsker at anonymisere en fil frem for at slette den helt.
Knappen er tilgængelig for følgende kildetyper og formater:
| Kilde | Understøttede formater |
|---|---|
| Lokale filer | DOCX, XLSX, CSV, TXT, PDF |
| Netværksdrev (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drev | DOCX, XLSX, PDF |
Knappen er **ikke** tilgængelig for e-mail-elementer (Exchange/Gmail) eller i visningstilstand. Google Docs og Sheets, der er eksporteret som DOCX/XLSX under scanning, kan ikke redigeres på stedet — eksportér filen manuelt fra Google først og redigér derefter den hentede kopi.
> **PDF-sikkerhedsnote:** PDF-redigering sker fysisk — CPR-nummerteksten slettes fra PDF-datastrømmen og er ikke blot dækket over med en sort boks. En læser kan ikke gendanne den oprindelige tekst ved at markere under redigeringen eller ved programmatisk inspektion af filen. Billedbaserede (scannede) PDF-filer understøttes også: scanneren lokaliserer CPR-nummeret på sidebilledet via OCR og overskriver det pågældende område fysisk.
> **OneDrive / SharePoint / Teams-note:** Redigering skriver den ændrede fil tilbage via Microsoft Graph API og kræver tilladelsen `Files.ReadWrite.All`. Scanneren anmoder nu automatisk om denne tilladelse ved login. Hvis du har godkendt før denne opdatering, skal du logge ud og logge ind igen (Indstillinger → Microsoft 365 → Log ud), så scanneren henter et nyt token med skriveadgang. Ved app-only-opsætninger (serviceprincipal) skal en Global Administrator tildele applikationstilladelsen `Files.ReadWrite.All` i Azure → App-registreringer → API-tilladelser → Giv administratorsamtykke.
> **Google Drev-note:** Redigering i Google Drev kræver `drive`-scopet på servicekontoens domain-wide delegation (ikke blot `drive.readonly`). Hvis redigeringen fejler med en rettighedsfejl, bedes du kontakte din Google Workspace-administrator for at tilføje scopet `https://www.googleapis.com/auth/drive` til servicekontoens delegation i Admin Console.
> **SFTP-note:** SFTP-redigering er kun tilgængelig for elementer fundet i den aktuelle scanningssession. Gennemfør en ny scanning, hvis du gennemser historiske resultater.
### Massemarkering af flere elementer på én gang
Hvis du skal anvende den samme disposition på mange elementer, kan du bruge **Vælg-tilstand** i stedet for at åbne hvert kort enkeltvis.
1. Klik på **Vælg** i filterbjælken. Der vises afkrydsningsfelter på hvert resultatkort.
2. Sæt hak ved de elementer, du vil mærke, eller klik på **Vælg alle synlige** i massetag-bjælken nederst på skærmen for at vælge alt, der matcher de aktuelle filtre.
3. Vælg en disposition fra rullemenuen i massetag-bjælken.
4. Klik på **Anvend**. Alle valgte elementer opdateres med det samme.
5. Klik på **Afslut** (eller **Vælg**-knappen igen) for at forlade vælg-tilstanden.
> **Tip:** Brug filterbjælken til f.eks. at afgrænse til alle ikke-gennemgåede elevfund, og klik derefter på **Vælg alle synlige** — så kan du mærke en hel kategori med to klik.
### Dispositionsstatistikbjælke
En tynd statistikbjælke over resultatgitteret viser: **I alt · Ikke gennemgået · Opbevar · Slet** og en **% gennemgået**-angivelse. Den opdateres automatisk efter hvert gem og giver dig et løbende overblik over, hvor langt du er i gennemgangen.
### Find alle elementer for en bestemt person
Klik på **🔍** i venstre panel (under Statistik) for at åbne **Registreret person**-opslaget. Indtast et CPR-nummer, og scanneren finder alle fundne elementer, der indeholder dette nummer. Du kan derefter slette dem alle i ét trin — i overensstemmelse med retten til sletning (GDPR artikel 17).
@ -287,6 +364,8 @@ Klik på **Slet**-knappen i filterbjælken for at åbne massesletningsvinduet.
4. En statuslinje viser sletningerne i realtid. E-mails flyttes til **Slettet post**; filer flyttes til **papirkurven**.
Slettede elementer (uanset om det er en enkelt sletning, en massesletning eller en sletning efter anmodning fra en registreret) **bevares i gitteret indtil din næste scanning** — nedtonet med et rødt **🗑 Slettet**-mærke og med skjulte handlingsknapper — så du kan se, hvad der blev fjernet i sessionen. Hvis en massesletning delvist mislykkes, markeres kun de elementer, serveren faktisk slettede; de, der fejlede, forbliver aktive, så du kan forsøge igen. Gitteret genopbygges, næste gang du scanner.
En fuldstændig revisionslog over alle sletninger (hvad der er slettet, hvornår og hvorfor) medtages i artikel 30-rapporten.
---
@ -323,7 +402,7 @@ Klik på **Profiler** for at åbne profil­administrations­panelet. Her kan du:
Klik på **Excel** i filterbjælken for at downloade de aktuelle resultater som en Excel-projektmappe. Projektmappen indeholder:
- Et oversigtsfaneblad med scanningsdato, antal elementer og kildefordeling.
- Et separat faneblad for hver kildetype (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Lokal, Netværk).
- Et separat faneblad for hver kildetype (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Lokal, Netværk, SFTP).
- Alle fundne elementer, herunder kilde, konto, CPR-antal, risikoniveau, delingsstatus og disposition.
Knapperne **Excel** og **Art.30** er altid tilgængelige — også efter genstart af programmet — og eksporterer resultaterne fra den seneste afsluttede scanningssession uden at kræve en ny scanning.
@ -358,15 +437,22 @@ Du kan give en DPO, skoleleder eller compliance-koordinator skrivebeskyttet adga
Klik på **🔗**-knappen øverst til højre i topbjælken for at åbne delingspanelet.
1. Angiv eventuelt en **Betegnelse** for at identificere, hvem linket er til (f.eks. "DPO-gennemgang april 2026").
2. Vælg en **Udløbsdato** — 7 dage, 30 dage, 90 dage, 1 år eller Aldrig.
3. Klik på **Opret**. Der genereres et unikt link: `http://host:5100/view?token=…`
4. Klik på **Kopiér** for at kopiere linket til udklipsholderen, og send det til gennemgangeren.
2. Vælg et **Omfang**:
- **Alle roller** — modtageren ser alle fundne elementer.
- **Ansatte** / **Elever** — modtageren ser kun elementer tilhørende den valgte rollegruppe. Rollefilteret er låst i deres visning.
- **Bruger** — modtageren ser kun elementer tilhørende en bestemt medarbejder. Vælg personen fra søgefeltet; scanneren matcher automatisk både deres M365- og Google Workspace-e-mailadresser. Brug denne mulighed, når du vil give en enkelt medarbejder adgang til sine egne scanningsresultater.
3. Angiv eventuelt et **Datointerval** — brug felterne "Elementer fra" og "Elementer til" for at begrænse modtagerens visning til elementer ændret inden for en bestemt periode. Lad begge felter stå tomme for ingen datobegrænsning.
4. Vælg en **Udløbsdato** — 7 dage, 30 dage, 90 dage, 1 år eller Aldrig.
5. Klik på **Opret**. Formularen ryddes, og det nye link vises øverst i listen **Aktive links** nedenfor, kortvarigt fremhævet.
6. Klik på **Kopiér** i linkets række for at kopiere det til udklipsholderen, og send det til gennemgangeren.
Gennemgangeren åbner linket i en browser. De kan se det fulde resultatgitter og mærke dispositioner, men kan ikke starte scanninger, ændre indstillinger, se loginoplysninger eller slette elementer.
Gennemgangeren åbner linket i en browser. De kan se resultatgitteret (afgrænset til det tilladte rolleomfang) og mærke dispositioner, men kan ikke starte scanninger, ændre indstillinger, se loginoplysninger eller slette elementer.
**Administrer eksisterende links**
Delingspanelet viser alle aktive links. Hver række viser betegnelse, udløbsdato og hvornår linket sidst blev brugt. Klik på **Kopiér** for at kopiere et link igen, eller **Tilbagekald** for at gøre det ugyldigt med det samme.
Delingspanelet viser alle aktive links. Hver række viser betegnelse, rollemærkat (hvis afgrænset), udløbsdato og hvornår linket sidst blev brugt. Klik på **Kopiér** for at kopiere et link igen, eller **Tilbagekald** for at gøre det ugyldigt med det samme.
> **Tip:** I skoler og kommuner er det almindeligt at have separate DPO'er eller compliance-ansvarlige for henholdsvis ansatte og elever. Opret ét afgrænset link til hver — elev-DPO'en vil kun se elevdata, og ansatte-DPO'en vil kun se ansattedata.
### 10.2 Viewer-PIN
@ -374,7 +460,7 @@ Som alternativ til token-links kan du angive en numerisk PIN-kode (4–8 cifre)
For at angive eller ændre PIN-koden skal du indtaste den nye kode i feltet **Ny PIN** og klikke på **Gem PIN**. Klik på **Ryd PIN** for at fjerne den.
> **Sikkerhedsnote:** Token-links er mere sikre end en PIN-kode, fordi hvert link kan tilbagekaldes individuelt og har en udløbsdato. Brug PIN-indstillingen kun til betroede interne gennemgangere på dit lokale netværk.
> **Sikkerhedsnote:** Token-links er mere sikre end en PIN-kode, fordi hvert link kan tilbagekaldes individuelt, har en udløbsdato og kan afgrænses til en bestemt rollegruppe. Brug PIN-indstillingen kun til betroede interne gennemgangere på dit lokale netværk, der har brug for adgang til alle resultater.
### 10.3 Hvad gennemgangeren kan gøre
@ -391,6 +477,7 @@ For at angive eller ændre PIN-koden skal du indtaste den nye kode i feltet **Ny
| Slette elementer | Nej |
| Tilgå indstillinger | Nej |
| Oprette eller tilbagekalde viewer-links | Nej |
| Se elementer uden for deres rolleomfang | Nej |
---
@ -409,6 +496,7 @@ Gå til **Indstillinger → Planlægger** for at konfigurere automatiske scannin
7. Aktiver eventuelt:
- **Send rapport automatisk** — send Excel-rapporten pr. e-mail til dine konfigurerede modtagere efter hver scanning.
- **Håndhæv opbevaringspolitik** — slet automatisk elementer ældre end din opbevaringspolitik efter hver scanning.
- **Kun rapport** — spring scanningen over og send blot de seneste resultater fra databasen som e-mail. Nyttigt til regelmæssige opsummerings-e-mails uden at køre en ny scanning. Når aktiveret, kræves ingen profil, og M365-godkendelse er ikke nødvendig.
8. Klik på **Gem**.
Planlæggerindikatoren i topbjælken viser dato og tidspunkt for den næste planlagte scanning ("Næste: …").
@ -440,7 +528,17 @@ Klik på **Gem** for at gemme, og klik derefter på **Test** for at sende en tes
> Hvis din konto har MFA (to-faktor-godkendelse) aktiveret, kan du ikke bruge din almindelige adgangskode. Du skal oprette en **app-adgangskode** i din kontos sikkerhedsindstillinger:
> - **Personlig Microsoft-konto**: account.microsoft.com/security → App-adgangskoder
> - **Gmail**: myaccount.google.com → Sikkerhed → 2-trinsbekræftelse → App-adgangskoder
> - **Gmail / Google Workspace**: myaccount.google.com → Sikkerhed → 2-trinsbekræftelse → App-adgangskoder (for Google Workspace-konti skal din administrator først tillade app-adgangskoder eller opsætte et SMTP-relay)
### Send altid via SMTP (spring Microsoft Graph over)
Når scanneren er logget på Microsoft 365, sender den normalt e-mail gennem Microsoft 365 direkte, uden at bruge SMTP-indstillingerne ovenfor. Det er praktisk, men det kan ikke levere til visse adresser — især en adresse på et Google-hostet underdomæne af dit Microsoft 365-domæne, som Microsoft 365 opfatter som intern og kasserer i stilhed (ingen levering, ingen fejl).
Slå **Send altid via SMTP (spring Microsoft Graph over)** til for at tvinge al e-mail — test-e-mails, manuelle rapporter og automatisk e-mail efter scanning — gennem den SMTP-server, du har konfigureret ovenfor. Brug dette, når dine rapporter sendes til en postkasse, som Microsoft 365 ikke kan levere til (f.eks. en Google Workspace-adresse), med `smtp.gmail.com` / `smtp-relay.gmail.com` som SMTP-vært.
### Send rapport efter manuel scanning
Slå **Send rapport efter manuel scanning** til for automatisk at sende rapporten pr. e-mail til dine konfigurerede modtagere, hver gang en manuel scanning er færdig.
### Send en rapport manuelt
@ -480,6 +578,7 @@ Klik på **Nulstil database** for at slette alle scanningsdata, dispositioner og
| Indstilling | Beskrivelse |
|-------------|-------------|
| Tema | Mørkt eller lyst |
| Softwareopdatering | Søg efter og installér nye versioner af scanneren direkte fra browseren, eller slå automatisk daglig opdatering til. Vises kun på serverinstallationer, der kører fra et git-checkout (ikke i skrivebordsappen). Programmet genstarter selv efter installation; opdatering afvises, mens en scanning kører, og næste scanning efter en opdatering fortsætter normalt. |
### Fanen Sikkerhed
@ -487,6 +586,7 @@ Klik på **Nulstil database** for at slette alle scanningsdata, dispositioner og
|-------------|-------------|
| Admin-PIN | Valgfri PIN-kode, der beskytter destruktive handlinger (nulstil database, erstat ved import) |
| Viewer-PIN | Valgfri 4–8-cifret PIN-kode, der giver alle adgang til `/view` i en browser som skrivebeskyttet gennemganger uden et token-link |
| Interface-PIN | Valgfri 4–8-cifret PIN-kode, der skal indtastes, inden man får adgang til selve scannerens brugerflade. Alle, der tilgår scanner-URL'en, omdirigeres til en loginside, indtil den korrekte kode er indtastet. Adgang via `/view` er ikke berørt. |
### Avancerede scanningsindstillinger
@ -496,6 +596,31 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
**Søg efter ansigter i billeder** — langsommere scanning, der registrerer fotografier med genkendelige menneskelige ansigter. Markerer dem som artikel 9 biometriske data. Anbefales til skoler, der opbevarer elevfotos.
**Ignorer GPS i billeder** — når aktiveret, flagges billeder ikke, hvis GPS-koordinater i billedets metadata er det eneste PII-signal. Nyttigt ved scanning af elevkonti: smartphones indlejrer automatisk GPS-koordinater i alle kamerabilleder, hvilket ellers ville generere mange lavprioriterede fund i en skolekontekst. Hvis et billede allerede er flagget af en anden årsag (ansigter, EXIF-forfatterfelter), vises GPS-koordinaterne stadig i detaljekortet.
**Min. CPR-antal pr. fil** — en fil flagges kun, hvis den indeholder mindst dette antal *distinkte* CPR-numre. Standardværdien er 1 (nuværende adfærd). Sæt til 2 for at undgå falske positive ved elevscanninger: en elevs samtykkeerklæring eller indmeldelsesformular indeholder typisk kun elevens eget CPR-nummer, mens en klasseliste eller karakteroversigt med flere elevers CPR-numre stadig vil blive rapporteret.
**Kun CPR-tilstand** — når aktiveret, springes elementer uden CPR-numre over (kun e-mailadresser, telefonnumre, ansigter eller GPS/EXIF-data). Brug dette, når du ønsker en rapport, der udelukkende fokuserer på CPR-eksponering.
**OCR-sprog** — vælger den sprogpakke, Tesseract bruger, når der læses tekst fra scannede PDF-filer og billeder. Standard: `Dansk + Engelsk`. Skift til en anden forudindstilling for dokumenter på tysk, svensk eller fransk.
### Fanen AI / NER
Gå til **Indstillinger → AI / NER** for at konfigurere Claude AI-drevet navnegenkendelse.
Som standard bruger scanneren spaCy (en lokal maskinlæringsmodel) til at genkende personnavne, adresser og organisationsnavne i dokumenttekst. Aktivering af Claude NER erstatter dette med kald til Claude Haiku API, som er betydeligt mere nøjagtig — særligt for danske dobbeltefternavne (f.eks. "Hansen-Nielsen"), fremmedsprogede navne og navne uden omgivende kontekst (f.eks. isolerede celler i et regneark).
**Sådan aktiverer du:**
1. Opret en Anthropic API-nøgle på [console.anthropic.com](https://console.anthropic.com).
2. Indsæt nøglen i feltet **Anthropic API-nøgle** og klik på **Gem**.
3. Slå **Aktiver Claude NER**-kontakten til og klik på **Gem** igen.
4. Klik på **Test nøgle** for at bekræfte, at nøglen er gyldig og API'et er tilgængeligt.
**Pris:** Claude Haiku faktureres pr. token efter Anthropics offentliggjorte priser. Et typisk dokument koster en brøkdel af en øre. Scanningsresultater caches pr. dokument, så genscanning af den samme fil aldrig medfører en ny opkrævning.
**Fallback:** Hvis `anthropic`-pakken ikke er installeret, eller API-nøglen mangler, falder scanneren automatisk tilbage til spaCy uden fejl — kontakten har blot ingen effekt.
**Opbevaringspolitik** — når aktiveret, markeres elementer ældre end det angivne antal år som forældet. Regnskabsårets afslutning bestemmer, hvordan skæringsdatoen beregnes:
| Indstilling | Beregning af skæringsdato |
@ -504,6 +629,12 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
| 31 dec (Bogføringsloven) | Seneste 31. december minus N år |
| 30 jun / 31 mar | Seneste forekomst af den dato minus N år |
### Fanen Revisionslog
Gå til **Indstillinger → Revisionslog** for at se en uforanderlig log over alle væsentlige administrative handlinger i scanneren. Hver post viser tidspunkt, handlingstype, detaljer og klientens IP-adresse. Registrerede hændelser omfatter: gem/slet profil, opret/tilbagekald viewer-token, PIN-ændringer, tilføj/opdater/slet filkilde, gem/slet planlagt job, start/stop scanning, gem SMTP-konfiguration, dispositionsændringer, slet element og redigér element.
Loggen er skrivebeskyttet og gemmes i scannerdatabasen sammen med scanningsresultaterne. Den er inkluderet i databaseeksporter og kan hjælpe dig med at dokumentere ansvarlighed over for en tilsynsmyndighed.
---
## 15. Ofte stillede spørgsmål
@ -515,10 +646,10 @@ Nej. CPR-numre fundet under en scanning gemmes kun som et antal (f.eks. "3 CPR-n
E-mails flyttes til brugerens **Slettet post**-mappe i Exchange — de slettes ikke permanent og kan gendannes af brugeren eller en administrator. Filer flyttes til **papirkurven** i den pågældende tjeneste (OneDrive, SharePoint, filsystem). Permanent sletning kræver en efterfølgende handling af brugeren eller administrator.
**Kan jeg scanne uden at forbinde til Microsoft 365?**
Ja. Du kan scanne lokale og SMB-filshares uden nogen M365- eller Google-forbindelse. Åbn **Kilder**, gå til fanen **Filkilder**, og tilføj dine filstier.
Ja. Du kan scanne lokale mapper, SMB/NAS-drev og SFTP-servere uden nogen M365- eller Google-forbindelse. Åbn **Kilder**, gå til fanen **Filkilder**, og tilføj dine filstier eller SFTP-serveroplysninger.
**Hvad er delta-scanning, og hvornår skal jeg bruge det?**
Delta-scanning bruger Microsoft Graphs ændringstokens til kun at hente elementer ændret siden den seneste scanning. Det er ideelt til regelmæssige (f.eks. ugentlige) compliance-tjek efter, at du har gennemført en fuld basisscan. Aktiver det i afsnittet Indstillinger i venstre panel.
Delta-scanning bruger Microsoft Graphs ændringstokens (for M365) og Google Drive Changes API (for Google Workspace) til kun at hente elementer ændret siden den seneste scanning. Det er ideelt til regelmæssige (f.eks. ugentlige) compliance-tjek efter, at du har gennemført en fuld basisscan. Aktiver det i afsnittet Indstillinger i venstre panel.
**Scanningen stoppede — kan jeg fortsætte, hvor den slap?**
Ja. Når du starter scanningen igen, vil et gult banner tilbyde at genoptage fra kontrolpunktet. Klik på **▶ Genoptag** for at fortsætte. Hvis du foretrækker at starte forfra, klikker du på **Start forfra**.
@ -535,9 +666,21 @@ I kontoafsnittet i venstre panel er der et felt **+ Tilføj konto manuelt**. Ind
**Kører scanneren? Jeg kan ikke se en statuslinje.**
Tjek aktivitetsloggen nederst på skærmen. Hvis en scanning kører, vises der beskeder her. Hvis du ikke ser noget, er scanningen muligvis afsluttet eller ikke startet. Kontrollér også, at du har valgt mindst én kilde og mindst én konto.
**Kan jeg beskytte scanneren med adgangskode, så elever eller kolleger ikke kan tilgå den på netværket?**
Ja. Gå til **Indstillinger → Sikkerhed → Interface-PIN** og angiv en 4–8-cifret PIN-kode. Fra da af får alle, der åbner scanner-URL'en i en browser, vist en loginside og kan ikke komme videre uden den korrekte kode. Interface-PIN er adskilt fra Admin-PIN (der beskytter destruktive handlinger) og Viewer-PIN (der beskytter skrivebeskyttet adgang). Eksisterende viewer-token-links fungerer fortsat uden interface-PIN.
**Kan en gennemganger mærke dispositioner uden adgang til scanningskontrollerne?**
Ja. Brug **🔗 Del**-knappen til at oprette et skrivebeskyttet viewer-link eller angiv en Viewer-PIN under Indstillinger → Sikkerhed. Gennemgangeren åbner linket i sin browser og kan gennemse resultater og mærke dispositioner uden at se loginoplysninger, kilder eller scanningsknapper. Se afsnit 10 for detaljer.
**Kan jeg begrænse et delelink til en bestemt tidsperiode?**
Ja. Brug felterne "Elementer fra" og "Elementer til" i delingspanelet, når du opretter et token-link. Modtageren vil kun se elementer, hvis ændringsdato falder inden for det angivne interval.
**Hvor kan jeg se, hvem der har ændret hvad i scanneren?**
Gå til **Indstillinger → Revisionslog**. Alle væsentlige administrative handlinger logges med tidsstempel, handlingstype, detaljer og IP-adresse.
**Vil aktivering af Claude NER øge omkostningerne væsentligt?**
For en typisk skole- eller kommunescanning er omkostningen ubetydelig — Claude Haiku faktureres i brøkdele af en øre pr. dokument, og resultater caches, så det samme dokument aldrig faktureres to gange. En fuld scanning af 10.000 dokumenter koster typisk under 7 kr. Den største gevinst er i navnetætte dokumenter (klasselister, sagsmapper), hvor spaCy tidligere gik glip af mange navne.
---
*GDPR Scanner v1.6.14 — teknisk opsætning og konfiguration: se README.md*
*GDPR Scanner v1.7.9 — teknisk opsætning og konfiguration: se README.md*


@ -1,6 +1,6 @@
# GDPR Scanner — User Manual
Version 1.6.14
Version 1.7.9
---
@ -33,7 +33,7 @@ When items are found, you can review them, decide what to do with each one (keep
**What it scans:**
- Microsoft 365: Exchange email, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drive
- Local and network file shares (including SMB/NAS drives)
- Local and network file shares (including SMB/NAS drives and SFTP servers)
**What it finds:**
- CPR numbers (Danish civil registration numbers)
@ -50,16 +50,16 @@ When items are found, you can review them, decide what to do with each one (keep
When you open the scanner, the screen is divided into three areas:
```
┌─────────────────┬──────────────────────────────────────────┐
┌─────────────────┬──────────────────────────────────────────
│ │ Top bar: Scan button, profiles, actions │
│ Left sidebar ├──────────────────────────────────────────┤
│ Left sidebar ├──────────────────────────────────────────
│ │ │
│ - Sources │ Results / scan progress │
│ - Options │ │
│ - Accounts │ │
│ - Stats ├──────────────────────────────────────────┤
│ - Stats ├──────────────────────────────────────────
│ │ Activity log │
└─────────────────┴──────────────────────────────────────────┘
└─────────────────┴──────────────────────────────────────────
```
**Left sidebar** — choose what to scan and how.
@ -104,17 +104,33 @@ The Google Workspace tab lets you connect a Google Workspace (formerly G Suite)
| Gmail | All emails in each user's inbox and labels |
| Google Drive | All files owned by or shared with each user |
### 3.3 Local and Network File Shares
### 3.3 Local, Network, and SFTP File Sources
The **Filkilder** (File Sources) tab lists any local folders or network drives you have configured.
The **Filkilder** (File Sources) tab lists any local folders, network drives, or SFTP servers you have configured.
**To add a new file source:**
1. Enter a **Label** — a friendly name you will recognise (e.g. "Skolens Fællesmappe").
2. Enter the **Path**:
- Local folder: `~/Documents` or `/Volumes/Share`
- Network share: `//nas-server/shared` or `\\server\share`
3. If it is a network share, fill in the **SMB Host**, **Username**, and **Password** that appear automatically. The password is stored securely in your system keychain.
4. Click **Tilføj** (Add).
2. Select the **source type** using the pill selector at the top of the form:
**Local**
- Enter the **Path** to the folder: `~/Documents` or `/Volumes/Share`.
- Click **Tilføj** (Add).
**Network (SMB)**
- Enter the **Path** in UNC format: `//nas-server/shared` or `\\server\share`.
- Fill in the **SMB Host**, **Username**, and **Password** that appear. The password is stored securely in your system keychain.
- Click **Tilføj** (Add).
**SFTP**
- Enter the **Host** (hostname or IP address of the SSH/SFTP server).
- Enter the **Port** (default 22).
- Enter the **Username**.
- Enter the **Remote path** to scan (e.g. `/home/shared` or `/`).
- Choose the **Authentication type**:
- **Password** — enter the password. It is stored securely in your system keychain.
- **Private key** — click **Upload key file** and select your SSH private key (OpenSSH or PEM format). If the key is passphrase-protected, enter the passphrase. The key file is stored in the scanner's data directory with `600` permissions.
- Click **Tilføj** (Add).
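
For readers who want to see what "stored with `600` permissions" means in practice, here is a minimal Python sketch. It is not the scanner's actual code: the function name `store_private_key`, the default file name `sftp_key`, and the directory layout are all assumptions made for illustration. It only shows one common way to write a key file so that it is readable and writable by its owner alone on a POSIX system.

```python
import os
import stat
import tempfile
from pathlib import Path


def store_private_key(key_bytes: bytes, data_dir: Path, name: str = "sftp_key") -> Path:
    """Write an uploaded key into data_dir with owner-only read/write access.

    Hypothetical helper: the name, signature, and file layout are illustrative.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / name
    # Request mode 0o600 at creation time so a new file is never
    # created with wider permissions first.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key_bytes)
    # An already existing file keeps its old mode on open, so set it explicitly.
    os.chmod(target, 0o600)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = store_private_key(b"dummy key material", Path(tmp))
        # Print the permission bits of the stored file.
        print(oct(stat.S_IMODE(path.stat().st_mode)))
```

On Windows the permission bits behave differently, so this sketch applies to Linux and macOS hosts only.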
You can add as many file sources as you need. Each one will appear as a selectable source in the main sidebar when you are ready to scan.
@ -154,6 +170,10 @@ Only scan items modified after a certain date. Quick presets — **1 år**, **2
**Max emails per user** — stop after scanning this many emails per person (default 2,000). Increase if you need complete coverage.
**CPR-only mode** — when enabled, only items containing at least one qualifying CPR number are flagged. Items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped. Useful when you want a focused CPR-only report without noise from other data types.
**OCR language** — choose the language pack(s) Tesseract uses when reading text from scanned PDFs and images. The default `Danish + English` covers the vast majority of documents. Switch to a different preset if your documents are predominantly in another language.
### 4.4 Start the Scan
Click the blue **Scan** button in the top bar.
@ -180,6 +200,8 @@ Click **▶ Genoptag** to continue from where the scan left off. Click **Start f
## 5. Understanding the Results
When you open the app, the grid shows **all open items** — every flagged item that still needs action (i.e. has no disposition), across all of your scans, not just the most recent one. As you tag items (kept, redacted, deleted, false positive, …) they drop out of this view, so what remains is your outstanding work. Each item appears once, showing its most recent state. To look at a single past scan instead, use the session picker (see *Browsing past scan sessions* below).
Each flagged item appears as a card. Here is what the badges and labels mean:
### Source badges
@ -192,7 +214,8 @@ Each flagged item appears as a card. Here is what the badges and labels mean:
| Teams | Found in a Teams channel |
| Gmail | Found in a Gmail mailbox |
| Google Drive | Found in Google Drive |
| Local / Network | Found on a file share |
| Local / Network | Found on a local or SMB file share |
| 🔒 SFTP | Found on an SFTP server |
### Risk level
@ -226,6 +249,19 @@ Use the filter bar above the results to narrow down what you see:
- **Disposition dropdown** — show items by their review status.
- **Transfer dropdown** — filter by shared / external / all.
- **Risk dropdown** — show only Art. 9, photos, GPS, or high-risk items.
- **Role dropdown** — show only **Ansatte** (staff) or **Elever** (students). Also scopes exports: clicking **Excel** or **Art.30** while a role is selected produces a report containing only that group, with `_elever` or `_ansatte` appended to the filename.
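
The filename rule for role-scoped exports can be sketched as follows. This is an illustration only: the base filename, the role keys `"student"` and `"staff"`, and the function `export_filename` are assumptions, and only the suffixes `_elever` and `_ansatte` come from the description above.

```python
from pathlib import Path
from typing import Optional

# The suffixes are taken from the manual; the role keys are hypothetical.
ROLE_SUFFIXES = {"student": "_elever", "staff": "_ansatte"}


def export_filename(base: str, role: Optional[str] = None) -> str:
    """Return the export filename, with a role suffix placed before the extension."""
    path = Path(base)
    suffix = ROLE_SUFFIXES.get(role, "") if role else ""
    return f"{path.stem}{suffix}{path.suffix}"
```

With no role selected the name is returned unchanged, which matches an unscoped export.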
### Browsing past scan sessions
Once a scan has completed, you can review results from any earlier scan session without running a new scan.
- Click the **Sessions** button in the history banner (which appears above the results grid after a scan completes) to open the session picker.
- Each row shows the date and time, which sources were scanned, and how many items were flagged. A **Δ** badge marks delta scans; **Latest** marks the most recent session.
- Click any row to load that session's results into the grid. A history banner replaces the progress bar, showing the session details.
- Click **Open items** in the banner to leave the past session and return to the default view of all items still needing action.
- Starting a new scan automatically exits history mode and switches back to live results.
All filters, exports, and disposition tagging work normally while browsing past sessions.
---
@ -240,6 +276,7 @@ The preview shows:
- All CPR numbers found and their context
- Other personal data detected (phone, email address, IBAN, etc.)
- Sharing and external-access information
- **Related documents** — if other items in the same scan session share one or more CPR numbers with this item, a "Related documents" section lists them. Click any row to open that item's preview. This helps you track the same person's data across multiple files or emails.
### Setting a disposition
@ -255,7 +292,47 @@ Every item has a **Disposition** dropdown in the preview panel. Choose one of:
| Privat brug — uden for scope | Personal item, not in scope for GDPR processing |
| Slettet | Already deleted (set automatically when you delete an item) |
After choosing, click **Gem**. A small **✓ Gemt** confirmation appears.
After choosing, click **Save**. A small **✓ Saved** confirmation appears.
### Redacting a file in-place
A **✂** button appears on result cards where the scanner can overwrite the file directly. Clicking it replaces all CPR numbers with `██████-████` blocks and logs the action as a `"redacted"` disposition. The card is **kept in the grid until your next scan** — it is greyed out, shows a green **✏ Redacted** badge, and its action buttons are hidden so it cannot be processed again. This lets you see at a glance what you handled during the session; the grid is rebuilt the next time you scan. This is useful when you want to sanitise a file rather than delete it entirely.
The button is available for the following source types and formats:
| Source | Supported formats |
|---|---|
| Local files | DOCX, XLSX, CSV, TXT, PDF |
| Network share (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drive | DOCX, XLSX, PDF |
The button is **not** available for email items (Exchange/Gmail) or viewer mode. Google Docs and Sheets that were exported as DOCX/XLSX during scanning cannot be redacted in-place — export the file from Google manually first, then redact the downloaded copy.
> **PDF security note:** PDF redaction uses physical removal — the CPR number text is erased from the PDF data stream, not just painted over with a black box. A reader cannot recover the original text by selecting under the redaction or inspecting the file programmatically. Image-based (scanned) PDFs are also supported: the scanner locates the CPR number on the page image via OCR and physically overwrites that region.
> **OneDrive / SharePoint / Teams note:** Redaction writes the modified file back via the Microsoft Graph API and requires the `Files.ReadWrite.All` permission. The scanner now requests this permission automatically during sign-in. If you authenticated before this update, sign out and sign back in (Settings → Microsoft 365 → Sign out) so the scanner obtains a new token with write access. For app-only (service principal) setups, a Global Admin must grant the `Files.ReadWrite.All` application permission in Azure → App registrations → API permissions → Grant admin consent.
> **Google Drive note:** Drive redaction requires the `drive` scope on the service account's domain-wide delegation grant (not just `drive.readonly`). If redaction fails with a permission error, ask your Google Workspace admin to add the `https://www.googleapis.com/auth/drive` scope to the service account delegation in the Admin Console.
> **SFTP note:** SFTP redaction is only available for items found in the current scan session. If you are browsing historical results, re-run the scan first.
### Bulk tagging multiple items at once
If you need to apply the same disposition to many items, use **Select mode** instead of opening each card individually.
1. Click **Vælg** (Select) in the filter bar. Per-card checkboxes appear on every result card.
2. Tick the items you want to tag, or click **Select all visible** in the bulk tag bar at the bottom of the screen to select everything matching the current filters.
3. Choose a disposition from the dropdown in the bulk tag bar.
4. Click **Apply**. All selected items are updated immediately.
5. Click **Done** (or the same **Vælg** button again) to leave select mode.
> **Tip:** Use the filter bar to narrow down to, for example, all unreviewed student items before clicking **Select all visible** — this lets you tag an entire category in two clicks.
### Disposition stats bar
A thin stats bar sits above the results grid showing: **Total · Unreviewed · Retain · Delete** counts and a **% reviewed** figure. It updates automatically after every disposition save, giving you a live overview of how far through the review you are.
### Finding all items for a specific person
@ -287,6 +364,8 @@ Click the **Delete** button in the filter bar to open the bulk delete modal.
4. A progress bar shows deletions as they happen. Emails go to **Deleted Items**; files go to the **recycle bin**.
Deleted items (whether from a single delete, a bulk delete, or a data-subject erasure) are **kept in the grid until your next scan** — greyed out with a red **🗑 Deleted** badge and their action buttons hidden — so you can see what was removed during the session. When a bulk delete partially fails, only the items the server actually deleted are marked; any that failed stay active so you can retry them. The grid is rebuilt the next time you scan.
A full audit log of every deletion (what was deleted, when, and why) is included in the Article 30 report.
---
@ -323,7 +402,7 @@ Click **Profiles** to open the profile management panel. Here you can:
Click **Excel** in the filter bar to download the current results as an Excel workbook. The workbook contains:
- A summary tab with scan date, item counts, and source breakdown.
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network).
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network, SFTP).
- Every flagged item, including source, account, CPR count, risk level, sharing status, and disposition.
The **Excel** and **Art.30** buttons are always available — even after restarting the application — and will export the results from the most recent completed scan session without requiring a new scan.
@ -358,15 +437,22 @@ You can give a DPO, school principal, or compliance coordinator read-only access
Click the **🔗** button in the top-right of the top bar to open the Share panel.
1. Optionally enter a **Label** to identify who the link is for (e.g. "DPO review April 2026").
2. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never.
3. Click **Create**. A unique link is generated: `http://host:5100/view?token=…`
4. Click **Copy** to copy the link to your clipboard, then send it to the reviewer.
2. Choose a **Scope**:
- **All roles** — the recipient sees all flagged items.
- **Ansatte** / **Elever** — the recipient sees only items belonging to that role group. The role filter is locked in their view.
- **User** — the recipient sees only the items belonging to a specific employee. Select the person from the search box; the scanner matches both their M365 and Google Workspace email addresses automatically. Use this when you want to give an individual employee access to their own scan results.
3. Optionally set a **Date range** — use the "Items from" and "Items until" date fields to limit the recipient to items modified within a specific period. This lets you, for example, create a link covering only last year's scan results. Leave both fields blank for no date restriction.
4. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never.
5. Click **Create**. The form clears and the new link appears at the top of the **Active links** list below, briefly highlighted.
6. Click **Copy** on that link's row to copy it to your clipboard, then send it to the reviewer.
The reviewer opens the link in any browser. They see the full results grid and can tag dispositions but cannot start scans, change settings, view credentials, or delete items.
The reviewer opens the link in any browser. They see the results grid (filtered to their permitted scope) and can tag dispositions but cannot start scans, change settings, view credentials, or delete items.
**Managing existing links**
The Share panel lists all active links. Each row shows the label, expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
The Share panel lists all active links. Each row shows the label, role badge (if scoped), expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
> **Tip:** In schools and municipalities it is common to have separate DPOs or compliance officers for staff data and student data. Create one scoped link for each — the student DPO will only ever see student items, and the staff DPO will only see staff items.
### 10.2 Viewer PIN
@ -374,7 +460,7 @@ As an alternative to token links, you can set a numeric PIN (4–8 digits) in **
To set or change the PIN, enter the new PIN in the **New PIN** field and click **Save PIN**. To remove it, click **Clear PIN**.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked and has an expiry date. Use the PIN option only for trusted internal reviewers on your local network.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked, has an expiry date, and can be role-scoped. Use the PIN option only for trusted internal reviewers on your local network who need access to all results.
### 10.3 What the reviewer can do
@ -391,6 +477,7 @@ To set or change the PIN, enter the new PIN in the **New PIN** field and click *
| Delete items | No |
| Access Settings | No |
| Create or revoke viewer links | No |
| See items outside their role scope | No |
---
@ -409,6 +496,7 @@ Go to **Settings → Planlægger** to configure automatic scans.
7. Optionally enable:
- **Send rapport automatisk** — email the Excel report to your configured recipients after each scan.
- **Håndhæv opbevaringspolitik** — automatically delete items older than your retention policy after each scan.
- **Report only** — skip the scan entirely and just email the latest results already in the database. Useful for sending a regular summary email without running a new scan. When enabled, no profile is needed and M365 authentication is not required.
8. Click **Gem** (Save).
The scheduler indicator in the top bar shows the date and time of the next scheduled scan ("Next: …").
@ -440,7 +528,17 @@ Click **Gem** to save, then click **Test** to send a test email and verify the c
> If your account has MFA (two-factor authentication) enabled, you cannot use your regular password. You need to create an **App Password** in your account security settings:
> - **Microsoft personal account**: account.microsoft.com/security → App passwords
> - **Gmail**: myaccount.google.com → Security → 2-Step Verification → App passwords
> - **Gmail / Google Workspace**: myaccount.google.com → Security → 2-Step Verification → App passwords (for Google Workspace accounts your administrator must first allow App Passwords, or set up an SMTP relay)
### Always send via SMTP (skip Microsoft Graph)
When the scanner is signed in to Microsoft 365, it normally sends email through Microsoft 365 directly, without using the SMTP settings above. This is convenient, but it cannot deliver to some addresses — most notably an address on a Google-hosted subdomain of your Microsoft 365 domain, which Microsoft 365 treats as internal and silently discards (no delivery, no error).
Turn on **Send altid via SMTP (spring Microsoft Graph over)** to force all email — test emails, manual reports, and the after-scan auto-email — through the SMTP server you configured above. Use this when your reports go to a mailbox Microsoft 365 won't deliver to (for example a Google Workspace address), with `smtp.gmail.com` / `smtp-relay.gmail.com` as the SMTP host.
### Email report after manual scan
Turn on **Send rapport efter manuel scanning** to automatically email the report to your configured recipients every time a manual scan finishes.
### Sending a report manually
@ -480,6 +578,7 @@ Click **Reset DB** to wipe all scan data, dispositions, and deletion log. This i
| Setting | Description |
|---------|-------------|
| Theme | Dark or light mode |
| Software update | Check for and install new versions of the scanner directly from the browser, or enable automatic daily updates. Only shown on server installations running from a git checkout (not in the desktop app). The app restarts itself after installing; updating is refused while a scan is running, and the next scan after an update continues normally. |
### Security tab
@ -487,6 +586,7 @@ Click **Reset DB** to wipe all scan data, dispositions, and deletion log. This i
|---------|-------------|
| Admin PIN | Optional PIN that protects destructive actions (database reset, replace import) |
| Viewer PIN | Optional 4–8 digit PIN that lets anyone open `/view` in a browser for read-only access to results without a token link |
| Interface PIN | Optional 4–8 digit PIN that must be entered before accessing the main scanner interface. Anyone reaching the scanner URL is redirected to a login page until the correct PIN is entered. Viewer access via `/view` is not affected. |
### Advanced scan options
@ -496,6 +596,31 @@ These options are in the left sidebar under **Indstillinger**:
**Scan photos for faces** — slower scan that detects photographs containing recognisable human faces. Flags them as Article 9 biometric data. Recommended for schools storing student photos.
**Ignore GPS in images** — when enabled, images whose only PII signal is an embedded GPS location are not flagged. Useful when scanning student accounts: smartphones embed GPS coordinates in every photo taken with the camera app, which would otherwise generate large numbers of flags that are low-priority for a school context. If an image is already flagged for another reason (faces, EXIF author field), the GPS coordinate is still shown in the detail card.
**Min. CPR count per file** — only flag a file if it contains at least this many *distinct* CPR numbers. The default is 1 (current behaviour). Setting it to 2 avoids false positives in student scans: a student's own consent form or registration document typically contains only their own CPR number, while a class list or grade sheet containing multiple students' CPRs will still be reported.
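
The threshold rule can be pictured with a short sketch. This is an illustration only, not the scanner's own code: `distinct_cprs` and `should_flag` are hypothetical names, and the pattern is deliberately simplified (no date or checksum validation).

```python
import re

# Simplified, illustrative pattern: six digits, optional hyphen, four digits.
CPR_PATTERN = re.compile(r"\b(\d{6})-?(\d{4})\b")


def distinct_cprs(text: str) -> set[str]:
    """Return the distinct CPR-like numbers in text, normalised without the hyphen."""
    return {birth + serial for birth, serial in CPR_PATTERN.findall(text)}


def should_flag(text: str, min_cpr_count: int = 1) -> bool:
    """Flag only when the text holds at least min_cpr_count distinct numbers."""
    return len(distinct_cprs(text)) >= min_cpr_count
```

With a threshold of 2, a form that repeats one student's number several times is not flagged, while a list containing two different numbers is.
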
**CPR-only mode** — when enabled, items with no CPR numbers (only email addresses, phone numbers, faces, or GPS/EXIF data) are skipped entirely. Use this when you want a lean report focused exclusively on CPR exposure.
**OCR language** — selects the Tesseract language pack(s) used when reading scanned PDFs and images. Default: `Danish + English`. Change to a different preset if your documents are in another language (German, Swedish, French presets are available).
### AI / NER tab
Go to **Settings → AI / NER** to configure Claude AI-powered Named Entity Recognition.
By default the scanner uses spaCy (a local machine-learning model) to detect person names, addresses, and organisation names in document text. Enabling Claude NER replaces this with calls to the Claude Haiku API, which is significantly more accurate — especially for Danish hyphenated surnames (e.g. "Hansen-Nielsen"), foreign-origin names, and names that appear without surrounding context (such as isolated cells in a spreadsheet).
**To enable:**
1. Obtain an Anthropic API key from [console.anthropic.com](https://console.anthropic.com).
2. Paste the key into the **Anthropic API key** field and click **Save**.
3. Turn on the **Enable Claude NER** toggle and click **Save** again.
4. Click **Test key** to confirm the key is valid and the API is reachable.
**Cost:** Claude Haiku is charged per token at Anthropic's published rates. A typical document costs a fraction of a cent. Scan results are cached per document, so re-scanning the same file never incurs a second charge.
**Fallback:** If the `anthropic` package is not installed or the API key is missing, the scanner automatically falls back to spaCy with no error — the toggle simply has no effect.
**Retention policy** — when enabled, marks items older than the specified number of years as overdue. The fiscal year end setting determines how the cutoff date is calculated:
| Option | Cutoff date calculation |
@ -504,6 +629,12 @@ These options are in the left sidebar under **Indstillinger**:
| 31 dec (Bogføringsloven) | Last 31 December minus N years |
| 30 jun / 31 mar | Last occurrence of that date minus N years |
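
The cutoff arithmetic in the table can be sketched roughly as below. This is not the scanner's own code: `retention_cutoff` and its parameters are hypothetical, and "last occurrence" is assumed to mean the most recent fiscal year end on or before today.

```python
from datetime import date


def retention_cutoff(today: date, years: int, fy_month: int = 12, fy_day: int = 31) -> date:
    """Most recent fiscal year end on or before `today`, minus `years` years."""
    fy_end = date(today.year, fy_month, fy_day)
    if fy_end > today:
        # This year's fiscal year end has not happened yet; use last year's.
        fy_end = date(today.year - 1, fy_month, fy_day)
    return date(fy_end.year - years, fy_month, fy_day)
```

Under these assumptions, with a 5-year policy on 22 June 2026, the 31 December option gives 31 December 2020 and the 31 March option gives 31 March 2021.
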
### Audit Log tab
Go to **Settings → Audit Log** to view an immutable log of all significant admin actions performed in the scanner. Each entry shows the time, action type, detail, and client IP address. Recorded events include: profile save/delete, viewer token create/revoke, PIN changes, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, dispositions, item delete, and item redact.
The log is read-only and is stored in the scanner database alongside scan results. It is included in database exports and can help you demonstrate accountability to a supervisory authority.
---
## 15. Frequently Asked Questions
@ -515,10 +646,10 @@ No. CPR numbers found during a scan are stored only as a count (e.g. "3 CPR numb
Emails are moved to the user's **Deleted Items** folder in Exchange — they are not permanently deleted and can be recovered by the user or an administrator. Files are moved to the **recycle bin** of the relevant service (OneDrive, SharePoint, file system). A permanent deletion requires a second action by the user or admin.
**Can I scan without connecting to Microsoft 365?**
Yes. You can scan local and SMB file shares without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths.
Yes. You can scan local folders, SMB/NAS drives, and SFTP servers without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths or SFTP server details.
**What is delta scanning and when should I use it?**
Delta scanning uses Microsoft Graph change tokens to fetch only items modified since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have done a full baseline scan. Enable it in the Options section of the sidebar.
Delta scanning uses Microsoft Graph change tokens (for M365) and the Google Drive Changes API (for Google Workspace) to fetch only items modified since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have done a full baseline scan. Enable it in the Options section of the sidebar.
**The scan stopped — can I continue where it left off?**
Yes. When you restart the scan, a yellow banner will offer to resume from the checkpoint. Click **▶ Genoptag** to continue. If you prefer to start over, click **Start fresh**.
@ -535,9 +666,21 @@ In the accounts section of the sidebar, there is an **+ Tilføj konto manuelt**
**Is the scanner running? I cannot see a progress bar.**
Check the activity log at the bottom of the screen. If a scan is running it will show messages there. If you see nothing, the scan may have completed or not started. Also check that you have at least one source ticked and at least one account selected.
**Can I password-protect the scanner so students or colleagues cannot access it on the network?**
Yes. Go to **Settings → Security → Interface PIN** and set a 4–8 digit PIN. From that point on, anyone who opens the scanner URL in a browser is shown a PIN entry page and cannot proceed without the correct code. This is separate from the Admin PIN (which protects destructive actions) and the Viewer PIN (which protects read-only access). Existing viewer token links still work without the interface PIN.
**Can a reviewer tag dispositions without access to the scan controls?**
Yes. Use the **🔗 Share** button to create a read-only viewer link or set a Viewer PIN in Settings → Security. The reviewer opens the link in their browser and can browse results and tag dispositions without seeing credentials, sources, or scan buttons. See section 10 for details.
**Can I limit a reviewer's link to a specific time period?**
Yes. When creating a token link, use the "Items from" and "Items until" date fields to restrict the link to items modified within that range. The reviewer will only see items whose modification date falls within the window you specified.
**Where can I see who changed what in the scanner?**
Go to **Settings → Audit Log**. Every significant admin action is recorded there with a timestamp, action type, detail, and IP address.
**Will enabling Claude NER increase costs significantly?**
For a typical school or municipality scan the cost is negligible — Claude Haiku charges fractions of a cent per document, and results are cached so the same file is never billed twice. A full scan of 10 000 documents typically costs under $1. The biggest gain is on name-dense documents (class lists, case files) where spaCy previously missed many names.
---
*GDPR Scanner v1.6.14 — for technical setup and configuration see README.md*
*GDPR Scanner v1.7.9 — for technical setup and configuration see README.md*

docs/setup/ZORAXY_SETUP.md Normal file

@ -0,0 +1,148 @@
# HTTPS via Zoraxy Reverse Proxy
Step-by-step guide for putting GDPRScanner behind [Zoraxy](https://github.com/tobychui/zoraxy) with a Let's Encrypt certificate, on a LAN-only deployment.
Why bother on an internal network:
- **Encryption in transit** — the scanner streams CPR numbers, document previews, and share links. Serving that over plain HTTP to DPO reviewers is itself a compliance finding.
- **Secure context** — the browser Clipboard API (share-link Copy buttons) only exists on HTTPS or localhost. Over plain HTTP the app falls back to a legacy copy mechanism.
- **A real hostname**: `https://gdprscanner.example.dk` instead of `http://10.x.x.x:5100` in share links, bookmarks, and emails.
This guide assumes Zoraxy runs **on the same host** as the scanner. If it runs elsewhere, replace `127.0.0.1:5100` with the scanner host's LAN IP and firewall port 5100 to the Zoraxy host only.
---
## 1. DNS record
Create an A-record for the hostname pointing at the server's **LAN IP**:
```
gdprscanner.example.dk A 10.x.x.x
```
A public DNS record pointing at a private IP is fine — outsiders can resolve the name but cannot route to the address, which is exactly the "LAN-only" goal.
> **Consequence:** because the server is not reachable from the internet, Let's Encrypt's default HTTP-01 challenge cannot work. The certificate **must** be issued via the **DNS-01 challenge** (step 4). If you prefer not to publish the internal IP at all, use an internal/split-horizon DNS record instead — DNS-01 still works since it validates against the public DNS zone, not the server.
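
To confirm that the record really points at a LAN address, a small check along these lines may help. It is a sketch, not part of GDPRScanner: `resolves_to_private` is a hypothetical helper, and it relies on Python's `ipaddress.is_private`, which also counts loopback and link-local addresses as private.

```python
import ipaddress
import socket


def resolves_to_private(hostname: str) -> bool:
    """True if every IPv4 address the name resolves to is private per `ipaddress`."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(ipaddress.ip_address(a).is_private for a in addresses)
```

Calling `resolves_to_private("gdprscanner.example.dk")` from a LAN client should return `True` once the A-record is in place; a `False` result suggests the name resolves to a public address.
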
---
## 2. Install Zoraxy
```bash
mkdir -p /opt/zoraxy && cd /opt/zoraxy
wget -O zoraxy https://github.com/tobychui/zoraxy/releases/latest/download/zoraxy_linux_amd64
chmod +x zoraxy
```
`/etc/systemd/system/zoraxy.service`:
```ini
[Unit]
Description=Zoraxy reverse proxy
After=network.target
[Service]
WorkingDirectory=/opt/zoraxy
ExecStart=/opt/zoraxy/zoraxy
Restart=always
[Install]
WantedBy=multi-user.target
```
```bash
systemctl daemon-reload && systemctl enable --now zoraxy
```
Open the management UI at `http://<server-ip>:8000` and create the admin account.
> Menu names below may differ slightly between Zoraxy versions — the concepts to look for are: ACME certificate with DNS challenge, host-based proxy rule, TLS on the incoming port.
---
## 3. Incoming port and TLS
In Zoraxy's global settings:
- Set the incoming proxy port to **443** and enable **TLS**.
- Enable **force-redirect port 80 → 443** so plain-HTTP visits upgrade automatically.
---
## 4. Certificate via ACME (DNS-01)
In **TLS / SSL Certificates → ACME**:
1. Enter the hostname (`gdprscanner.example.dk`).
2. Enable the **DNS challenge** and select the DNS provider that hosts your zone (Cloudflare, Simply.com, etc.).
3. Paste the provider's **API token/credentials** — created in the DNS provider's control panel.
4. Request the certificate. Zoraxy renews it automatically.
If your DNS host has no API, Zoraxy can generate a **self-signed certificate** as a fallback — it works, but every client machine must trust it manually. Getting a DNS API token is the better one-time investment.
---
## 5. Proxy rule
**HTTP Proxy → New Proxy Rule**:
| Field | Value |
|---|---|
| Matching hostname | `gdprscanner.example.dk` |
| Target | `127.0.0.1:5100` |
| TLS to target | Off (the scanner speaks plain HTTP locally) |
---
## 6. Close the side doors
**Bind the scanner to loopback** so only Zoraxy can reach Flask. Wherever the scanner is started (systemd unit or `start_gdpr.sh`), add:
```bash
--host 127.0.0.1
```
After a restart, `http://<server-ip>:5100` stops responding by design. The in-app self-update restart preserves the argument.
Optional hardening:
- Add a Zoraxy **Access Rule** whitelisting your LAN CIDR (e.g. `10.0.0.0/8`) on the proxy rule.
- Firewall the Zoraxy **management port 8000** to admin machines only.
---
## 7. Firewall / perimeter checklist
The Zoraxy whitelist (step 6) is an **application-layer** control — a rejected request has still completed the TCP and TLS handshake against your box, and any proxy host you forget to tag is fully exposed. The firewall is the real perimeter. Work this checklist whenever you stand up or replace the edge firewall:
- [ ] **No inbound port-forward unless a service is intentionally public.** A LAN-only deployment needs *zero* inbound forwards — DNS-01 (step 4) is outbound-only, so certificates issue and renew with the firewall fully closed.
- [ ] **If any service is intentionally public** (e.g. a media server), forward **443 only to the Zoraxy host** — never to individual app hosts. Everything then enters through Zoraxy, where the per-host Access Rule decides public vs. private.
- [ ] **The per-host whitelist stays your public/private boundary even with the firewall in place** — it is not made redundant by the firewall. Public hosts use the `default` rule; every internal-only host gets **Local Access Only**.
- [ ] **New proxy hosts default to public.** Zoraxy applies the `default` rule to any host with no rule set, so a freshly-added internal service is reachable the moment it exists. Set its Access Rule to **Local Access Only** *at creation time*.
- [ ] **Management ports are LAN-only.** Zoraxy admin (`:8000`) and any app admin UI must never be forwarded; tag them **Local Access Only** as well.
- [ ] **Verify from off-network.** From a connection outside the LAN (e.g. a phone on mobile data), confirm private hostnames are blocked and only the intentionally-public ones respond:
```bash
curl -v https://gdprscanner.example.dk # should fail/refuse from outside
nmap -Pn -p 80,443,5100 <your-public-IP> # only intentionally-open ports listed
```
---
## 8. Verify the scanner-specific behaviour
1. `https://gdprscanner.example.dk` loads with a valid padlock; `http://` redirects.
2. **Run a scan and watch result cards stream in live** — that is the Server-Sent Events connection (`/api/scan/stream`) passing through the proxy. If progress stalls while the scan log advances, look at proxy buffering/timeout settings.
3. Create a **share link** — it must start with `https://gdprscanner.example.dk/view?token=…`. The app uses the page origin automatically on HTTPS (the LAN-IP rewrite only applies when browsing at localhost). The Copy buttons now use the native Clipboard API.
4. **Settings → General → Software update → Check for updates** still works (outbound git fetch is unaffected by the proxy).
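
If live streaming through the proxy is in doubt, arrival times can be checked with a sketch like the one below. It assumes that `/api/scan/stream` emits standard Server-Sent Events `data:` lines and that the URL is reachable without extra authentication; the payload format is not specified here, and `iter_sse_data` and `watch` are hypothetical helpers.

```python
import time
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each `data:` line in a Server-Sent Events stream."""
    for raw in lines:
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            yield payload[1:] if payload.startswith(" ") else payload


def watch(url: str, limit: int = 10) -> None:
    """Print arrival times for the first `limit` events; long gaps then bursts suggest buffering."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:  # the response iterates line by line
        for n, payload in enumerate(iter_sse_data(resp), 1):
            print(f"{time.monotonic() - start:6.1f}s  {payload[:60]}")
            if n >= limit:
                break
```

Run `watch("https://gdprscanner.example.dk/api/scan/stream")` while a scan is in progress. Events that arrive steadily indicate the proxy is passing the stream through; events that arrive all at once point to buffering.
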
---
## Troubleshooting
| Symptom | Cause / fix |
|---|---|
| Certificate request fails | HTTP-01 attempted against an unreachable host — make sure the **DNS challenge** is selected and the API credentials are for the zone's actual DNS host |
| Cards don't stream during scans | Proxy buffering the SSE response — check Zoraxy timeout/buffering settings for the rule |
| Share links still show the LAN IP | Page was loaded via the old `http://<ip>:5100` URL — use the HTTPS hostname; links follow the page origin |
| `http://<ip>:5100` still reachable | The `--host 127.0.0.1` flag is missing from the scanner's launch command |


@ -53,6 +53,21 @@ import sys
from datetime import date, datetime, timedelta
from pathlib import Path
try:
    import psutil as _psutil
    _PSUTIL_OK = True
except ImportError:
    _PSUTIL_OK = False

_OCR_MEM_THRESHOLD_MB = 500


def _ocr_mem_ok() -> bool:
    """Return False if available RAM is below the threshold for OCR rendering."""
    if not _PSUTIL_OK:
        return True
    return _psutil.virtual_memory().available >= _OCR_MEM_THRESHOLD_MB * 1024 * 1024

# Suppress pdfminer's noisy font-descriptor warnings that appear when PDFs
# contain malformed or incomplete font definitions. These do not affect text
# extraction or CPR detection — the warning is informational only.
@ -102,6 +117,12 @@ try:
except ImportError:
    SPACY_OK = False

try:
    import anthropic as _anthropic
    ANTHROPIC_OK = True
except ImportError:
    ANTHROPIC_OK = False

try:
    from docx import Document as DocxDocument
    DOCX_OK = True
@ -217,6 +238,91 @@ def load_nlp():
return None
# ── Claude NER ────────────────────────────────────────────────────────────────
def _get_claude_ner_config() -> "tuple[bool, str]":
    """Read Claude NER settings from config.json. Small file — OS-cached."""
    try:
        from app_config import _load_config, get_claude_api_key
        cfg = _load_config()
        return bool(cfg.get("claude_ner")), get_claude_api_key()
    except Exception:
        return False, ""


_CLAUDE_NER_CACHE: "dict[int, list[dict]]" = {}
_CLAUDE_NER_LOCK = None


def _claude_lock():
    global _CLAUDE_NER_LOCK
    if _CLAUDE_NER_LOCK is None:
        import threading as _th
        _CLAUDE_NER_LOCK = _th.Lock()
    return _CLAUDE_NER_LOCK


def _ner_claude(text: str, api_key: str) -> "list[dict]":
    """
    Extract named entities via Claude Haiku. Returns list of
    {"text": str, "type": "NAME"|"ADDRESS"|"ORG"}.
    In-memory cache keyed by hash(text); evicts oldest when > 2000 entries.
    """
    if not ANTHROPIC_OK or not api_key:
        return []
    cache_key = hash(text)
    lock = _claude_lock()
    with lock:
        if cache_key in _CLAUDE_NER_CACHE:
            return _CLAUDE_NER_CACHE[cache_key]
    try:
        import json as _json
        client = _anthropic.Anthropic(api_key=api_key)
        CHUNK = 8_000
        entities: "list[dict]" = []
        for i in range(0, min(len(text), CHUNK * 10), CHUNK):
            chunk = text[i : i + CHUNK]
            if not chunk.strip():
                continue
            msg = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{
                    "role": "user",
                    "content": (
                        "Extract personal data from the text. "
                        "Return ONLY valid JSON: "
                        "{\"entities\":[{\"text\":\"<exact substring>\","
                        "\"type\":\"NAME\"|\"ADDRESS\"|\"ORG\"}]}. "
                        "NAME=person names, ADDRESS=physical addresses, "
                        "ORG=organisation names. "
                        "Skip CPR numbers, emails, phones, dates. "
                        "Return {\"entities\":[]} if none.\n\nTEXT:\n" + chunk
                    ),
                }],
            )
            raw = msg.content[0].text.strip()
            if "```" in raw:
                raw = raw.split("```")[1]
                if raw.startswith("json\n"):
                    raw = raw[5:]
            entities.extend(_json.loads(raw).get("entities", []))
        result = [e for e in entities
                  if isinstance(e, dict) and e.get("text") and e.get("type")]
    except Exception:
        result = []
    with lock:
        if len(_CLAUDE_NER_CACHE) >= 2_000:
            try:
                del _CLAUDE_NER_CACHE[next(iter(_CLAUDE_NER_CACHE))]
            except Exception:
                pass
        _CLAUDE_NER_CACHE[cache_key] = result
    return result


# ── OCR page cache ───────────────────────────────────────────────────────────
_OCR_CACHE_PATH = Path.home() / ".document_scanner_ocr_cache.db"
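Review note: the reply handling in `_ner_claude` (strip an optional Markdown fence, parse the JSON, keep only well-formed entities) can be exercised without an API key. Below is a minimal sketch of that step. `parse_entities` and `FENCE` are hypothetical names that are not part of the patch, and the extra type check on `entities` is an addition for illustration.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def parse_entities(raw: str) -> list[dict]:
    """Parse a model reply into entity dicts (hypothetical helper, not in the patch)."""
    raw = raw.strip()
    if FENCE in raw:
        # Keep only what sits between the first pair of fences.
        raw = raw.split(FENCE)[1]
        if raw.startswith("json\n"):
            raw = raw[len("json\n"):]
    try:
        entities = json.loads(raw).get("entities", [])
    except (ValueError, AttributeError):
        # Not JSON, or JSON whose top level is not an object.
        return []
    if not isinstance(entities, list):
        return []
    return [e for e in entities
            if isinstance(e, dict) and e.get("text") and e.get("type")]
```

Unlike the patch, which discards the results of every chunk when one reply fails to parse, a helper like this would let a single bad chunk be skipped. Whether that is desirable is a design choice for the author.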
@ -728,8 +834,15 @@ def count_pii_types(text: str, use_ner: bool = True) -> dict:
if 1 <= int(reg) <= 9999 and len(acct) >= 6:
counts["BANK_ACCOUNT"] += 1
# NER-based counts — only run if model is loaded and text is non-trivial
# NER-based counts — Claude (if enabled) else spaCy
if use_ner and len(text.strip()) > 20:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_t = ent.get("type")
if _t in counts:
counts[_t] += 1
else:
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
@ -887,21 +1000,26 @@ def find_pii_spans_in_text(text: str, use_ner: bool = True) -> list[tuple[int, i
if _is_name_match(m):
spans.append((m.start(), m.end(), "NAME"))
# NER (names, addresses, orgs)
# Cap at 20 000 chars per call — spaCy NER is O(n) but dense tabular text
# (e.g. Excel-converted PDFs) can have thousands of tokens per page and stall.
#
# Context boosting: spaCy needs sentence context to recognise isolated names.
# For short text (< 80 chars, e.g. a single cell or line) we prepend a label
# so the model sees "Navn: Peter Hansen" instead of bare "Peter Hansen".
# Matches are shifted back by the prefix length before being recorded.
# NER spans — Claude (if enabled) else spaCy
if use_ner:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_label = ent.get("type")
_ent_text = ent.get("text", "")
if not _ent_text or _label not in ("NAME", "ADDRESS", "ORG"):
continue
for _m in re.finditer(re.escape(_ent_text), text):
spans.append((_m.start(), _m.end(), _label))
else:
# spaCy NER — cap at 20 000 chars per call (dense tabular text can stall).
# Context boosting: prepend "Navn: " for short/isolated text so spaCy
# sees sentence context; shift match positions back by prefix length.
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
PREFIX = "Navn: "
PLEN = len(PREFIX)
# Only inject prefix for short/isolated text
if len(text.strip()) < 80:
ner_input = PREFIX + text
ner_offset = -PLEN
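Review note: the Claude branch above turns entity strings back into character spans by searching the original text. The sketch below isolates that step so its behaviour is easy to check. `entity_spans` is a hypothetical name; the label set is the one used in the patch.

```python
import re


def entity_spans(text: str, entities: list[dict]) -> list[tuple[int, int, str]]:
    """Locate every occurrence of each entity string in text (illustrative helper)."""
    spans = []
    for ent in entities:
        label = ent.get("type")
        needle = ent.get("text", "")
        if not needle or label not in ("NAME", "ADDRESS", "ORG"):
            continue
        # re.escape makes the entity a literal pattern, so characters such
        # as "." or "(" in a name or address are not read as regex syntax.
        for m in re.finditer(re.escape(needle), text):
            spans.append((m.start(), m.end(), label))
    return spans
```

Two consequences follow from this approach: every occurrence of the string is marked, not only the one the model had in mind, and an entity that is not an exact substring of the text produces no span at all.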
@ -1144,11 +1262,6 @@ def redact_pdf_secure(input_path: Path, output_path: Path, results: dict,
page_methods = results["page_methods"]
images = None
ocr_pages = [p for p, m in page_methods.items() if m == "ocr"]
if ocr_pages and OCR_AVAILABLE:
images = convert_from_path(str(input_path), dpi=dpi, poppler_path=poppler_path)
total = 0
doc = _fitz.open(str(input_path))
@ -1161,10 +1274,20 @@ def redact_pdf_secure(input_path: Path, output_path: Path, results: dict,
if method == "text":
bboxes = (find_pii_char_bboxes(plumb_page, use_ner=use_ner)
if use_ner else find_cpr_char_bboxes(plumb_page))
elif method == "ocr" and images is not None:
img = images[page_num - 1]
elif method == "ocr" and OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped redact — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
bboxes = []
else:
_imgs = convert_from_path(
str(input_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
img = _imgs[0]
del _imgs
bboxes = (find_pii_image_bboxes(img, lang, use_ner=use_ner)
if use_ner else find_cpr_image_bboxes(img, lang))
del img
else:
bboxes = []
@ -1227,11 +1350,6 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
reader = PdfReader(str(input_path))
writer = PdfWriter()
images = None
ocr_pages = [p for p, m in page_methods.items() if m == "ocr"]
if ocr_pages and OCR_AVAILABLE:
images = convert_from_path(str(input_path), dpi=dpi, poppler_path=poppler_path)
total = 0
with pdfplumber.open(input_path) as plumb_pdf:
for page_num, plumb_page in enumerate(plumb_pdf.pages, start=1):
@ -1247,8 +1365,17 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
else:
writer.add_page(reader_page)
elif method == "ocr" and images is not None:
img = images[page_num - 1]
elif method == "ocr" and OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped redact — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
writer.add_page(reader_page)
continue
_imgs = convert_from_path(
str(input_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
img = _imgs[0]
del _imgs
bboxes = (find_pii_image_bboxes(img, lang, use_ner=use_ner)
if use_ner else find_cpr_image_bboxes(img, lang))
if bboxes:
@ -1260,6 +1387,7 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
total += len(bboxes)
else:
writer.add_page(reader_page)
del img
else:
writer.add_page(reader_page)
@ -2048,29 +2176,30 @@ def scan_pdf(pdf_path: Path, force_ocr=False, lang="dan+eng",
results = {"cprs": [], "dates": [], "page_methods": {}}
with pdfplumber.open(pdf_path) as pdf:
images = None
if OCR_AVAILABLE:
needs_ocr = (list(range(len(pdf.pages))) if force_ocr
else [i for i, p in enumerate(pdf.pages) if not is_text_page(p)])
if needs_ocr:
print(f" Rendering pages to images for OCR (DPI={dpi})...", flush=True)
images = convert_from_path(str(pdf_path), dpi=dpi, poppler_path=poppler_path)
for page_num, page in enumerate(pdf.pages, start=1):
use_text = not force_ocr and is_text_page(page)
if use_text:
method = "text"
text = page.extract_text() or ""
cprs, dates = extract_matches(text, page_num, "text")
elif OCR_AVAILABLE and images is not None:
elif OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
method = "skipped"
cprs, dates = [], []
else:
print(f" Rendering page {page_num} for OCR (DPI={dpi})...", flush=True)
_imgs = convert_from_path(
str(pdf_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
_img = _imgs[0]
del _imgs
method = "ocr"
_img = images[page_num-1]
images[page_num-1] = None # release PIL image as soon as OCR is done
cprs, dates = extract_matches(ocr_page_cached(_img, lang), page_num, "ocr")
del _img
else:
method = "skipped"
if not OCR_AVAILABLE:
print(f" Page {page_num}: image-based but OCR unavailable.")
cprs, dates = [], []
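Review note: the three PDF hunks above share one pattern: render a single page, use it, release it, and skip the page when memory is low. The sketch below shows that control flow with the renderer and the memory check injected, so it runs without Poppler. `ocr_pages_lazily`, `render_page` and `mem_ok` are hypothetical names. In the patch the corresponding calls are `convert_from_path(..., first_page=page_num, last_page=page_num)[0]` and `_ocr_mem_ok()`.

```python
from typing import Callable, Iterator


def ocr_pages_lazily(
    page_count: int,
    render_page: Callable[[int], object],
    mem_ok: Callable[[], bool],
) -> Iterator[tuple[int, str, object]]:
    """Yield (page_num, method, image) one page at a time (illustrative only)."""
    for page_num in range(1, page_count + 1):
        if not mem_ok():
            # Low memory: report the page as skipped instead of rendering it.
            yield page_num, "skipped", None
            continue
        # Only this page's image is created here; the caller is expected to
        # drop it before asking for the next page.
        yield page_num, "ocr", render_page(page_num)
```

Peak memory then scales with one rendered page instead of the whole document, at the cost of one renderer invocation per OCR page.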


@ -24,6 +24,8 @@ import hashlib
from pathlib import Path, PurePosixPath
from typing import Iterator
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
# ── Optional dependency flags ─────────────────────────────────────────────────
try:
@ -58,19 +60,8 @@ except ImportError:
KEYCHAIN_SERVICE = "gdpr-scanner-nas"
# File extensions passed through to _scan_bytes(). Matches SUPPORTED_EXTS in
# gdpr_scanner.py; kept here too so FileScanner can filter without importing it.
DEFAULT_EXTENSIONS = {
".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
".txt", ".eml", ".msg",
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
".heic", ".heif",
}
# Extensions for local/SMB file scans — PDFs now included; OCR runs in a spawned
# subprocess with a 60-second hard timeout via _scan_bytes_timeout so hanging
# Tesseract/Poppler processes can never block the scan thread indefinitely.
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
# DEFAULT_EXTENSIONS is imported from cpr_detector.SUPPORTED_EXTS — single source of truth.
# Adding a new file type to cpr_detector.py automatically extends local/SMB scans too.
# Maximum file size to load into memory (bytes). Files larger than this are
# skipped with a warning — same guard used by the M365 attachment scanner.
@ -147,7 +138,7 @@ def store_smb_password(smb_host: str, smb_user: str,
class FileScanner:
"""Unified local + SMB/CIFS file iterator."""
FILE_SCAN_EXTENSIONS = FILE_SCAN_EXTENSIONS # excludes .pdf
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
"""Unified iterator over local paths and SMB/CIFS network shares.
Usage::
@ -209,7 +200,7 @@ class FileScanner:
Args:
extensions: Set of lowercase extensions to include, e.g. {".pdf", ".docx"}.
Defaults to DEFAULT_EXTENSIONS.
Defaults to DEFAULT_EXTENSIONS (cpr_detector.SUPPORTED_EXTS).
progress_cb: Optional callable(rel_path) called before each file is read,
so the caller can update a progress indicator.
@ -560,6 +551,68 @@ def _smb_read_file(tree, smb_path: str) -> bytes:
fh.close(get_attributes=False)
def write_smb_file(smb_path_uri: str, content: bytes,
username: str, password: str, domain: str = "") -> None:
"""Overwrite an SMB file at smb_path_uri (e.g. '//host/share/folder/file.docx').
Raises RuntimeError if smbprotocol is not installed.
Raises ValueError if the path cannot be parsed.
All SMB errors propagate as-is.
"""
if not SMB_OK:
raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
norm = smb_path_uri.replace("\\", "/").lstrip("/")
parts = norm.split("/", 2)
if len(parts) < 2:
raise ValueError(f"Cannot parse SMB path '{smb_path_uri}' — expected //host/share[/path]")
host = parts[0]
share = parts[1]
file_rel = parts[2].replace("/", "\\") if len(parts) > 2 else ""
if not host or not share or not file_rel:
raise ValueError(f"Cannot parse SMB path '{smb_path_uri}'")
import uuid as _uuid
conn = Connection(_uuid.uuid4(), host, 445)
conn.connect(timeout=30)
try:
session = Session(conn, username=username, password=password,
require_encryption=False)
if domain:
session.username = f"{domain}\\{username}"
session.connect()
try:
tree = TreeConnect(session, f"\\\\{host}\\{share}")
tree.connect()
try:
fh = Open(tree, file_rel)
fh.create(
ImpersonationLevel.Impersonation,
FilePipePrinterAccessMask.FILE_WRITE_DATA |
FilePipePrinterAccessMask.FILE_WRITE_ATTRIBUTES,
FileAttributes.FILE_ATTRIBUTE_NORMAL,
ShareAccess.FILE_SHARE_NONE,
CreateDisposition.FILE_SUPERSEDE,
CreateOptions.FILE_NON_DIRECTORY_FILE,
)
try:
chunk_size = 1024 * 1024
offset = 0
while offset < len(content):
chunk = content[offset:offset + chunk_size]
fh.write(chunk, offset)
offset += len(chunk)
finally:
fh.close(get_attributes=False)
finally:
tree.disconnect()
finally:
session.disconnect()
finally:
conn.disconnect()
def _smb_ts(windows_ts: int) -> str:
"""Convert Windows FILETIME (100ns intervals since 1601-01-01) to YYYY-MM-DD."""
if not windows_ts:
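Review note: the path handling at the top of `write_smb_file` is easy to test in isolation. The sketch below follows the same normalisation steps. `split_smb_uri` is a hypothetical name, and it folds the two `ValueError` checks from the patch into one.

```python
def split_smb_uri(uri: str) -> tuple[str, str, str]:
    """Split '//host/share/dir/file' into (host, share, backslash-separated path)."""
    # Accept both UNC backslashes and forward slashes, then drop leading slashes.
    norm = uri.replace("\\", "/").lstrip("/")
    parts = norm.split("/", 2)
    host = parts[0]
    share = parts[1] if len(parts) > 1 else ""
    # SMB expects backslashes inside the share-relative path.
    file_rel = parts[2].replace("/", "\\") if len(parts) > 2 else ""
    if not host or not share or not file_rel:
        raise ValueError(f"Cannot parse SMB path {uri!r}")
    return host, share, file_rel
```

For example, `split_smb_uri("//nas/docs/hr/file.docx")` yields the host `nas`, the share `docs`, and the relative path `hr\file.docx`.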


@ -6,7 +6,7 @@ Stores scan results alongside the existing JSON cache. Neither replaces the
other: JSON is fast and portable, SQLite enables querying, trending, and the
data-subject index.
Database location: ~/.gdpr_scanner.db (configurable via DB_PATH)
Database location: ~/.gdprscanner/scanner.db (configurable via DB_PATH)
Schema
------
@ -29,11 +29,14 @@ Usage (from gdpr_scanner.py)
import hashlib
import json
import logging
import sqlite3
import time
from pathlib import Path
from typing import Iterator
logger = logging.getLogger(__name__)
from pathlib import Path as _P
_DATA_DIR = _P.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True)
@ -180,6 +183,17 @@ CREATE INDEX IF NOT EXISTS idx_dellog_time ON deletion_log(deleted_at);
CREATE INDEX IF NOT EXISTS idx_dellog_item ON deletion_log(item_id);
CREATE INDEX IF NOT EXISTS idx_dellog_reason ON deletion_log(reason);
CREATE TABLE IF NOT EXISTS audit_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts REAL NOT NULL,
action TEXT NOT NULL DEFAULT '',
actor TEXT NOT NULL DEFAULT '',
detail TEXT NOT NULL DEFAULT '',
ip TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_audit_ts ON audit_log(ts);
CREATE INDEX IF NOT EXISTS idx_audit_action ON audit_log(action);
-- Indexes
CREATE INDEX IF NOT EXISTS idx_items_scan ON flagged_items(scan_id);
CREATE INDEX IF NOT EXISTS idx_items_source ON flagged_items(source_type);
@ -200,6 +214,9 @@ _MIGRATIONS: list[tuple[int, str]] = [
(4, "ALTER TABLE flagged_items ADD COLUMN face_count INTEGER NOT NULL DEFAULT 0"),
(5, "ALTER TABLE flagged_items ADD COLUMN exif_json TEXT NOT NULL DEFAULT '{}'"),
(6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
(8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
(9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
(10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
(7, """CREATE TABLE IF NOT EXISTS schedule_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at REAL NOT NULL,
@ -211,6 +228,7 @@ _MIGRATIONS: list[tuple[int, str]] = [
emailed INTEGER NOT NULL DEFAULT 0,
error TEXT NOT NULL DEFAULT ''
)"""),
(11, "ALTER TABLE flagged_items ADD COLUMN account_name TEXT NOT NULL DEFAULT ''"),
]
@ -311,8 +329,9 @@ class ScanDB:
(id, scan_id, name, source, source_type, account_id, folder,
url, drive_id, size_kb, modified, cpr_count, risk,
thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
special_category, face_count, exif_json, full_path, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
special_category, face_count, exif_json, full_path,
email_count, phone_count, body_excerpt, account_name, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
card.get("id", ""),
scan_id,
@ -336,6 +355,10 @@ class ScanDB:
card.get("face_count", 0),
json.dumps(card.get("exif", {})),
card.get("full_path", ""),
card.get("email_count", 0),
card.get("phone_count", 0),
card.get("body_excerpt", ""),
card.get("account_name", ""),
now,
),
)
@ -414,6 +437,33 @@ class ScanDB:
c.commit()
def finalize_orphan_scans(self) -> int:
"""Finalise scans left unfinished by a crash, kill, or mid-scan restart.
After a fresh process start nothing is scanning, so any scan still
carrying finished_at IS NULL is dead: the process that owned it is gone.
Its already-saved flagged_items were stranded: both get_session_items
and get_open_items require finished_at, so those items are invisible and
effectively lost. Finalising the orphans on startup makes them show up
and prevents permanent data loss from interrupted scans (the M365 and
Google engines return early on abort and never reach finish_scan; only
the file scan finalises in a finally block).
Safe to call only when no scan is running (i.e. at startup). Returns the
number of scans finalised.
"""
rows = self._connect().execute(
"SELECT id, total_scanned FROM scans WHERE finished_at IS NULL"
).fetchall()
count = 0
for sid, total in rows:
try:
self.finish_scan(sid, total or 0)
count += 1
except Exception as e:
logger.warning("[db] finalize_orphan_scans: scan %s failed: %s", sid, e)
return count
# ── Query helpers ─────────────────────────────────────────────────────────
def latest_scan_id(self) -> int | None:
@ -442,14 +492,60 @@ class ScanDB:
result.append(d)
return result
def get_session_items(self, window_seconds: int = 300) -> list[dict]:
def get_sessions(self, limit: int = 50, window_seconds: int = 300) -> list[dict]:
"""Return scan sessions (groups of concurrent scans) newest-first.
Concurrent M365 + Google + File scans each get their own scan_id but start
within seconds of each other. This method groups them into logical sessions
by the same 300-second window used by get_session_items().
"""
rows = self._connect().execute(
"""SELECT id, started_at, finished_at, sources, flagged_count, total_scanned, delta
FROM scans WHERE finished_at IS NOT NULL ORDER BY started_at ASC"""
).fetchall()
# Group consecutive scans started within window_seconds of each other
groups: list[list[dict]] = []
for r in rows:
d = dict(r)
d["sources"] = json.loads(d.get("sources") or "[]")
if groups and d["started_at"] - groups[-1][0]["started_at"] <= window_seconds:
groups[-1].append(d)
else:
groups.append([d])
# Build session summaries newest-first
sessions: list[dict] = []
for grp in reversed(groups):
ref = grp[-1] # highest scan_id in group (last in ASC order)
sessions.append({
"ref_scan_id": ref["id"],
"started_at": grp[0]["started_at"],
"finished_at": ref.get("finished_at"),
"sources": list({s for g in grp for s in g["sources"]}),
"flagged_count": sum(g["flagged_count"] or 0 for g in grp),
"total_scanned": sum(g["total_scanned"] or 0 for g in grp),
"delta": any(bool(g["delta"]) for g in grp),
})
if len(sessions) >= limit:
break
return sessions
def get_session_items(self, window_seconds: int = 300,
ref_scan_id: int | None = None) -> list[dict]:
"""Return flagged items from all scans in the same session as the latest scan.
A session is all scans whose started_at is within *window_seconds* of the
most recently started completed scan. This captures concurrent M365, Google,
and file scans which each create their own scan_id but start within seconds
of each other.
If *ref_scan_id* is given, the session is anchored to that scan's started_at
instead of the latest scan.
"""
if ref_scan_id:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
).fetchone()
else:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
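Review note: the grouping rule in `get_sessions` compares each scan with the first scan of the current group, not with the previous scan. The sketch below reduces it to start times so the rule is visible. `group_sessions` is a hypothetical name.

```python
def group_sessions(starts: list[float], window_seconds: int = 300) -> list[list[float]]:
    """Group start times; a scan joins a group if it began within
    window_seconds of that group's FIRST scan (illustrative only)."""
    groups: list[list[float]] = []
    for t in sorted(starts):
        if groups and t - groups[-1][0] <= window_seconds:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups
```

Anchoring on the first scan keeps a session from growing without bound through a chain of closely spaced scans. One thing worth checking: `get_session_items` selects by `started_at BETWEEN ref - window AND ref + window` around the reference scan, which is the last scan of a group, so the two methods may disagree about which scans belong to a session when groups start less than a window apart.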
@ -461,9 +557,9 @@ class ScanDB:
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.started_at >= ? AND s.finished_at IS NOT NULL
WHERE s.started_at BETWEEN ? AND ? AND s.finished_at IS NOT NULL
ORDER BY fi.cpr_count DESC""",
(latest_start - window_seconds,),
(latest_start - window_seconds, latest_start + window_seconds),
).fetchall()
result = []
for r in rows:
@ -472,6 +568,98 @@ class ScanDB:
result.append(d)
return result
def get_open_items(self) -> list[dict]:
"""Return every flagged item across all scans that has no action taken.
"Open" means the item has no disposition row (or a row whose status is
still 'unreviewed'). Unlike get_session_items this is NOT limited to the
latest scan window; it surfaces all outstanding items so nothing slips
out of view once a newer scan starts a fresh session.
flagged_items has a composite PK of (id, scan_id), so the same logical
item appears once per scan that flagged it. We deduplicate by id, keeping
the row from the most recent finished scan, so each open item shows once.
"""
rows = self._connect().execute(
"""SELECT fi.*, COALESCE(d.status, 'unreviewed') AS disposition
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.finished_at IS NOT NULL
AND (d.item_id IS NULL OR d.status = 'unreviewed')
AND fi.scan_id = (
SELECT MAX(fi2.scan_id)
FROM flagged_items fi2
JOIN scans s2 ON fi2.scan_id = s2.id
WHERE fi2.id = fi.id AND s2.finished_at IS NOT NULL
)
ORDER BY fi.cpr_count DESC""",
).fetchall()
result = []
for r in rows:
d = dict(r)
d["attachments"] = json.loads(d.get("attachments") or "[]")
result.append(d)
return result
def get_related_items(self, item_id: str, ref_scan_id: int | None = None,
window_seconds: int = 300) -> list[dict]:
"""Return flagged items from the same session that share at least one CPR
hash with *item_id*, ordered by number of shared CPRs descending."""
if ref_scan_id:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
).fetchone()
else:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
if not row:
return []
latest_start = row[0]
rows = self._connect().execute(
"""SELECT fi.*, COUNT(DISTINCT ci2.cpr_hash) AS shared_cprs
FROM cpr_index ci1
JOIN cpr_index ci2 ON ci2.cpr_hash = ci1.cpr_hash
JOIN flagged_items fi ON fi.id = ci2.item_id
JOIN scans s ON fi.scan_id = s.id
WHERE ci1.item_id = ?
AND fi.id != ?
AND s.started_at BETWEEN ? AND ?
AND s.finished_at IS NOT NULL
GROUP BY fi.id
ORDER BY shared_cprs DESC, fi.cpr_count DESC""",
(item_id, item_id, latest_start - window_seconds, latest_start + window_seconds),
).fetchall()
return [dict(r) for r in rows]
def get_session_sources(self, window_seconds: int = 300) -> set:
"""Return the union of all source keys scanned in the current session.
Reads the ``sources`` JSON array stored in each scan record that belongs
to the same session as the latest completed scan. This is used by the
export builders so they can show every scanned source in summary tables
even when a source produced zero flagged items.
"""
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
if not row:
return set()
latest_start = row[0]
rows = self._connect().execute(
"""SELECT sources FROM scans
WHERE started_at >= ? AND finished_at IS NOT NULL""",
(latest_start - window_seconds,),
).fetchall()
result: set = set()
for r in rows:
try:
result.update(json.loads(r[0] or "[]"))
except Exception:
pass
return result
def lookup_data_subject(self, cpr: str) -> list[dict]:
"""Find all flagged items containing a given CPR number (by hash)."""
cpr_hash = hashlib.sha256(str(cpr).encode()).hexdigest()
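Review note: the deduplication in `get_open_items` can be checked against an in-memory database. The sketch below runs the query from the patch against a reduced schema. The three-table layout is trimmed to the columns the query touches, and the `'deleted'` status value is an assumption made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE scans (id INTEGER PRIMARY KEY, finished_at REAL);
CREATE TABLE flagged_items (id TEXT, scan_id INTEGER, cpr_count INTEGER,
                            PRIMARY KEY (id, scan_id));
CREATE TABLE dispositions (item_id TEXT PRIMARY KEY, status TEXT);
INSERT INTO scans VALUES (1, 100.0), (2, 200.0), (3, NULL);
INSERT INTO flagged_items VALUES
    ('a', 1, 1), ('a', 2, 4),   -- same item flagged by two finished scans
    ('b', 1, 2),                -- already actioned, see dispositions
    ('c', 3, 9);                -- belongs to an unfinished scan
INSERT INTO dispositions VALUES ('b', 'deleted');
""")

OPEN_ITEMS_SQL = """
SELECT fi.id, fi.scan_id, COALESCE(d.status, 'unreviewed') AS disposition
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.finished_at IS NOT NULL
  AND (d.item_id IS NULL OR d.status = 'unreviewed')
  AND fi.scan_id = (
        SELECT MAX(fi2.scan_id)
        FROM flagged_items fi2
        JOIN scans s2 ON fi2.scan_id = s2.id
        WHERE fi2.id = fi.id AND s2.finished_at IS NOT NULL
  )
ORDER BY fi.cpr_count DESC
"""
open_items = con.execute(OPEN_ITEMS_SQL).fetchall()
```

With this data only item `a` from scan 2 remains: the older copy of `a` loses to the newer scan, `b` has a disposition, and `c` sits in a scan that never finished, which is the case `finalize_orphan_scans` is meant to repair.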
@ -698,6 +886,34 @@ class ScanDB:
).fetchone()[0] or 0
return {"total": total, "by_reason": by_reason, "cpr_hits_deleted": cpr_deleted}
# ── Compliance audit log ──────────────────────────────────────────────────
def log_audit(self, action: str, detail: str = "",
actor: str = "", ip: str = "") -> None:
"""Write an immutable compliance audit record."""
c = self._connect()
c.execute(
"INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?,?,?,?,?)",
(time.time(), action, actor, detail, ip),
)
c.commit()
def get_audit_log(self, limit: int = 200,
action: str | None = None) -> list[dict]:
"""Return audit records, most recent first."""
c = self._connect()
if action:
rows = c.execute(
"SELECT * FROM audit_log WHERE action=? ORDER BY ts DESC LIMIT ?",
(action, limit),
).fetchall()
else:
rows = c.execute(
"SELECT * FROM audit_log ORDER BY ts DESC LIMIT ?",
(limit,),
).fetchall()
return [dict(r) for r in rows]
def delete_item_record(self, item_id: str, scan_id: int | None = None) -> None:
"""Remove a flagged item from the DB (after it has been deleted in M365)."""
c = self._connect()
@ -946,6 +1162,15 @@ class ScanDB:
_db: ScanDB | None = None
def log_audit_event(action: str, detail: str = "",
actor: str = "", ip: str = "") -> None:
"""Write an audit record to the shared DB. Silently no-ops if DB unavailable."""
try:
get_db().log_audit(action, detail, actor=actor, ip=ip)
except Exception:
pass
def get_db(path: Path = DB_PATH) -> ScanDB:
"""Return the module-level ScanDB singleton, creating it if needed."""
global _db


@ -146,7 +146,7 @@ _migrate_to_data_dir()
# ── Flask ─────────────────────────────────────────────────────────────────────
try:
from flask import Flask, Response, jsonify, render_template, request, session
from flask import Flask, Response, jsonify, redirect, render_template, request, session
except ImportError:
print("Flask required: pip install flask")
sys.exit(1)
@ -251,7 +251,7 @@ from app_config import (
from checkpoint import (
_checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _save_delta_tokens,
_CHECKPOINT_PATH, _DELTA_PATH,
_cp_path, _DELTA_PATH,
)
from sse import broadcast, _sse_queues, _sse_buffer
@ -260,8 +260,8 @@ import sse as _sse_mod # for _current_scan_id access at call time
from cpr_detector import (
_scan_bytes, _scan_bytes_timeout, _scan_text_direct, _html_esc, _get_pii_counts,
_make_thumb, _placeholder_svg,
_extract_exif, _detect_photo_faces,
SUPPORTED_EXTS, PHOTO_EXTS,
_extract_exif, _extract_video_metadata, _extract_audio_metadata, _detect_photo_faces,
SUPPORTED_EXTS, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS,
_EXIF_PII_TAGS,
)
# Inject runtime deps into cpr_detector
@ -285,12 +285,16 @@ _se.FILE_SCANNER_OK = FILE_SCANNER_OK
_se.CONNECTOR_OK = CONNECTOR_OK
_se.DB_OK = DB_OK
_se.PHOTO_EXTS = PHOTO_EXTS
_se.VIDEO_EXTS = VIDEO_EXTS
_se.AUDIO_EXTS = AUDIO_EXTS
_se.SUPPORTED_EXTS = SUPPORTED_EXTS
# cpr helpers
_se._scan_bytes = _scan_bytes
_se._scan_bytes_timeout = _scan_bytes_timeout
_se._detect_photo_faces = _detect_photo_faces
_se._extract_exif = _extract_exif
_se._extract_video_metadata = _extract_video_metadata
_se._extract_audio_metadata = _extract_audio_metadata
_se._make_thumb = _make_thumb
_se._placeholder_svg = _placeholder_svg
_se._check_special_category = _check_special_category
@ -313,6 +317,11 @@ app = Flask(__name__,
template_folder=_os.path.join(_BASE_DIR, "templates"),
static_folder=_os.path.join(_BASE_DIR, "static"))
# Static files must revalidate on every load (cheap 304s via ETag). Without
# this there is no Cache-Control header and browsers cache JS/CSS heuristically
# for days — after a self-update the backend is new but the UI stays stale.
app.config["SEND_FILE_MAX_AGE_DEFAULT"] = 0
# Session secret — derived from machine_id so it survives restarts without a separate file.
# machine_id is also the Fernet key (base64-encoded 32 bytes); we use its raw bytes as the secret.
try:
@ -368,7 +377,72 @@ def _sync_state():
# JavaScript served from static/app.js via Flask static file handling.
# ── Auth state ─────────────────────────────────────────────────────────────────
# ── Interface PIN auth ────────────────────────────────────────────────────────
_iface_pin_attempts: dict[str, list[float]] = {}
_IFACE_MAX_ATTEMPTS = 5
_IFACE_WINDOW_S = 300
def _iface_rate_limited(ip: str) -> bool:
now = time.time()
times = [t for t in _iface_pin_attempts.get(ip, []) if now - t < _IFACE_WINDOW_S]
_iface_pin_attempts[ip] = times
return len(times) >= _IFACE_MAX_ATTEMPTS
@app.before_request
def _require_interface_pin():
from app_config import get_interface_pin_hash
if not get_interface_pin_hash():
return # feature disabled — open access
path = request.path
# Always-exempt paths
if (path.startswith("/static/")
or path in ("/login", "/view", "/manual", "/favicon.ico")
or path == "/api/interface/pin/verify"
or path == "/api/viewer/pin/verify"):
return
# Authenticated sessions (interface or viewer) pass through
if session.get("interface_ok") or session.get("viewer_ok"):
return
if path.startswith("/api/"):
return jsonify({"error": "authentication required"}), 401
return redirect("/login")
@app.route("/login")
def login_page():
from app_config import get_interface_pin_hash
if not get_interface_pin_hash():
return redirect("/")
if session.get("interface_ok"):
return redirect("/")
return render_template("interface_login.html", LANG=LANG)
@app.route("/api/interface/pin/verify", methods=["POST"])
def interface_pin_verify():
from app_config import verify_interface_pin
ip = request.remote_addr or "unknown"
if _iface_rate_limited(ip):
return jsonify({"error": "Too many failed attempts. Try again later."}), 429
body = request.get_json(silent=True) or {}
pin = str(body.get("pin", "")).strip()
if not verify_interface_pin(pin):
_iface_pin_attempts.setdefault(ip, []).append(time.time())
return jsonify({"error": "Incorrect PIN"}), 401
_iface_pin_attempts.pop(ip, None)
session["interface_ok"] = True
return jsonify({"ok": True})
@app.route("/api/interface/logout", methods=["POST"])
def interface_logout():
session.pop("interface_ok", None)
return jsonify({"ok": True})
# ── Routes ────────────────────────────────────────────────────────────────────
@app.route("/")
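Review note: the PIN throttle is a sliding window over failed attempts per client address. The sketch below reproduces the logic outside Flask with an injectable clock so it can be tested deterministically. `rate_limited`, `record_failure` and the `now` parameter are hypothetical; the limits match the constants in the patch.

```python
import time
from typing import Optional

MAX_ATTEMPTS = 5
WINDOW_S = 300
_attempts: dict[str, list[float]] = {}


def rate_limited(ip: str, now: Optional[float] = None) -> bool:
    """True once MAX_ATTEMPTS failures fall inside the last WINDOW_S seconds."""
    now = time.time() if now is None else now
    # Drop failures that have aged out of the window before counting.
    recent = [t for t in _attempts.get(ip, []) if now - t < WINDOW_S]
    _attempts[ip] = recent
    return len(recent) >= MAX_ATTEMPTS


def record_failure(ip: str, now: Optional[float] = None) -> None:
    """Remember one failed attempt for this address."""
    _attempts.setdefault(ip, []).append(time.time() if now is None else now)
```

The state lives in process memory and is keyed by `request.remote_addr`, so it resets on restart, and behind a reverse proxy all clients may share one key unless the proxy address is unwrapped first.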
@ -383,17 +457,21 @@ def viewer():
from app_config import validate_viewer_token, get_viewer_pin_hash
token = request.args.get("token", "").strip()
if token:
if validate_viewer_token(token) is None:
entry = validate_viewer_token(token)
if entry is None:
return render_template("viewer_denied.html"), 403
# Bind a session so the viewer doesn't need the token on every navigation
session["viewer_ok"] = True
session["viewer_scope"] = entry.get("scope", {})
return render_template("index.html", app_version=APP_VERSION,
lang_json=json.dumps(LANG, ensure_ascii=False),
viewer_mode=True)
viewer_mode=True,
viewer_scope=json.dumps(entry.get("scope", {}), ensure_ascii=False))
if session.get("viewer_ok"):
return render_template("index.html", app_version=APP_VERSION,
lang_json=json.dumps(LANG, ensure_ascii=False),
viewer_mode=True)
viewer_mode=True,
viewer_scope=json.dumps(session.get("viewer_scope", {}), ensure_ascii=False))
# No token, no session — show PIN form if a PIN is configured, else deny
pin_hash = get_viewer_pin_hash()
if pin_hash:
@ -1499,10 +1577,11 @@ from routes.scheduler import bp as scheduler_bp
from routes.google_auth import bp as google_auth_bp
from routes.google_scan import bp as google_scan_bp
from routes.viewer import bp as viewer_bp
from routes.updates import bp as updates_bp
for _bp in [auth_bp, users_bp, scan_bp, sources_bp, profiles_bp,
email_bp, database_bp, export_bp, app_routes_bp, scheduler_bp,
google_auth_bp, google_scan_bp, viewer_bp]:
google_auth_bp, google_scan_bp, viewer_bp, updates_bp]:
app.register_blueprint(_bp)
# ── Entry point ───────────────────────────────────────────────────────────────
@ -1519,10 +1598,10 @@ Headless (scheduled) usage:
environment variables: M365_CLIENT_ID, M365_TENANT_ID, M365_CLIENT_SECRET
or a settings JSON: --settings /path/to/settings.json
Scan options are loaded from ~/.gdpr_scanner_settings.json (saved automatically
Scan options are loaded from ~/.gdprscanner/settings.json (saved automatically
after any interactive scan), or overridden in the --settings file.
SMTP config is loaded from ~/.gdpr_scanner_smtp.json (saved in the UI) or from
SMTP config is loaded from ~/.gdprscanner/smtp.json (saved in the UI) or from
an 'smtp' key in the --settings file.
Example cron (weekly, Mondays at 06:00):
@ -1557,7 +1636,7 @@ Example --settings file with SMTP:
parser.add_argument("--output", default=".",
help="Output directory for Excel export in headless mode (default: .)")
parser.add_argument("--settings", default=None,
help="Path to a JSON settings file (overrides ~/.gdpr_scanner_settings.json)")
help="Path to a JSON settings file (overrides ~/.gdprscanner/settings.json)")
parser.add_argument("--email-to", default=None,
help="Comma-separated recipient addresses — send Excel report by email (headless only)")
parser.add_argument("--retention-years", type=int, default=None,
@ -1565,7 +1644,7 @@ Example --settings file with SMTP:
parser.add_argument("--fiscal-year-end", default=None,
help="Fiscal year end as MM-DD for retention cutoff (e.g. 12-31 for Bogforingsloven). Omit for rolling window.")
parser.add_argument("--reset-db", action="store_true",
help="Reset the results database (~/.gdpr_scanner.db) — permanently deletes all scan history, "
help="Reset the results database (~/.gdprscanner/scanner.db) — permanently deletes all scan history, "
"dispositions, and deletion log. Prompts for confirmation unless --yes is also passed.")
parser.add_argument("--yes", action="store_true",
help="Skip confirmation prompts (use with --reset-db for scripted resets)")
@ -1769,7 +1848,7 @@ Example --settings file with SMTP:
(_SETTINGS_PATH, "Headless scan settings"),
(_ROLE_OVERRIDES_PATH, "Manual role overrides"),
(_FILE_SOURCES_PATH, "File source definitions"),
(_CHECKPOINT_PATH, "Scan checkpoint (resume state)"),
(_cp_path("m365"), "Scan checkpoint (resume state)"),
(_DELTA_PATH, "Delta scan tokens"),
(_LANG_OVERRIDE_FILE, "Language preference"),
(Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@ -1856,10 +1935,12 @@ Example --settings file with SMTP:
print(" ✖ m365_db not available — cannot reset")
_sys.exit(1)
# Also clear the JSON checkpoint so the UI starts with no cached results
_clear_checkpoint()
if not _CHECKPOINT_PATH.exists():
print(f" ✔ Checkpoint cleared")
# Also clear all checkpoints so the UI starts with no cached results
from pathlib import Path as _Path
for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
try: _cpf.unlink()
except Exception: pass
print(f" ✔ Checkpoints cleared")
# Clear delta tokens too — stale after a full DB reset
if _DELTA_PATH.exists():
@ -2068,7 +2149,7 @@ Example --settings file with SMTP:
email_to = getattr(args, "email_to", None)
if email_to:
recipients = [r.strip() for r in email_to.replace(";", ",").split(",") if r.strip()]
# SMTP config: --settings file takes priority, then saved ~/.gdpr_scanner_smtp.json
# SMTP config: --settings file takes priority, then saved ~/.gdprscanner/smtp.json
smtp_cfg = _load_smtp_config()
if cfg.get("smtp"):
smtp_cfg = {**smtp_cfg, **cfg["smtp"]}
@ -2185,14 +2266,33 @@ Example --settings file with SMTP:
# Find a free port — auto-increment from the requested port if in use.
import socket as _socket
def _find_free_port(start: int, host: str) -> int:
for p in range(start, start + 100):
def _can_bind(p: int, host: str) -> bool:
with _socket.socket(_socket.AF_INET, _socket.SOCK_STREAM) as s:
# Probe with SO_REUSEADDR, matching how Werkzeug binds.
# Without it, connections left in TIME_WAIT by a previous
# instance (e.g. the in-app update restart) make the port
# look occupied and the app silently moves to the next one.
s.setsockopt(_socket.SOL_SOCKET, _socket.SO_REUSEADDR, 1)
try:
s.bind((host, p))
return p
return True
except OSError:
continue
return False
def _find_free_port(start: int, host: str) -> int:
# Give the requested port a grace period — after a self-restart
# the previous process may not have released it yet.
deadline = time.time() + 10
while True:
if _can_bind(start, host):
return start
if time.time() >= deadline:
break
time.sleep(0.5)
for p in range(start + 1, start + 100):
if _can_bind(p, host):
return p
raise RuntimeError(f"No free port found in range {start}–{start + 99}")
actual_port = _find_free_port(args.port, args.host)
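Note on the hunk above: the probe-then-fall-back logic can be tried in isolation. The following is a minimal standalone sketch, not the project's code. The names `can_bind` and `find_free_port` and the `grace`, `poll` and `span` parameters are illustrative assumptions; the diff hard-codes 10 seconds, 0.5 seconds and 100 ports.

```python
import socket
import time


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket with SO_REUSEADDR set can bind (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR keeps lingering TIME_WAIT connections from making
        # the port look occupied, as described in the diff's comment.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def find_free_port(start: int, host: str = "127.0.0.1",
                   grace: float = 10.0, poll: float = 0.5,
                   span: int = 100) -> int:
    """Wait up to `grace` seconds for `start`, then try the next span-1 ports."""
    deadline = time.monotonic() + grace
    while True:
        if can_bind(start, host):
            return start
        if time.monotonic() >= deadline:
            break
        time.sleep(poll)
    # The requested port never became free; scan upward from start + 1.
    for port in range(start + 1, start + span):
        if can_bind(port, host):
            return port
    raise RuntimeError(f"No free port found in range {start}-{start + span - 1}")
```

The sketch uses `time.monotonic()` for the deadline so a wall-clock adjustment cannot stretch or shorten the grace period; the diff itself uses `time.time()`.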
@ -2205,6 +2305,19 @@ Example --settings file with SMTP:
print(f"\n GDPRScanner\n ──────────────────────────────")
print(f" Open: http://{args.host}:{args.port}")
# Recover scans left unfinished by a crash / kill / mid-scan restart.
# Nothing is scanning at startup, so any scan with finished_at IS NULL is
# dead; finalising it makes its already-saved items visible again instead
# of stranding them (both get_session_items and get_open_items require a
# finished scan). Must run before the scheduler can start a new scan.
try:
if DB_OK:
_recovered = _get_db().finalize_orphan_scans()
if _recovered:
print(f" Recovered {_recovered} unfinished scan(s) from a prior restart")
except Exception as _orphan_err:
print(f" Orphan-scan recovery: failed ({_orphan_err})")
# Start in-process scheduler (#19)
try:
import scan_scheduler as _sched_mod
@ -2221,5 +2334,14 @@ Example --settings file with SMTP:
except Exception as _sched_err:
print(f" Scheduler: failed to start ({_sched_err})")
# Auto-update background thread (Settings → General → Software update)
try:
from routes.updates import start_auto_update_thread
from app_config import get_update_config as _get_upd_cfg
if start_auto_update_thread() and _get_upd_cfg().get("auto_update"):
print(" Auto-update: enabled (checked daily)")
except Exception as _upd_err:
print(f" Auto-update: failed to start ({_upd_err})")
print(f" Press Ctrl+C to stop\n")
app.run(host=args.host, port=args.port, debug=False, threaded=True)
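Note on the orphan-scan recovery step above: the body of `finalize_orphan_scans()` is not part of this diff. The sketch below shows one way such a step could work, assuming a hypothetical `scans` table with a nullable `finished_at` column. The table layout and the use of `time.time()` as the finish stamp are assumptions for illustration only.

```python
import sqlite3
import time


def finalize_orphan_scans(conn: sqlite3.Connection) -> int:
    """Stamp finished_at on every scan that never finished; return the count."""
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (time.time(),),
    )
    conn.commit()
    return cur.rowcount


# Demo on an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
)
conn.executemany(
    "INSERT INTO scans (started_at, finished_at) VALUES (?, ?)",
    [(1.0, 2.0), (3.0, None), (4.0, None)],
)
recovered = finalize_orphan_scans(conn)
print(f"Recovered {recovered} unfinished scan(s)")
```

Running the update once at startup, before anything can begin a new scan, is what makes the "finished_at IS NULL means dead" assumption safe.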


@ -70,6 +70,9 @@ GMAIL_SCOPES = [
DRIVE_SCOPES = [
"https://www.googleapis.com/auth/drive.readonly",
]
DRIVE_WRITE_SCOPES = [
"https://www.googleapis.com/auth/drive",
]
ADMIN_SCOPES = [
"https://www.googleapis.com/auth/admin.directory.user.readonly",
]
@ -260,6 +263,50 @@ class GoogleConnector:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
yield from _drive_iter(service, user_email, max_files, max_file_mb)
def get_drive_start_token(self, user_email: str) -> str:
"""Return the current Changes API start page token for user's Drive."""
try:
creds = self._creds_for(user_email, DRIVE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
return _drive_get_start_page_token(service)
def get_drive_changes(
self,
user_email: str,
page_token: str,
max_files: int = 5000,
max_file_mb: float = 50.0,
) -> "tuple[list[tuple[dict, bytes]], str]":
"""Return (changed_files, new_page_token) since page_token."""
try:
creds = self._creds_for(user_email, DRIVE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)
# ── Drive write-back (redaction) ──────────────────────────────────────────
def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _get_drive_file_mime(service, file_id)
def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _download_drive_file_by_id(service, file_id)
def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
_update_drive_file_content(service, file_id, content, mime_type)
# ── Persistence helpers ───────────────────────────────────────────────────────
@ -412,6 +459,101 @@ def _gmail_iter(
yield (att_meta, data)
def _download_drive_file(
service,
f: dict,
user_email: str,
max_bytes: int,
) -> "tuple[dict, bytes] | None":
"""Download one Drive file entry. Returns (meta, data) or None if skipped."""
mime = f.get("mimeType", "")
fid = f.get("id", "")
fname = f.get("name", "")
size = int(f.get("size", 0) or 0)
meta = {
"id": f"gdrive:{fid}",
"name": fname,
"_source": "gdrive",
"_source_type": "gdrive",
"_account": user_email,
"_account_id": user_email,
"_url": f.get("webViewLink", ""),
"lastModifiedDateTime": f.get("modifiedTime", "")[:10],
"size": size,
}
if mime in _EXPORT_MAP:
export_mime, ext = _EXPORT_MAP[mime]
try:
req = service.files().export_media(fileId=fid, mimeType=export_mime)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
total = 0
while not done:
_, done = dl.next_chunk()
total = buf.tell()
if total > _MAX_EXPORT_BYTES:
break
if total > _MAX_EXPORT_BYTES:
return None
meta["name"] = fname + ext
meta["size"] = total
data = buf.getvalue()
del buf
return (meta, data)
except HttpError as e:
if "exportSizeLimitExceeded" in str(e):
print(
f"[gdrive] skip '{fname}' — file too large for Google export API"
f" (exportSizeLimitExceeded); fid={fid}",
flush=True,
)
return None
else:
if mime.startswith("application/vnd.google-apps."):
return None
if size == 0 or size > max_bytes:
return None
try:
req = service.files().get_media(fileId=fid)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
while not done:
_, done = dl.next_chunk()
data = buf.getvalue()
del buf
return (meta, data)
except HttpError:
return None
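Note on the export branch of `_download_drive_file`: the size cap is checked after each chunk, so an oversized export is abandoned early instead of being buffered in full. The sketch below isolates that loop. `FakeDownloader` is a hypothetical stand-in that only mimics the `(status, done)` return shape of `next_chunk()`; it is not the Google client class.

```python
import io


class FakeDownloader:
    """Hypothetical chunked downloader: writes fixed-size chunks into a buffer."""

    def __init__(self, buf: io.BytesIO, payload: bytes, chunksize: int):
        self._buf = buf
        self._payload = payload
        self._chunksize = chunksize
        self._pos = 0

    def next_chunk(self):
        chunk = self._payload[self._pos:self._pos + self._chunksize]
        self._buf.write(chunk)
        self._pos += len(chunk)
        # Mirror the (status, done) shape used in the diff; status is unused.
        return None, self._pos >= len(self._payload)


def download_capped(payload: bytes, chunksize: int, max_bytes: int):
    """Return the full payload, or None once the buffer exceeds max_bytes."""
    buf = io.BytesIO()
    dl = FakeDownloader(buf, payload, chunksize)
    done = False
    while not done:
        _, done = dl.next_chunk()
        if buf.tell() > max_bytes:
            return None  # skip the file, as the diff does for large exports
    return buf.getvalue()
```

Because the check runs after a chunk is written, the buffer can overshoot the cap by up to one chunk before the download is dropped.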
def _get_drive_file_mime(service, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
info = service.files().get(fileId=file_id, fields="mimeType").execute()
return info.get("mimeType", "")
def _download_drive_file_by_id(service, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
req = service.files().get_media(fileId=file_id)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
while not done:
_, done = dl.next_chunk()
return buf.getvalue()
def _update_drive_file_content(service, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace a Drive file's content in-place."""
from googleapiclient.http import MediaInMemoryUpload
media = MediaInMemoryUpload(content, mimetype=mime_type, resumable=False)
service.files().update(fileId=file_id, media_body=media).execute()
def _drive_iter(
service,
user_email: str,
@ -439,74 +581,77 @@ def _drive_iter(
for f in resp.get("files", []):
fetched += 1
mime = f.get("mimeType", "")
fid = f.get("id", "")
fname = f.get("name", "")
size = int(f.get("size", 0) or 0)
meta = {
"id": f"gdrive:{fid}",
"name": fname,
"_source": "gdrive",
"_source_type": "gdrive",
"_account": user_email,
"_account_id": user_email,
"_url": f.get("webViewLink", ""),
"lastModifiedDateTime": f.get("modifiedTime", "")[:10],
"size": size,
}
if mime in _EXPORT_MAP:
export_mime, ext = _EXPORT_MAP[mime]
try:
req = service.files().export_media(fileId=fid, mimeType=export_mime)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
total = 0
while not done:
status, done = dl.next_chunk()
total = buf.tell()
if total > _MAX_EXPORT_BYTES:
break
if total > _MAX_EXPORT_BYTES:
continue
meta["name"] = fname + ext
meta["size"] = total
data = buf.getvalue()
del buf
yield (meta, data)
except HttpError as e:
if "exportSizeLimitExceeded" in str(e):
print(
f"[gdrive] skip '{fname}' — file too large for Google export API"
f" (exportSizeLimitExceeded); fid={fid}",
flush=True,
)
continue
else:
if mime.startswith("application/vnd.google-apps."):
continue # other native formats we can't export — skip
if size == 0 or size > max_bytes:
continue
try:
req = service.files().get_media(fileId=fid)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
while not done:
_, done = dl.next_chunk()
data = buf.getvalue()
del buf
yield (meta, data)
except HttpError:
continue
result = _download_drive_file(service, f, user_email, max_bytes)
if result:
yield result
page_token = resp.get("nextPageToken")
if not page_token:
break
def _drive_get_start_page_token(service) -> str:
"""Return the current Changes API start page token for this Drive."""
resp = service.changes().getStartPageToken().execute()
return resp["startPageToken"]
def _drive_changes_collect(
service,
user_email: str,
page_token: str,
max_files: int,
max_file_mb: float,
) -> "tuple[list[tuple[dict, bytes]], str]":
"""
Collect Drive changes since page_token using the Changes API.
Returns (list_of_(meta, data)_tuples, new_start_page_token).
Skips removed/trashed files.
Raises GoogleError on API failure so the caller can fall back to a full scan.
"""
max_bytes = int(max_file_mb * 1024 * 1024)
fields = (
"nextPageToken,newStartPageToken,"
"changes(removed,file(id,name,mimeType,size,webViewLink,modifiedTime,owners,parents))"
)
results: list = []
new_token = page_token
fetched = 0
while fetched < max_files:
params: dict = {
"pageToken": page_token,
"spaces": "drive",
"fields": fields,
"includeRemoved": True,
"pageSize": min(1000, max_files - fetched),
}
try:
resp = service.changes().list(**params).execute()
except HttpError as e:
raise GoogleError(f"Drive changes error for {user_email}: {e}") from e
for change in resp.get("changes", []):
if change.get("removed"):
continue
f = change.get("file")
if not f:
continue
fetched += 1
result = _download_drive_file(service, f, user_email, max_bytes)
if result:
results.append(result)
if "newStartPageToken" in resp:
new_token = resp["newStartPageToken"]
break
page_token = resp.get("nextPageToken")
if not page_token:
break
return results, new_token
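Note on `_drive_changes_collect`: the token handling is the subtle part, so here is a minimal sketch of the same pagination loop with the download step removed. `collect_changes` and the `list_page` callable are illustrative assumptions; the page dictionaries only imitate the `changes`, `nextPageToken` and `newStartPageToken` keys used in the diff.

```python
from typing import Callable


def collect_changes(list_page: Callable[[str], dict], page_token: str, max_files: int):
    """Follow nextPageToken until newStartPageToken appears or max_files is hit."""
    files = []
    new_token = page_token
    fetched = 0
    while fetched < max_files:
        resp = list_page(page_token)
        for change in resp.get("changes", []):
            if change.get("removed"):
                continue  # deleted or trashed entries carry no file to scan
            f = change.get("file")
            if not f:
                continue
            fetched += 1
            files.append(f)
        if "newStartPageToken" in resp:
            # Last page: this token is the starting point for the next delta.
            new_token = resp["newStartPageToken"]
            break
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return files, new_token


# Two fake pages keyed by the token that requests them.
pages = {
    "t0": {"changes": [{"file": {"id": "a"}}, {"removed": True}], "nextPageToken": "t1"},
    "t1": {"changes": [{"file": {"id": "c"}}], "newStartPageToken": "t2"},
}
files, token = collect_changes(pages.__getitem__, "t0", max_files=100)
print([f["id"] for f in files], token)
```

One consequence visible in the sketch: if `max_files` stops the loop before the last page, the returned token is the original one, so the next delta run starts again from the same point rather than skipping unseen changes.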
# ── Personal Google account (OAuth device-code) connector ────────────────────
class PersonalGoogleConnector:
@ -621,6 +766,50 @@ class PersonalGoogleConnector:
raise GoogleError(f"Drive auth failed: {e}") from e
yield from _drive_iter(service, user_email, max_files, max_file_mb)
def get_drive_start_token(self, user_email: str) -> str:
"""Return the current Changes API start page token for this Drive."""
self._refresh_if_needed()
try:
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed: {e}") from e
return _drive_get_start_page_token(service)
def get_drive_changes(
self,
user_email: str,
page_token: str,
max_files: int = 5000,
max_file_mb: float = 50.0,
) -> "tuple[list[tuple[dict, bytes]], str]":
"""Return (changed_files, new_page_token) since page_token."""
self._refresh_if_needed()
try:
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed: {e}") from e
return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)
# ── Drive write-back (redaction) ──────────────────────────────────────────
def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
return _get_drive_file_mime(service, file_id)
def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
return _download_drive_file_by_id(service, file_id)
def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
_update_drive_file_content(service, file_id, content, mime_type)
@staticmethod
def get_device_code_flow(client_id: str, client_secret: str) -> dict:
"""


@ -103,6 +103,13 @@
"lbl_time": "Tid",
"lbl_space": "Mellemrum",
"lbl_loading": "Indlæser…",
"history_lbl": "Historik",
"history_items": "fund",
"history_btn_sessions": "Sessioner",
"history_btn_latest": "Åbne fund",
"history_picker_empty": "Ingen tidligere scanninger",
"history_delta_badge": "Delta",
"history_latest_badge": "Seneste",
"lbl_blurred": "Sløret",
"lbl_none": "Ingen",
"lbl_scanner": "Scanner",
@ -341,8 +348,9 @@
"m365_resuming": "Genoptager — springer allerede skannede elementer over…",
"m365_opt_delta": "Delta-scanning",
"m365_opt_delta_hint": "Kun ændrede elementer (efter første fulde scanning)",
"m365_delta_tokens_saved": "Tokens gemt",
"m365_delta_tokens_saved": "Tokens gemt for {n} kilde(r)",
"m365_delta_clear": "Ryd tokens",
"m365_delta_tokens_hint": "Gemte ændringstokens gør, at delta-scanninger kun henter elementer ændret siden sidste scanning. Ryd tokens tvinger næste scanning til at være en fuld scanning.",
"m365_delta_cleared": "Delta-tokens ryddet — næste scanning bliver fuld scanning.",
"m365_delta_mode": "Delta-tilstand — henter kun ændrede elementer…",
"m365_smtp_title": "✉ Send rapport",
@ -357,6 +365,8 @@
"m365_smtp_recipients": "Modtagere",
"m365_smtp_recipients_hint": "Adskil med komma eller semikolon",
"m365_smtp_save": "Gem",
"m365_smtp_auto_email_manual": "Send rapport efter manuel scanning",
"m365_smtp_prefer_smtp": "Send altid via SMTP (spring Microsoft Graph over)",
"m365_smtp_send": "Send nu",
"m365_smtp_saved": "Indstillinger gemt.",
"m365_smtp_sending": "Sender…",
@ -551,15 +561,32 @@
"m365_db_import_mode": "Tilstand:",
"m365_db_import_merge": "Sammenflet (sikker)",
"m365_db_import_replace": "Erstat (fuld gendannelse)",
"m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdpr_scanner.db først.",
"m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdpr_scanner.db.\\n\\nFortsæt?",
"m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdprscanner/scanner.db først.",
"m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdprscanner/scanner.db.\\n\\nFortsæt?",
"m365_db_import_no_file": "Vælg venligst en ZIP-fil først.",
"m365_db_importing": "Importerer…",
"m365_db_imported": "Importeret",
"m365_db_import_run": "Importer",
"m365_opt_scan_photos": "Søg efter ansigter i billeder",
"m365_opt_scan_photos_hint": "Markerer billeder med registrerede ansigter som Art. 9 biometriske data. Langsommere — aktivér efter behov.",
"m365_opt_skip_gps": "Ignorer GPS i billeder",
"m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.",
"m365_opt_min_cpr": "Min. CPR-antal pr. fil",
"m365_opt_scan_emails": "Søg efter e-mailadresser",
"m365_opt_scan_emails_hint": "Flagger filer med e-mailadresser. Slået fra som standard — e-mailadresser er meget almindelige og kan give mange resultater.",
"m365_opt_scan_phones": "Søg efter telefonnumre",
"m365_opt_scan_phones_hint": "Flagger filer med danske telefonnumre (8 cifre). Nyttigt til at finde kontaktlister og forældrekorrespondance.",
"m365_badge_emails": "e-mail",
"m365_badge_phones": "tlf.",
"m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
"m365_opt_cpr_only": "Kun CPR-tilstand",
"m365_opt_cpr_only_hint": "Flagger kun filer med CPR-numre. Filer med kun e-mailadresser, telefonnumre, ansigter eller EXIF-metadata ignoreres.",
"m365_opt_ocr_lang": "OCR-sprog",
"m365_opt_ocr_lang_hint": "Tesseract-sprogpakke(r) der bruges ved scanning af scannede PDF'er og billeder. Sprogpakker skal være installeret på serveren (f.eks. tesseract-ocr-dan). Flere pakker: dan+eng.",
"m365_filter_photo_only": "📷 Billeder / biometrisk",
"m365_filter_all_roles": "Alle roller",
"m365_filter_staff": "Ansatte",
"m365_filter_student": "Elever",
"m365_badge_faces": "ansigter",
"a30_photo_items": "Billeder med registrerede ansigter (Art. 9 biometrisk)",
"a30_photo_note": "Fotografier af identificerbare personer er biometriske data i henhold til Art. 9 GDPR. Opbevaring kræver et dokumenteret retsgrundlag i henhold til Art. 9(2). For skolefotografier af elever under 15 år er forældrenes samtykke påkrævet (Databeskyttelsesloven §6). Se Datatilsynets vejledning om fotografering i skoler.",
@ -583,16 +610,47 @@
"m365_file_sources_empty": "Ingen filkilder konfigureret. Tilføj en lokal mappe eller netværksdeling nedenfor.",
"m365_file_sources_add": "Tilføj kilde",
"m365_fsrc_label": "Betegnelse",
"m365_fsrc_name": "Navn",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Sti",
"m365_fsrc_smb_detected": "SMB/CIFS-netværksdeling registreret",
"m365_fsrc_smb_host": "SMB-vært",
"m365_fsrc_smb_user": "Brugernavn",
"m365_fsrc_smb_pw": "Adgangskode",
"m365_fsrc_smb_pw_hint": "Adgangskoden gemmes i nøglekæden — aldrig i en fil.",
"m365_fsrc_pw_keychain_placeholder": "Gemt i OS-nøglering",
"m365_fsrc_add_btn": "Tilføj",
"m365_fsrc_saved": "Kilde gemt",
"m365_fsrc_saving": "Gemmer...",
"m365_fsrc_path_required": "Sti er påkrævet.",
"m365_fsrc_type_local": "Lokal mappe",
"m365_fsrc_type_smb": "Netværksdrev (SMB)",
"m365_fsrc_type_sftp": "SFTP-server",
"m365_fsrc_sftp_host": "SFTP-host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Brugernavn",
"m365_fsrc_sftp_remote_path": "Fjernsti",
"m365_fsrc_sftp_auth_password": "Adgangskode",
"m365_fsrc_sftp_auth_key": "SSH-nøgle",
"m365_fsrc_sftp_pw": "Adgangskode",
"m365_fsrc_sftp_pw_hint": "Adgangskoden gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_key_upload": "Privat nøglefil",
"m365_fsrc_sftp_key_btn": "Upload nøgle",
"m365_fsrc_sftp_key_uploaded": "Nøgle uploadet",
"m365_fsrc_sftp_passphrase": "Adgangssætning (hvis nøglen er krypteret)",
"m365_fsrc_sftp_passphrase_hint": "Adgangssætningen gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_not_installed": "paramiko er ikke installeret — kør: pip install paramiko",
"m365_fsrc_name_placeholder": "f.eks. Lærerfiler, NAS-arkiv",
"m365_fsrc_path_placeholder": "~/Dokumenter eller //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.skole.dk",
"m365_fsrc_smb_user_placeholder": "DOMÆNE\\brugernavn",
"m365_fsrc_smb_user_edit_placeholder": "DOMÆNE\\brugernavn eller brugernavn",
"m365_fsrc_sftp_host_placeholder": "sftp.skole.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Lad stå tomt hvis nøglen ikke er krypteret",
"m365_fsrc_sftp_host_required": "SFTP-host er påkrævet.",
"m365_fsrc_sftp_user_required": "SFTP-brugernavn er påkrævet.",
"m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starter filscanning",
"m365_src_group_files": "Filkilder",
@ -619,6 +677,14 @@
"m365_settings_tab_general": "Generelt",
"m365_settings_tab_email": "E-mailrapport",
"m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Revisionslog",
"m365_audit_title": "Compliance-revisionslog",
"m365_audit_col_time": "Tidspunkt",
"m365_audit_col_action": "Handling",
"m365_audit_col_detail": "Detalje",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Indlæser…",
"m365_audit_empty": "Ingen revisionsbegivenheder registreret endnu.",
"m365_settings_appearance": "Udseende",
"m365_settings_language": "Sprog",
"m365_settings_theme": "Tema",
@ -655,7 +721,23 @@
"m365_smtp_test": "Test",
"m365_smtp_testing": "Sender test-email…",
"m365_smtp_test_ok": "Test-email sendt",
"m365_smtp_test_ok_graph": "Test-email sendt via Microsoft Graph til",
"m365_smtp_test_ok_smtp": "Test-email sendt via SMTP til",
"m365_smtp_graph_also_failed": "(⚠ Graph mislykkedes også — Mail.Send ikke tildelt)",
"m365_smtp_test_fail": "Forbindelse mislykkedes",
"bulk_select_mode": "Vælg",
"bulk_select_all": "Vælg alle synlige",
"bulk_deselect_all": "Fravælg alle",
"bulk_apply": "Anvend",
"bulk_done": "Afslut",
"bulk_selected": "valgt",
"bulk_applied": "opdateret",
"disp_stats_total": "total",
"disp_stats_unreviewed": "ikke gennemgået",
"disp_stats_retain": "behold",
"disp_stats_delete": "slet",
"disp_stats_other": "andet",
"disp_stats_reviewed": "gennemgået",
"m365_fsrc_edit_btn": "Rediger",
"m365_fsrc_save_changes": "Gem ændringer",
"m365_settings_tab_scheduler": "Planlægger",
@ -673,6 +755,8 @@
"m365_sched_after_scan": "Efter scanning",
"m365_sched_auto_email": "Send rapport automatisk",
"m365_sched_auto_retention": "Håndhæv opbevaringspolitik",
"m365_sched_report_only": "Kun rapport",
"m365_sched_report_only_hint": "Send de seneste scanningsresultater uden at køre en ny scanning. Kræver scanningsresultater i databasen.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Kør nu",
"m365_sched_add": "+ Tilføj planlagt scanning",
@ -681,6 +765,9 @@
"m365_sched_editor_edit": "Rediger planlagt scanning",
"m365_sched_name_required": "Navn er påkrævet",
"m365_sched_no_runs": "Ingen planlagte kørsler endnu",
"m365_sched_no_jobs": "Ingen planlagte scanninger endnu.",
"m365_sched_running": "Kører...",
"m365_sched_disabled": "Deaktiveret",
"m365_sched_freq_daily": "Dagligt",
"m365_sched_freq_weekly": "Ugentligt",
"m365_sched_freq_monthly": "Månedligt",
@ -728,9 +815,7 @@
"role_staff": "Ansat",
"role_student": "Elev",
"role_other": "Anden",
"m365_settings_tab_security": "Sikkerhed",
"share_modal_title": "Del resultater",
"share_modal_desc": "Skrivebeskyttede links lader en DPO eller gennemganger se resultater og tilknytte dispositioner uden adgang til scanningskontroller eller legitimationsoplysninger.",
"share_new_link": "Nyt link",
@ -759,7 +844,18 @@
"share_create_error": "Kunne ikke oprette link:",
"share_revoke_confirm": "Tilbagekald dette link? Alle der bruger det, mister straks adgang.",
"share_revoke_error": "Kunne ikke tilbagekalde:",
"share_scope_lbl": "Omfang",
"share_scope_all": "Alle",
"share_scope_type_role": "Rolle",
"share_scope_type_user": "Bruger",
"share_date_from": "Emner fra",
"share_date_to": "Emner til og med",
"share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Brugerens e-mail",
"share_scope_user_placeholder": "alice@skole.dk",
"share_scope_user_invalid": "Angiv venligst en gyldig e-mailadresse for brugeromfanget.",
"share_scope_staff": "Ansatte",
"share_scope_student": "Elever",
"viewer_pin_group_title": "Seerens PIN",
"viewer_pin_desc": "En numerisk PIN (48 cifre), der lader alle åbne <code style=\"font-size:10px\">/view</code> i en browser for skrivebeskyttet adgang til resultater uden et token-link.",
"viewer_pin_clear": "Ryd PIN",
@ -769,5 +865,44 @@
"viewer_pin_saving": "Gemmer…",
"viewer_pin_saved": "PIN gemt",
"viewer_pin_clear_confirm": "Fjern seerens PIN? /view vil igen kræve et token-link.",
"viewer_pin_cleared": "PIN ryddet"
"viewer_pin_cleared": "PIN ryddet",
"interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "En numerisk PIN-kode (48 cifre), der skal indtastes, inden man får adgang til selve scanneren. Seere, der tilgår <code style=\"font-size:10px\">/view</code>, er ikke berørt.",
"interface_pin_clear": "Ryd PIN",
"interface_pin_is_set": "Interface-PIN er angivet",
"interface_pin_not_set_msg": "Ingen PIN angivet — grænsefladen er åben for alle på netværket",
"interface_pin_saved": "PIN gemt",
"interface_pin_clear_confirm": "Fjern interface-PIN? Scanneren vil herefter være tilgængelig for alle på netværket.",
"interface_pin_cleared": "PIN ryddet",
"interface_pin_login_desc": "Indtast interface-PIN for at fortsætte.",
"interface_pin_login_btn": "Fortsæt",
"interface_pin_err_incorrect": "Forkert PIN.",
"interface_pin_err_too_many": "For mange forsøg. Prøv igen om lidt.",
"interface_pin_err_network": "Netværksfejl. Prøv igen.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-forbedret navnegenkendelse",
"m365_ai_desc": "Brug Claude AI i stedet for spaCy til navn-, adresse- og organisationsgenkendelse. Betydeligt mere nøjagtig på dansk tekst — særligt dobbeltefternavne og fremmedsprogede navne. Kræver en Anthropic API-nøgle; faktureres pr. token.",
"m365_ai_enable": "Aktiver Claude NER",
"m365_ai_api_key_label": "Anthropic API-nøgle",
"m365_ai_show_key": "Vis",
"m365_ai_hide_key": "Skjul",
"m365_ai_key_set": "API-nøgle gemt",
"m365_ai_key_not_set": "Ingen API-nøgle gemt",
"m365_ai_test": "Test nøgle",
"m365_ai_testing": "Tester…",
"m365_ai_test_ok": "API-nøgle er gyldig",
"m365_ai_test_fail": "Test mislykkedes",
"m365_ai_saved": "Gemt",
"m365_ai_model_note": "Model: claude-haiku-4-5 · faktureres efter Anthropics token-priser · resultater caches pr. dokument.",
"m365_settings_updates": "Softwareopdatering",
"m365_update_idle": "Tjek om der findes en nyere version.",
"m365_update_auto": "Installér opdateringer automatisk (tjekkes dagligt — programmet genstarter selv)",
"m365_update_check": "Søg efter opdateringer",
"m365_update_install": "Installér opdatering",
"m365_update_checking": "Tjekker…",
"m365_update_uptodate": "Du kører den nyeste version.",
"m365_update_available": "Opdatering tilgængelig",
"m365_update_installing": "Installerer opdatering — programmet genstarter…",
"m365_update_failed": "Opdateringstjek mislykkedes",
"m365_update_scan_running": "Kan ikke opdatere, mens en scanning kører."
}


@ -164,6 +164,13 @@
"lbl_working": "Wird bearbeitet…",
"lbl_stopping": "Wird gestoppt…",
"lbl_loading": "Wird geladen…",
"history_lbl": "Verlauf",
"history_items": "Treffer",
"history_btn_sessions": "Sessionen",
"history_btn_latest": "Offene Einträge",
"history_picker_empty": "Keine früheren Scans",
"history_delta_badge": "Delta",
"history_latest_badge": "Aktuell",
"lbl_blurred": "Unscharf gemacht",
"lbl_none": "Keine",
"lbl_size": "Größe",
@ -341,8 +348,9 @@
"m365_resuming": "Fortsetzen — bereits gescannte Elemente werden übersprungen…",
"m365_opt_delta": "Delta-Scan",
"m365_opt_delta_hint": "Nur geänderte Elemente (nach erstem Vollscan)",
"m365_delta_tokens_saved": "Tokens gespeichert",
"m365_delta_tokens_saved": "Tokens für {n} Quelle(n) gespeichert",
"m365_delta_clear": "Tokens löschen",
"m365_delta_tokens_hint": "Gespeicherte Änderungstokens lassen Delta-Scans nur Elemente abrufen, die seit dem letzten Scan geändert wurden. Tokens löschen erzwingt beim nächsten Scan einen Vollscan.",
"m365_delta_cleared": "Delta-Tokens gelöscht — nächster Scan wird ein Vollscan.",
"m365_delta_mode": "Delta-Modus — nur geänderte Elemente werden abgerufen…",
"m365_smtp_title": "✉ Bericht senden",
@ -357,6 +365,8 @@
"m365_smtp_recipients": "Empfänger",
"m365_smtp_recipients_hint": "Komma- oder semikolongetrennt",
"m365_smtp_save": "Speichern",
"m365_smtp_auto_email_manual": "Bericht nach manueller Suche senden",
"m365_smtp_prefer_smtp": "Immer via SMTP senden (Microsoft Graph überspringen)",
"m365_smtp_send": "Jetzt senden",
"m365_smtp_saved": "Einstellungen gespeichert.",
"m365_smtp_sending": "Senden…",
@ -551,15 +561,32 @@
"m365_db_import_mode": "Modus:",
"m365_db_import_merge": "Zusammenführen (sicher)",
"m365_db_import_replace": "Ersetzen (vollständige Wiederherstellung)",
"m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdpr_scanner.db haben.",
"m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdpr_scanner.db haben.\\n\\nFortfahren?",
"m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdprscanner/scanner.db haben.",
"m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdprscanner/scanner.db haben.\\n\\nFortfahren?",
"m365_db_import_no_file": "Bitte wählen Sie zuerst eine ZIP-Datei aus.",
"m365_db_importing": "Importiere…",
"m365_db_imported": "Importiert",
"m365_db_import_run": "Importieren",
"m365_opt_scan_photos": "Fotos nach Gesichtern durchsuchen",
"m365_opt_scan_photos_hint": "Markiert Bilder mit erkannten Gesichtern als biometrische Daten gem. Art. 9. Langsamer — bei Bedarf aktivieren.",
"m365_opt_skip_gps": "GPS in Bildern ignorieren",
"m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.",
"m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei",
"m365_opt_scan_emails": "E-Mail-Adressen scannen",
"m365_opt_scan_emails_hint": "Markiert Dateien mit E-Mail-Adressen. Standardmäßig deaktiviert — E-Mail-Adressen sind sehr häufig und können viele Treffer erzeugen.",
"m365_opt_scan_phones": "Telefonnummern scannen",
"m365_opt_scan_phones_hint": "Markiert Dateien mit dänischen Telefonnummern (8 Ziffern). Nützlich zum Auffinden von Kontaktlisten.",
"m365_badge_emails": "E-Mail",
"m365_badge_phones": "Tel.",
"m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
"m365_opt_cpr_only": "Nur-CPR-Modus",
"m365_opt_cpr_only_hint": "Markiert nur Dateien mit CPR-Nummern. Dateien mit nur E-Mail-Adressen, Telefonnummern, Gesichtern oder EXIF-Metadaten werden ignoriert.",
"m365_opt_ocr_lang": "OCR-Sprache",
"m365_opt_ocr_lang_hint": "Tesseract-Sprachpaket(e) für das Scannen von gescannten PDFs und Bildern. Pakete müssen auf dem Server installiert sein (z.B. tesseract-ocr-dan). Mehrere Pakete: dan+eng.",
"m365_filter_photo_only": "📷 Fotos / biometrisch",
"m365_filter_all_roles": "Alle Rollen",
"m365_filter_staff": "Personal",
"m365_filter_student": "Schüler",
"m365_badge_faces": "Gesichter",
"a30_photo_items": "Fotos mit erkannten Gesichtern (Art. 9 biometrisch)",
"a30_photo_note": "Fotografien identifizierbarer Personen sind biometrische Daten gemäß Art. 9 DSGVO. Die Aufbewahrung erfordert eine dokumentierte Rechtsgrundlage gemäß Art. 9(2). Für Schulfotos von Schülern unter 15 Jahren ist die elterliche Einwilligung erforderlich (Databeskyttelsesloven §6). Siehe Leitfaden des Datatilsynet zur Schulfotografie.",
@ -583,16 +610,47 @@
"m365_file_sources_empty": "Keine Dateiquellen konfiguriert. Fügen Sie unten einen lokalen Ordner oder eine Netzwerkfreigabe hinzu.",
"m365_file_sources_add": "Quelle hinzufügen",
"m365_fsrc_label": "Bezeichnung",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Pfad",
"m365_fsrc_smb_detected": "SMB/CIFS-Netzwerkfreigabe erkannt",
"m365_fsrc_smb_host": "SMB-Host",
"m365_fsrc_smb_user": "Benutzername",
"m365_fsrc_smb_pw": "Passwort",
"m365_fsrc_smb_pw_hint": "Das Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_pw_keychain_placeholder": "Im OS-Schlüsselbund gespeichert",
"m365_fsrc_add_btn": "Hinzufügen",
"m365_fsrc_saved": "Quelle gespeichert",
"m365_fsrc_saving": "Speichern...",
"m365_fsrc_path_required": "Pfad ist erforderlich.",
"m365_fsrc_type_local": "Lokaler Ordner",
"m365_fsrc_type_smb": "Netzwerkfreigabe (SMB)",
"m365_fsrc_type_sftp": "SFTP-Server",
"m365_fsrc_sftp_host": "SFTP-Host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Benutzername",
"m365_fsrc_sftp_remote_path": "Remote-Pfad",
"m365_fsrc_sftp_auth_password": "Passwort",
"m365_fsrc_sftp_auth_key": "SSH-Schlüssel",
"m365_fsrc_sftp_pw": "Passwort",
"m365_fsrc_sftp_pw_hint": "Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_key_upload": "Private Schlüsseldatei",
"m365_fsrc_sftp_key_btn": "Schlüssel hochladen",
"m365_fsrc_sftp_key_uploaded": "Schlüssel hochgeladen",
"m365_fsrc_sftp_passphrase": "Passphrase (wenn Schlüssel verschlüsselt ist)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_not_installed": "paramiko nicht installiert — ausführen: pip install paramiko",
"m365_fsrc_name_placeholder": "z.B. Lehrerdateien, NAS-Archiv",
"m365_fsrc_path_placeholder": "~/Dokumente oder //nas/freigaben",
"m365_fsrc_smb_host_placeholder": "nas.schule.de",
"m365_fsrc_smb_user_placeholder": "DOMÄNE\\Benutzername",
"m365_fsrc_smb_user_edit_placeholder": "DOMÄNE\\Benutzername oder Benutzername",
"m365_fsrc_sftp_host_placeholder": "sftp.schule.de",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leer lassen, wenn der Schlüssel nicht verschlüsselt ist",
"m365_fsrc_sftp_host_required": "SFTP-Host ist erforderlich.",
"m365_fsrc_sftp_user_required": "SFTP-Benutzername ist erforderlich.",
"m365_fsrc_scan_btn": "Scannen",
"m365_fsrc_scan_start": "Datei-Scan wird gestartet",
"m365_src_group_files": "Dateiquellen",
@ -619,6 +677,14 @@
"m365_settings_tab_general": "Allgemein",
"m365_settings_tab_email": "E-Mail-Bericht",
"m365_settings_tab_database": "Datenbank",
"m365_settings_tab_auditlog": "Prüfprotokoll",
"m365_audit_title": "Compliance-Prüfprotokoll",
"m365_audit_col_time": "Zeitpunkt",
"m365_audit_col_action": "Aktion",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Wird geladen…",
"m365_audit_empty": "Noch keine Prüfereignisse aufgezeichnet.",
"m365_settings_appearance": "Erscheinungsbild",
"m365_settings_language": "Sprache",
"m365_settings_theme": "Design",
@ -655,7 +721,23 @@
"m365_smtp_test": "Testen",
"m365_smtp_testing": "Test-E-Mail wird gesendet…",
"m365_smtp_test_ok": "Test-E-Mail gesendet",
"m365_smtp_test_ok_graph": "Test-E-Mail über Microsoft Graph gesendet an",
"m365_smtp_test_ok_smtp": "Test-E-Mail über SMTP gesendet an",
"m365_smtp_graph_also_failed": "(⚠ Graph fehlgeschlagen — Mail.Send nicht erteilt)",
"m365_smtp_test_fail": "Verbindung fehlgeschlagen",
"bulk_select_mode": "Auswählen",
"bulk_select_all": "Alle sichtbaren auswählen",
"bulk_deselect_all": "Alle abwählen",
"bulk_apply": "Anwenden",
"bulk_done": "Fertig",
"bulk_selected": "ausgewählt",
"bulk_applied": "aktualisiert",
"disp_stats_total": "gesamt",
"disp_stats_unreviewed": "nicht überprüft",
"disp_stats_retain": "behalten",
"disp_stats_delete": "löschen",
"disp_stats_other": "sonstige",
"disp_stats_reviewed": "überprüft",
"m365_fsrc_edit_btn": "Bearbeiten",
"m365_fsrc_save_changes": "Änderungen speichern",
"m365_settings_tab_scheduler": "Zeitplaner",
@ -673,6 +755,8 @@
"m365_sched_after_scan": "Nach dem Scan",
"m365_sched_auto_email": "Bericht automatisch senden",
"m365_sched_auto_retention": "Aufbewahrungsrichtlinie durchsetzen",
"m365_sched_report_only": "Nur Bericht",
"m365_sched_report_only_hint": "Letzte Scanergebnisse senden, ohne einen neuen Scan durchzuführen. Erfordert Scanergebnisse in der Datenbank.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Jetzt ausführen",
"m365_sched_add": "+ Geplante Suche hinzufügen",
@ -681,6 +765,9 @@
"m365_sched_editor_edit": "Geplante Suche bearbeiten",
"m365_sched_name_required": "Name ist erforderlich",
"m365_sched_no_runs": "Noch keine geplanten Läufe",
"m365_sched_no_jobs": "Noch keine geplanten Scans.",
"m365_sched_running": "Läuft...",
"m365_sched_disabled": "Deaktiviert",
"m365_sched_freq_daily": "Täglich",
"m365_sched_freq_weekly": "Wöchentlich",
"m365_sched_freq_monthly": "Monatlich",
@ -728,9 +815,7 @@
"role_staff": "Personal",
"role_student": "Schüler",
"role_other": "Andere",
"m365_settings_tab_security": "Sicherheit",
"share_modal_title": "Ergebnisse teilen",
"share_modal_desc": "Schreibgeschützte Links ermöglichen einem Datenschutzbeauftragten oder Prüfer, Ergebnisse einzusehen und Verwendungszwecke zuzuweisen, ohne Zugriff auf Scansteuerung oder Anmeldedaten.",
"share_new_link": "Neuer Link",
@ -759,9 +844,20 @@
"share_create_error": "Link konnte nicht erstellt werden:",
"share_revoke_confirm": "Diesen Link widerrufen? Alle Nutzer verlieren sofort den Zugriff.",
"share_revoke_error": "Widerrufen fehlgeschlagen:",
"share_scope_lbl": "Bereich",
"share_scope_all": "Alle",
"share_scope_type_role": "Rolle",
"share_scope_type_user": "Benutzer",
"share_date_from": "Elemente ab",
"share_date_to": "Elemente bis",
"share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Benutzer-E-Mail",
"share_scope_user_placeholder": "alice@schule.de",
"share_scope_user_invalid": "Bitte gib eine gültige E-Mail-Adresse für den Benutzerbereich an.",
"share_scope_staff": "Mitarbeitende",
"share_scope_student": "Schüler",
"viewer_pin_group_title": "Betrachter-PIN",
"viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen \u2013 ohne Token-Link.",
"viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen ohne Token-Link.",
"viewer_pin_clear": "PIN löschen",
"viewer_pin_is_set": "Betrachter-PIN ist festgelegt",
"viewer_pin_not_set_msg": "Keine PIN festgelegt — /view erfordert einen Token-Link",
@ -769,5 +865,44 @@
"viewer_pin_saving": "Wird gespeichert…",
"viewer_pin_saved": "PIN gespeichert",
"viewer_pin_clear_confirm": "Betrachter-PIN entfernen? /view erfordert dann wieder einen Token-Link.",
"viewer_pin_cleared": "PIN gelöscht"
"viewer_pin_cleared": "PIN gelöscht",
"interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "Eine numerische PIN (48 Stellen), die eingegeben werden muss, bevor auf die Scanner-Oberfläche zugegriffen werden kann. Betrachter, die <code style=\"font-size:10px\">/view</code> aufrufen, sind nicht betroffen.",
"interface_pin_clear": "PIN löschen",
"interface_pin_is_set": "Interface-PIN ist gesetzt",
"interface_pin_not_set_msg": "Keine PIN gesetzt — Oberfläche ist für alle im Netzwerk offen",
"interface_pin_saved": "PIN gespeichert",
"interface_pin_clear_confirm": "Interface-PIN entfernen? Der Scanner ist dann für alle im Netzwerk zugänglich.",
"interface_pin_cleared": "PIN gelöscht",
"interface_pin_login_desc": "Interface-PIN eingeben, um fortzufahren.",
"interface_pin_login_btn": "Weiter",
"interface_pin_err_incorrect": "Falsche PIN.",
"interface_pin_err_too_many": "Zu viele Versuche. Bitte später erneut versuchen.",
"interface_pin_err_network": "Netzwerkfehler. Bitte erneut versuchen.",
"m365_settings_tab_ai": "KI / NER",
"m365_ai_title": "KI-gestützte Entitätserkennung",
"m365_ai_desc": "Claude KI statt spaCy für Name-, Adress- und Organisationserkennung verwenden. Deutlich genauer bei dänischen Texten — insbesondere bei Doppelnamen und fremdsprachigen Namen. Benötigt einen Anthropic-API-Schlüssel; Abrechnung per Token.",
"m365_ai_enable": "Claude NER aktivieren",
"m365_ai_api_key_label": "Anthropic-API-Schlüssel",
"m365_ai_show_key": "Anzeigen",
"m365_ai_hide_key": "Ausblenden",
"m365_ai_key_set": "API-Schlüssel gespeichert",
"m365_ai_key_not_set": "Kein API-Schlüssel gespeichert",
"m365_ai_test": "Schlüssel testen",
"m365_ai_testing": "Wird getestet…",
"m365_ai_test_ok": "API-Schlüssel gültig",
"m365_ai_test_fail": "Test fehlgeschlagen",
"m365_ai_saved": "Gespeichert",
"m365_ai_model_note": "Modell: claude-haiku-4-5 · Abrechnung nach Anthropic-Token-Tarifen · Ergebnisse werden pro Dokument gecacht.",
"m365_settings_updates": "Softwareaktualisierung",
"m365_update_idle": "Prüfen, ob eine neuere Version verfügbar ist.",
"m365_update_auto": "Updates automatisch installieren (tägliche Prüfung — die App startet sich selbst neu)",
"m365_update_check": "Nach Updates suchen",
"m365_update_install": "Update installieren",
"m365_update_checking": "Wird geprüft…",
"m365_update_uptodate": "Sie verwenden die neueste Version.",
"m365_update_available": "Update verfügbar",
"m365_update_installing": "Update wird installiert — die App startet neu…",
"m365_update_failed": "Updateprüfung fehlgeschlagen",
"m365_update_scan_running": "Update nicht möglich, während ein Scan läuft."
}

@ -103,6 +103,13 @@
"lbl_time": "Time",
"lbl_space": "Space",
"lbl_loading": "Loading…",
"history_lbl": "History",
"history_items": "items",
"history_btn_sessions": "Sessions",
"history_btn_latest": "Open items",
"history_picker_empty": "No past scans",
"history_delta_badge": "Delta",
"history_latest_badge": "Latest",
"lbl_blurred": "Blurred",
"lbl_none": "None",
"lbl_scanner": "Scanner",
@ -341,8 +348,9 @@
"m365_resuming": "Resuming — skipping already-scanned items…",
"m365_opt_delta": "Delta scan",
"m365_opt_delta_hint": "Changed items only (after first full scan)",
"m365_delta_tokens_saved": "Tokens saved",
"m365_delta_tokens_saved": "Tokens saved for {n} source(s)",
"m365_delta_clear": "Clear tokens",
"m365_delta_tokens_hint": "Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.",
"m365_delta_cleared": "Delta tokens cleared — next scan will be a full scan.",
"m365_delta_mode": "Delta mode — fetching changed items only…",
"m365_smtp_title": "✉ Email report",
@ -357,6 +365,8 @@
"m365_smtp_recipients": "Recipients",
"m365_smtp_recipients_hint": "Comma or semicolon separated",
"m365_smtp_save": "Save",
"m365_smtp_auto_email_manual": "Email report after manual scan",
"m365_smtp_prefer_smtp": "Always send via SMTP (skip Microsoft Graph)",
"m365_smtp_send": "Send now",
"m365_smtp_saved": "Settings saved.",
"m365_smtp_sending": "Sending…",
@ -551,15 +561,32 @@
"m365_db_import_mode": "Mode:",
"m365_db_import_merge": "Merge (safe)",
"m365_db_import_replace": "Replace (full restore)",
"m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.",
"m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\\n\\nProceed?",
"m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.",
"m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\\n\\nProceed?",
"m365_db_import_no_file": "Please select a ZIP file first.",
"m365_db_importing": "Importing…",
"m365_db_imported": "Imported",
"m365_db_import_run": "Import",
"m365_opt_scan_photos": "Scan photos for faces",
"m365_opt_scan_photos_hint": "Flags images with detected faces as Art. 9 biometric data. Slower — opt in.",
"m365_opt_skip_gps": "Ignore GPS in images",
"m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.",
"m365_opt_min_cpr": "Min. CPR count per file",
"m365_opt_scan_emails": "Scan for email addresses",
"m365_opt_scan_emails_hint": "Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.",
"m365_opt_scan_phones": "Scan for phone numbers",
"m365_opt_scan_phones_hint": "Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.",
"m365_badge_emails": "email",
"m365_badge_phones": "phone",
"m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
"m365_opt_cpr_only": "CPR-only mode",
"m365_opt_cpr_only_hint": "Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, detected faces, or EXIF metadata are skipped.",
"m365_opt_ocr_lang": "OCR language",
"m365_opt_ocr_lang_hint": "Tesseract language pack(s) used when scanning scanned PDFs and images. Language packs must be installed on the server (e.g. tesseract-ocr-dan). Multiple packs: dan+eng.",
"m365_filter_photo_only": "📷 Photos / biometric",
"m365_filter_all_roles": "All roles",
"m365_filter_staff": "Staff",
"m365_filter_student": "Students",
"m365_badge_faces": "faces",
"a30_photo_items": "Photos with detected faces (Art. 9 biometric)",
"a30_photo_note": "Photographs of identifiable persons are biometric data under Art. 9 GDPR. Retention requires a documented legal basis under Art. 9(2). For school photographs of pupils under 15, parental consent is required (Databeskyttelsesloven §6). See Datatilsynet guidance on school photography.",
@ -583,16 +610,47 @@
"m365_file_sources_empty": "No file sources configured. Add a local folder or network share below.",
"m365_file_sources_add": "Add source",
"m365_fsrc_label": "Label",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Path",
"m365_fsrc_smb_detected": "SMB/CIFS network share detected",
"m365_fsrc_smb_host": "SMB host",
"m365_fsrc_smb_user": "Username",
"m365_fsrc_smb_pw": "Password",
"m365_fsrc_smb_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_pw_keychain_placeholder": "Stored in OS keychain",
"m365_fsrc_add_btn": "Add",
"m365_fsrc_saved": "Source saved",
"m365_fsrc_saving": "Saving...",
"m365_fsrc_path_required": "Path is required.",
"m365_fsrc_type_local": "Local folder",
"m365_fsrc_type_smb": "Network share (SMB)",
"m365_fsrc_type_sftp": "SFTP server",
"m365_fsrc_sftp_host": "SFTP host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Username",
"m365_fsrc_sftp_remote_path": "Remote path",
"m365_fsrc_sftp_auth_password": "Password",
"m365_fsrc_sftp_auth_key": "SSH key",
"m365_fsrc_sftp_pw": "Password",
"m365_fsrc_sftp_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_key_upload": "Private key file",
"m365_fsrc_sftp_key_btn": "Upload key",
"m365_fsrc_sftp_key_uploaded": "Key uploaded",
"m365_fsrc_sftp_passphrase": "Passphrase (if key is encrypted)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_not_installed": "paramiko not installed — run: pip install paramiko",
"m365_fsrc_name_placeholder": "e.g. Teacher files, NAS archive",
"m365_fsrc_path_placeholder": "~/Documents or //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.school.dk",
"m365_fsrc_smb_user_placeholder": "DOMAIN\\username",
"m365_fsrc_smb_user_edit_placeholder": "DOMAIN\\username or username",
"m365_fsrc_sftp_host_placeholder": "sftp.school.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leave blank if key has no passphrase",
"m365_fsrc_sftp_host_required": "SFTP host is required.",
"m365_fsrc_sftp_user_required": "SFTP username is required.",
"m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starting file scan",
"m365_src_group_files": "File sources",
@ -619,6 +677,14 @@
"m365_settings_tab_general": "General",
"m365_settings_tab_email": "Email report",
"m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Audit Log",
"m365_audit_title": "Compliance Audit Log",
"m365_audit_col_time": "Time",
"m365_audit_col_action": "Action",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Loading…",
"m365_audit_empty": "No audit events recorded yet.",
"m365_settings_appearance": "Appearance",
"m365_settings_language": "Language",
"m365_settings_theme": "Theme",
@ -655,7 +721,23 @@
"m365_smtp_test": "Test",
"m365_smtp_testing": "Sending test email…",
"m365_smtp_test_ok": "Test email sent",
"m365_smtp_test_ok_graph": "Test email sent via Microsoft Graph to",
"m365_smtp_test_ok_smtp": "Test email sent via SMTP to",
"m365_smtp_graph_also_failed": "(⚠ Graph also failed — Mail.Send not granted)",
"m365_smtp_test_fail": "Connection failed",
"bulk_select_mode": "Select",
"bulk_select_all": "Select all visible",
"bulk_deselect_all": "Deselect all",
"bulk_apply": "Apply",
"bulk_done": "Done",
"bulk_selected": "selected",
"bulk_applied": "updated",
"disp_stats_total": "total",
"disp_stats_unreviewed": "unreviewed",
"disp_stats_retain": "retain",
"disp_stats_delete": "delete",
"disp_stats_other": "other",
"disp_stats_reviewed": "reviewed",
"m365_fsrc_edit_btn": "Edit",
"m365_fsrc_save_changes": "Save changes",
"m365_settings_tab_scheduler": "Scheduler",
@ -673,6 +755,8 @@
"m365_sched_after_scan": "After scan",
"m365_sched_auto_email": "Email report automatically",
"m365_sched_auto_retention": "Enforce retention policy",
"m365_sched_report_only": "Report only",
"m365_sched_report_only_hint": "Email the latest scan results without running a new scan. Requires scan results in the database.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Run now",
"m365_sched_add": "+ Add scheduled scan",
@ -681,6 +765,9 @@
"m365_sched_editor_edit": "Edit scheduled scan",
"m365_sched_name_required": "Name is required",
"m365_sched_no_runs": "No scheduled runs yet",
"m365_sched_no_jobs": "No scheduled scans yet.",
"m365_sched_running": "Running...",
"m365_sched_disabled": "Disabled",
"m365_sched_freq_daily": "Daily",
"m365_sched_freq_weekly": "Weekly",
"m365_sched_freq_monthly": "Monthly",
@ -728,9 +815,7 @@
"role_staff": "Staff",
"role_student": "Student",
"role_other": "Other",
"m365_settings_tab_security": "Security",
"share_modal_title": "Share results",
"share_modal_desc": "Read-only links let a DPO or reviewer browse results and tag dispositions without access to scan controls or credentials.",
"share_new_link": "New link",
@ -759,15 +844,65 @@
"share_create_error": "Failed to create link:",
"share_revoke_confirm": "Revoke this link? Anyone using it will immediately lose access.",
"share_revoke_error": "Failed to revoke:",
"share_scope_lbl": "Scope",
"share_scope_all": "All",
"share_scope_type_role": "Role",
"share_scope_type_user": "User",
"share_date_from": "Items from",
"share_date_to": "Items until",
"share_scope_role_lbl": "Role",
"share_scope_user_lbl": "User email",
"share_scope_user_placeholder": "alice@school.dk",
"share_scope_user_invalid": "Please enter a valid email address for the user scope.",
"share_scope_staff": "Staff",
"share_scope_student": "Students",
"viewer_pin_group_title": "Viewer PIN",
"viewer_pin_desc": "A numeric PIN (4\u20138 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.",
"viewer_pin_desc": "A numeric PIN (48 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.",
"viewer_pin_clear": "Clear PIN",
"viewer_pin_is_set": "Viewer PIN is set",
"viewer_pin_not_set_msg": "No PIN set \u2014 /view requires a token link",
"viewer_pin_format": "PIN must be 4\u20138 digits.",
"viewer_pin_saving": "Saving\u2026",
"viewer_pin_not_set_msg": "No PIN set /view requires a token link",
"viewer_pin_format": "PIN must be 48 digits.",
"viewer_pin_saving": "Saving",
"viewer_pin_saved": "PIN saved",
"viewer_pin_clear_confirm": "Remove the viewer PIN? /view will require a token link again.",
"viewer_pin_cleared": "PIN cleared"
"viewer_pin_cleared": "PIN cleared",
"interface_pin_group_title": "Interface PIN",
"interface_pin_desc": "A numeric PIN (48 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style=\"font-size:10px\">/view</code> are not affected.",
"interface_pin_clear": "Clear PIN",
"interface_pin_is_set": "Interface PIN is set",
"interface_pin_not_set_msg": "No PIN set — interface is open to anyone on the network",
"interface_pin_saved": "PIN saved",
"interface_pin_clear_confirm": "Remove the interface PIN? The scanner will be accessible to anyone on the network.",
"interface_pin_cleared": "PIN cleared",
"interface_pin_login_desc": "Enter the interface PIN to continue.",
"interface_pin_login_btn": "Continue",
"interface_pin_err_incorrect": "Incorrect PIN.",
"interface_pin_err_too_many": "Too many attempts. Try again later.",
"interface_pin_err_network": "Network error. Please try again.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-Enhanced Named Entity Recognition",
"m365_ai_desc": "Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.",
"m365_ai_enable": "Enable Claude NER",
"m365_ai_api_key_label": "Anthropic API key",
"m365_ai_show_key": "Show",
"m365_ai_hide_key": "Hide",
"m365_ai_key_set": "API key saved",
"m365_ai_key_not_set": "No API key saved",
"m365_ai_test": "Test key",
"m365_ai_testing": "Testing…",
"m365_ai_test_ok": "API key valid",
"m365_ai_test_fail": "Test failed",
"m365_ai_saved": "Saved",
"m365_ai_model_note": "Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.",
"m365_settings_updates": "Software update",
"m365_update_idle": "Check whether a newer version is available.",
"m365_update_auto": "Install updates automatically (checked daily — the app restarts itself)",
"m365_update_check": "Check for updates",
"m365_update_install": "Install update",
"m365_update_checking": "Checking…",
"m365_update_uptodate": "You are running the latest version.",
"m365_update_available": "Update available",
"m365_update_installing": "Installing update — the app will restart…",
"m365_update_failed": "Update check failed",
"m365_update_scan_running": "Cannot update while a scan is running."
}

@ -39,9 +39,11 @@ except ImportError:
GRAPH_BASE = "https://graph.microsoft.com/v1.0"
# Delegated scopes — used when signing in as a specific user (device code flow)
# Files.ReadWrite.All is a superset of Files.Read.All; required for in-place
# OneDrive/SharePoint/Teams redaction (PUT /drives/{id}/items/{id}/content).
SCOPES = [
"Mail.Read",
"Files.Read.All",
"Files.ReadWrite.All",
"Sites.Read.All",
"Team.ReadBasic.All",
"ChannelMessage.Read.All",
@ -82,8 +84,9 @@ class M365PermissionError(M365Error):
f"to access this resource.\n"
f" Path: {path}\n"
f" Fix: the signed-in user must be a Global/Exchange Admin, OR an admin must "
f"grant Application permissions (Mail.Read, Files.Read.All, Sites.Read.All) "
f"in Azure → App registrations → API permissions → Grant admin consent."
f"grant Application permissions (Mail.Read, Files.ReadWrite.All, Sites.Read.All) "
f"in Azure → App registrations → API permissions → Grant admin consent.\n"
f" Note: Files.ReadWrite.All (not Files.Read.All) is required for file redaction."
)
@ -93,6 +96,17 @@ class M365DeltaTokenExpired(M365Error):
    pass


class M365DriveNotFound(M365Error):
    """Raised when the Graph API returns 404 for a drive/root path.

    Common causes: OneDrive licence not assigned, service plan disabled,
    drive not yet provisioned (user has never signed in), or account
    suspended/deleted. Not a scan error — callers should skip the user
    and log at a lower severity.
    """
    pass


class M365Connector:
    def __init__(self, client_id: str, tenant_id: str, client_secret: str = ""):
        if not MSAL_OK:
@ -425,6 +439,8 @@ class M365Connector:
                except Exception:
                    msg = r.text[:200]
                raise M365PermissionError(path, msg)
            if r.status_code == 404:
                raise M365DriveNotFound(f"404 Not Found: {path}")
            r.raise_for_status()
            return r.json()
        raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
@ -460,7 +476,7 @@ class M365Connector:
msg = r.text[:200]
raise M365PermissionError(path, msg)
r.raise_for_status()
return r.json()
return r.json() if r.content else {}
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
def _get_bytes(self, url: str, _retry: bool = True) -> bytes:
@ -536,6 +552,8 @@ class M365Connector:
            r.raise_for_status()
            return True  # 204 No Content = success
        raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")

    def delete_message(self, user_id: str, message_id: str) -> bool:
        """Move an email to Deleted Items (soft delete)."""
        base = "/me" if (not user_id or user_id == "me") else f"/users/{user_id}"
        try:
@ -872,6 +890,50 @@ class M365Connector:
        url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
        return self._get_bytes(url)

    def put_drive_item_content(self, drive_id: str, item_id: str, content: bytes,
                               user_id: str = "") -> None:
        """Replace file content via Graph. Tries drives/{drive_id} first; falls back
        to users/{user_id}/drive when drive_id is absent, then /me/drive."""
        if drive_id:
            url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
        elif user_id and user_id != "me":
            url = f"{GRAPH_BASE}/users/{user_id}/drive/items/{item_id}/content"
        else:
            url = f"{GRAPH_BASE}/me/drive/items/{item_id}/content"
        for attempt in range(self._MAX_RETRIES):
            try:
                r = _requests.put(url, headers={**self._headers(),
                                                "Content-Type": "application/octet-stream"},
                                  data=content, timeout=self._TIMEOUT_BYTES)
            except self._RETRYABLE_ERRORS:
                if attempt == self._MAX_RETRIES - 1:
                    raise
                self._backoff_sleep(attempt)
                continue
            if r.status_code == 429:
                self._backoff_sleep(attempt, float(r.headers.get("Retry-After", 5)))
                continue
            if r.status_code in (503, 504):
                if attempt < self._MAX_RETRIES - 1:
                    self._backoff_sleep(attempt)
                    continue
            if r.status_code == 401 and attempt == 0:
                self._token = None
                if self.try_silent_auth():
                    self.put_drive_item_content(drive_id, item_id, content, user_id)
                    return
            if r.status_code == 403:
                try:
                    msg = r.json().get("error", {}).get("message", "")
                except Exception:
                    msg = r.text[:200]
                raise M365PermissionError(url, msg)
            r.raise_for_status()
            return
        raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")

    # ── Teams ─────────────────────────────────────────────────────────────────

    def list_all_teams(self) -> list:

@ -13,10 +13,11 @@ pdfplumber>=0.11 # PDF text extraction
python-docx>=1.1 # Word document scanning
openpyxl>=3.1 # Excel scanning + export
# ── Image processing ──────────────────────────────────────────────────────────
# ── Image / video processing ─────────────────────────────────────────────────
Pillow>=10.0 # Image thumbnails + EXIF extraction (always-on)
opencv-python>=4.9 # Face detection (opt-in — Scan photos for faces)
numpy>=1.26 # Required by opencv-python
mutagen>=1.47 # Video metadata extraction (MP4/MOV/AVI — GPS, author, title)
# ── NER / PII detection ───────────────────────────────────────────────────────
# spaCy 3.7 supports Python 3.8–3.12. Do NOT upgrade past Python 3.12.
@ -36,12 +37,16 @@ pystray>=0.19 # System tray icon
# ── File system scanning (optional) ──────────────────────────────────────────
smbprotocol>=1.13 # SMB2/3 network share scanning without mounting
keyring>=25.0 # OS keychain credential storage for SMB passwords
paramiko>=3.4 # SFTP scanning over SSH
keyring>=25.0 # OS keychain credential storage for SMB/SFTP passwords
python-dotenv>=1.0 # .env file fallback for headless SMB credentials
# ── Scheduler (#19) ──────────────────────────────────────────────────────────
APScheduler>=3.10 # In-process scheduled scans
# ── AI NER (Claude) ──────────────────────────────────────────────────────────
anthropic>=0.40.0 # Claude API client for AI-enhanced NER
# ── Google Workspace scanning (#10) ──────────────────────────────────────────
google-auth>=2.0 # Service account + domain-wide delegation
google-auth-httplib2 # HTTP transport for google-auth

@ -5,6 +5,8 @@ SSE routes must live in `gdpr_scanner.py`, not blueprints — blueprints can't s
M365 scan emits `scan_done`; Google emits `google_scan_done`; file scan emits `file_scan_done`. Never mix them up.
**`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` (scan.js) unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set with no matching `scan_done` to clear it — `file_scan_done` checks `!S._m365ScanRunning` before re-enabling the scan button, so the button stays disabled permanently after the scan completes.
## scan_progress source field
All three scan engines must include `"source": "m365"` / `"google"` / `"file"` in every `scan_progress` SSE event. Never remove this field — the frontend uses it to route progress to the correct segment.
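A minimal sketch of how an engine could stamp every `scan_progress` payload with its source tag. `make_progress_event`, `VALID_SOURCES`, and the `done`/`total` fields are hypothetical names used for illustration only; they are not taken from `scan_engine.py`.

```python
import json

# The three source tags named above; anything else is rejected.
VALID_SOURCES = {"m365", "google", "file"}


def make_progress_event(source: str, done: int, total: int) -> str:
    """Build an SSE frame for scan_progress (hypothetical helper).

    Raises ValueError when the source tag is unknown, so an engine
    cannot emit a progress event the frontend is unable to route.
    """
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown scan_progress source: {source!r}")
    payload = {"source": source, "done": done, "total": total}
    # SSE wire format: event name, one data line, blank-line terminator.
    return f"event: scan_progress\ndata: {json.dumps(payload)}\n\n"
```

Routing the tag through one helper keeps the field from being dropped when a new progress call site is added.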
@ -14,6 +16,102 @@ All three scan engines must include `"source": "m365"` / `"google"` / `"file"` i
## Circular import prohibition
`scan_engine.py` and `gdpr_scanner.py` must not import each other. `scan_engine` imports from `sse`, `checkpoint`, `app_config`, `cpr_detector`; `gdpr_scanner` imports scan functions from `scan_engine`.
## `_scan_bytes` injection
`scan_engine.py` declares stub versions of `_scan_bytes` / `_scan_bytes_timeout` at module level. `gdpr_scanner.py` replaces them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` pulls them from `gdpr_scanner` via `__getattr__`. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
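The stub-then-replace half of this pattern can be sketched as follows. The modules are simulated with `types.ModuleType`, and the `(data, name)` signature, `real_scan_bytes`, and `inject` are assumptions made for the example, not the real `cpr_detector` API.

```python
import types

# Stand-in for scan_engine.py: declares a stub at module level.
scan_engine = types.ModuleType("scan_engine")


def _scan_bytes_stub(data: bytes, name: str) -> dict:
    raise RuntimeError("_scan_bytes not injected yet")


scan_engine._scan_bytes = _scan_bytes_stub


def _engine_scan(data: bytes, name: str) -> dict:
    # Look the function up on the module at call time, so a later
    # replacement is picked up without importing the implementation.
    return scan_engine._scan_bytes(data, name)


scan_engine.scan = _engine_scan


# Stand-in for the real detector implementation (hypothetical result shape).
def real_scan_bytes(data: bytes, name: str) -> dict:
    return {"name": name, "size": len(data)}


# Stand-in for gdpr_scanner.py startup: swap the stub for the real function.
def inject() -> None:
    scan_engine._scan_bytes = real_scan_bytes
```

Because the engine resolves `_scan_bytes` through the module attribute on every call, it never needs an import that points back at the application module.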
## M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler and broadcast as a red `scan_error`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped silently. **Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs.
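A simplified sketch of the hierarchy and of a caller that treats each exception differently. The class names come from the table above; `raise_for_graph_status` and `scan_user` are hypothetical helpers, the constructors are reduced to a single message argument, and the 410 case is shown without the delta-endpoint check that the real connector applies.

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365PermissionError(M365Error):
    pass


class M365DeltaTokenExpired(M365Error):
    pass


class M365DriveNotFound(M365Error):
    pass


def raise_for_graph_status(status: int, path: str) -> None:
    """Hypothetical helper: map a Graph status code onto the hierarchy."""
    if status == 403:
        raise M365PermissionError(f"403 Forbidden: {path}")
    if status == 404:
        raise M365DriveNotFound(f"404 Not Found: {path}")
    if status == 410:
        raise M365DeltaTokenExpired(f"410 Gone: {path}")


def scan_user(status: int, path: str) -> str:
    """Hypothetical caller: returns the action taken for each outcome."""
    try:
        raise_for_graph_status(status, path)
    except M365DriveNotFound:
        return "skipped"      # unprovisioned drive, lower severity
    except M365DeltaTokenExpired:
        return "full_scan"    # clear the token and rescan everything
    except M365PermissionError:
        return "scan_error"   # surfaced with a permission hint
    return "ok"
```

Raising a dedicated exception rather than returning a fallback value leaves the skip decision with the caller, which is the point of the warning above.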
## Export — routes/export.py
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items.
- **Excel Summary sheet** — shows all scanned sources (even with 0 items). Per-source tabs only created for sources with items.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function; GPS sheet, External transfers sheet, and Art.30 tables all see only the filtered subset. Filenames get `_elever` / `_ansatte` suffix.
- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, logs a `"redacted"` disposition. Source types: `local` (DOCX/XLSX/CSV/TXT/PDF, written via temp+move), `onedrive`/`sharepoint`/`teams` (Graph download → redact → PUT, requires `Files.ReadWrite.All`), `gdrive` (Drive API, requires `drive` scope), `sftp` (paramiko read/write, item must still be in `state.flagged_items`), `smb` (smbprotocol `FILE_SUPERSEDE`). **Keep `_redactExts`/`_cloudRedactExts` in `results.js` and `_REDACT_EXTS`/`_GDRIVE_MIME_MAP`/`_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
- **PDF redaction** — `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` (physical removal). Falls back to reportlab overlay if PyMuPDF absent. Text pages use `find_cpr_char_bboxes`; scanned pages use OCR at 200 DPI + `find_cpr_image_bboxes`.
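The zero-hit rule for summary tables can be sketched as follows. `summary_rows` is a hypothetical helper name; the item keys `source_type` and `cpr_count` are the ones used by the export code, and `scanned_sources` stands for the set returned by `get_session_sources()`.

```python
def summary_rows(flagged_items: list, scanned_sources: set) -> list:
    """Build (source, item_count, cpr_total) rows, including zero-hit sources."""
    by_source: dict = {}
    for item in flagged_items:
        by_source.setdefault(item.get("source_type", "other"), []).append(item)

    # by_source only knows sources with flagged items, so take the union
    # with the set of sources that were actually scanned.
    all_sources = set(scanned_sources) | set(by_source)

    rows = []
    for src in sorted(all_sources):
        items = by_source.get(src, [])
        rows.append((src, len(items), sum(i.get("cpr_count", 0) for i in items)))
    return rows
```

A source that was scanned but produced nothing then appears with zero counts instead of disappearing from the table.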
## Preview — routes/database.py
`GET /api/preview/<item_id>?source_type=…&account_id=…` dispatches by `source_type`:
- **`local` / `smb`** — re-reads from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline.
- **`email`** — fetches M365 message body via Graph (requires `state.connector`).
- **`gmail`** — shows info card with "Open in Gmail" link (X-Frame-Options blocks embedding).
- **`gdrive`** — returns `https://drive.google.com/file/d/{id}/preview` iframe.
- **All other values** (M365 files) — calls Graph `/preview` POST; tries `drive_id`-based path first, then user-drive, then `/me/drive`.
**`_source_type` must be set in `google_scan.py`** — Gmail items need `meta["_source_type"] = "gmail"` and Drive items `"gdrive"` before `_broadcast_card`. Without it, cards fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
**`state.connector` guard** — only the `email` and M365 `else` branches require M365 auth. The `local`/`smb`/`gmail`/`gdrive` branches must not gate on `state.connector` — they work in Google-only deployments.
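The dispatch order and the connector guard described above can be sketched like this. `preview_branch` and the branch labels it returns are hypothetical names for illustration; the real route returns JSON payloads rather than labels.

```python
def preview_branch(source_type: str, connector_available: bool) -> str:
    """Pick the preview branch for a source type (labels are illustrative)."""
    # These branches never need M365 auth, so they are checked first.
    if source_type in ("local", "smb"):
        return "disk"
    if source_type == "gmail":
        return "gmail_card"
    if source_type == "gdrive":
        return "gdrive_iframe"
    # Only the Graph-backed branches are gated on the connector.
    if not connector_available:
        return "not_authenticated"
    return "graph_email" if source_type == "email" else "graph_preview"
```

Placing the guard after the local and Google branches is what keeps previews working in a Google-only deployment.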
## Compliance audit log — gdpr_db.py + routes/
- **`audit_log` table** — created by `_DDL` (`CREATE TABLE IF NOT EXISTS`), auto-appears on next server start. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- **`log_audit_event(action, detail, actor, ip)`** — module-level helper; silently no-ops on any exception. Import: `from gdpr_db import log_audit_event as _audit`.
- **`GET /api/audit_log?limit=200&action=<filter>`** — in `routes/app_routes.py`. No auth gate.
- **Recorded events** — `profile_save/delete`, `token_create/revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear`, `source_add/update/delete`, `scheduler_job_save/delete`, `scan_start/stop`, `smtp_save`, `disposition`, `disposition_bulk`, `admin_pin_set/change`, `item_delete`, `item_redact`, `app_update`.
- **`actor` always empty** — no per-user login; field reserved for future use.
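A minimal sketch of the audit table and the no-op-on-failure helper, using an in-memory SQLite database. The column types and the `db` parameter are assumptions made for the example; only the column names and the `CREATE TABLE IF NOT EXISTS` approach come from the notes above.

```python
import sqlite3
import time

_DDL = """CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL, action TEXT, actor TEXT, detail TEXT, ip TEXT)"""

conn = sqlite3.connect(":memory:")
conn.execute(_DDL)

def log_audit_event(action, detail="", actor="", ip="", db=conn):
    """Record an event; never let an audit failure break the caller."""
    try:
        db.execute(
            "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?,?,?,?,?)",
            (time.time(), action, actor, detail, ip),
        )
        db.commit()
    except Exception:
        pass  # deliberate: auditing is best-effort

def get_audit_log(limit=200, action=None, db=conn):
    """Return the newest entries first, optionally filtered by action."""
    sql = "SELECT action, actor, detail, ip FROM audit_log"
    params = []
    if action:
        sql += " WHERE action = ?"
        params.append(action)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return db.execute(sql, params).fetchall()
```

Swallowing every exception means a broken database cannot block a scan or a settings save, at the cost of silently missing audit rows.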
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — Graph `sendMail` returns HTTP 202 with no body; `r.json()` on empty raises `JSONDecodeError`. Do not revert to unconditional `r.json()`.
- **Graph preferred over SMTP** — `smtp_test` and `send_report` try `_send_email_graph()` first; fall back to SMTP only if Graph raises. If Graph fails and no SMTP host saved, the Graph exception surfaces directly.
- **Auto-email after manual scan** — `_maybe_send_auto_email()` in `routes/scan.py` called from the `_run()` thread after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")`; no-ops if false, no flagged items, or no recipients.
- **Gmail vs Google Workspace** — auth error handlers check if SMTP username ends in `@gmail.com`/`@googlemail.com`; custom domains are treated as Google Workspace and error message points to the Workspace admin console.
- **Canonical SMTP config keys are `username` and `use_tls`** — all backend readers (`smtp_test`, `_send_report_email`, `_send_email_graph`) use these. The Settings → E-mailrapport tab (`scheduler.js`) historically saved `user`/`starttls`, which left `username` empty so `server.login()` was skipped and the server rejected the send. Frontend now sends the canonical keys, and `_load_smtp_config()` normalises legacy `user` → `username` / `starttls` → `use_tls` for already-saved configs. The send-report modal (`scan.js`) already used the canonical keys. Keep both UIs and the backend on `username`/`use_tls`.
- **Graph 202 ≠ delivered** — `_send_email_graph` returns on Graph's HTTP 202 (queued), and `smtp_test`/`send_report` treat that as success and never fall back to SMTP. A recipient on a domain Exchange Online considers an accepted/internal domain (e.g. a Google-hosted subdomain of the O365 domain) is silently dropped after the 202. There is no in-app fix for that routing; reaching such recipients requires SMTP (e.g. Google Workspace `smtp.gmail.com`/`smtp-relay.gmail.com`) or fixing Exchange Accepted Domains.
- **`prefer_smtp` config flag** — when truthy, `smtp_test`, `send_report`, and `_maybe_send_auto_email` (routes/scan.py) skip the Graph path entirely and send via SMTP. This is the in-app escape hatch for the Graph-202 routing trap above. The gate is `... and not smtp_cfg.get("prefer_smtp")` on each Graph branch — keep all three in sync. UI: `#st-smtpPreferSmtp` toggle (key `m365_smtp_prefer_smtp`), saved/loaded by `scheduler.js`.
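The legacy-key normalisation and the `prefer_smtp` gate can be sketched together. `normalise_smtp_config` and `choose_transport` are hypothetical names; the actual `_load_smtp_config()` is not shown here and may handle more cases.

```python
def normalise_smtp_config(cfg: dict) -> dict:
    """Map legacy keys (user, starttls) onto the canonical ones."""
    out = dict(cfg)
    if not out.get("username") and out.get("user"):
        out["username"] = out["user"]
    if "use_tls" not in out and "starttls" in out:
        out["use_tls"] = bool(out["starttls"])
    out.pop("user", None)
    out.pop("starttls", None)
    return out

def choose_transport(graph_authenticated: bool, smtp_cfg: dict) -> str:
    """Graph is tried first unless the user opted to always use SMTP."""
    if graph_authenticated and not smtp_cfg.get("prefer_smtp"):
        return "graph"
    return "smtp"
```

Keeping the gate in one helper like this is one way to avoid the three Graph branches drifting apart; the project itself repeats the condition inline.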
## Scheduler — scan_scheduler.py + routes/scheduler.py
- **Job config keys** — `id`, `name`, `enabled`, `frequency` (daily/weekly/monthly), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`.
- **`_execute_scan(job_id)`** — acquires per-job lock (`_running_jobs` set), records DB run via `db.begin_schedule_run()`, runs M365 → file → Google pipeline, then emails and applies retention. DB run finalised in `finally`.
- **Report-only path** — when `report_only=True`, short-circuits before M365 auth check, populates `_m.flagged_items` from `db.get_session_items()` if empty, calls `_send_email_report()`. Does NOT acquire scan lock; fails with `RuntimeError("No scan results available")` if DB is also empty.
- **`_m.flagged_items` and `state.flagged_items` are the same object** — assigned at startup; in-place updates (`flagged_items[:] = ...`) propagate to both.
- **`scheduler_started` / `scheduler_done` SSE events** — separate from `scan_done` (M365). `scheduler_done` carries `flagged`, `scanned`, `emailed`, `job_name`.
- **Profile options merge into file sources** — scheduler unpacks `{**fs, **_fs_extra}` before calling `run_file_scan(fs)`. Do not pass `fs` directly — the file scan reads `source.get(...)` and silently falls back to defaults without the merge.
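Why the merge matters can be shown with a small sketch. The option name `max_file_mb` and its default are hypothetical, chosen only to illustrate the `source.get(...)` fallback; the real option names are not listed in this document.

```python
def effective_source(fs: dict, fs_extra: dict) -> dict:
    # Later keys win, so profile options fill in or override source fields.
    return {**fs, **fs_extra}

def read_option(source: dict) -> int:
    # Mirrors the source.get(...) pattern: a missing key silently yields the default.
    return source.get("max_file_mb", 50)  # hypothetical option name and default

fs = {"path": "/srv/share", "label": "Share"}
fs_extra = {"max_file_mb": 200}
```

Passing `fs` alone would produce the default without any error, which is why the missing merge is easy to overlook.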
## Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered NER replacing spaCy. Activated via `config.json` keys `claude_ner` (bool) and `claude_api_key` (str, **Fernet-encrypted at rest** with an `enc:` prefix — same scheme as the SMTP password).
- **`ANTHROPIC_OK`** — module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths.
- **`_ner_claude(text, api_key)`** — calls `claude-haiku-4-5-20251001` in 8 000-char chunks. Thread-safe cache keyed by `hash(text)`, evicts oldest when > 2 000 entries.
- **Always read the key via `app_config.get_claude_api_key()`** — it decrypts and transparently handles legacy plaintext. Never read `config.json["claude_api_key"]` directly; `save_claude_config()` writes it encrypted.
- **`GET/POST /api/settings/claude`** — GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes key). POST accepts `{"enabled": bool, "api_key": "..."}` — omitting `api_key` leaves stored key unchanged.
- **`POST /api/settings/claude/test`** — minimal 8-token API call; returns `{"ok": true}` or `{"ok": false, "error": "..."}`.
- **Do not import `anthropic` at module level outside `document_scanner.py`** — `routes/app_routes.py` imports it locally inside the function body so the server starts without the package.
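The chunking and bounded cache described for `_ner_claude` can be sketched without calling any API. `chunks` and `BoundedCache` are hypothetical names, and eviction by insertion order is an assumption about what "evicts oldest" means.

```python
import threading
from collections import OrderedDict

CHUNK_CHARS = 8000
MAX_ENTRIES = 2000

def chunks(text: str, size: int = CHUNK_CHARS) -> list:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

class BoundedCache:
    """Thread-safe cache that drops the oldest entry once it grows too large."""

    def __init__(self, max_entries: int = MAX_ENTRIES):
        self._lock = threading.Lock()
        self._data = OrderedDict()
        self._max = max_entries

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value) -> None:
        with self._lock:
            self._data[key] = value
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict the oldest insertion
```

A caller would look up `hash(text)` in the cache first and only send the chunks to the model on a miss.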
## Software update — routes/updates.py
- **Git-checkout only** — `_supported()` requires a `.git` dir and not `sys.frozen`. The frozen desktop build gets `{"supported": false}` and the UI hides the Settings group.
- **`POST /api/update/apply`** — stash-if-dirty → `merge --ff-only origin/<branch>` → pip install only if `requirements.txt` changed → audit `app_update` → `_schedule_restart()` re-execs the process via `os.execv` (same PID; works under systemd and `start_gdpr.sh`). Refuses with `code: "scan_running"` (409) while `state._scan_lock` or `state._google_scan_lock` is held.
- **`apply_update()` never restarts itself** — callers decide. Tests patch `_schedule_restart`; the auto-update thread calls `_restart_self()` directly.
- **Auto-update thread** — `start_auto_update_thread()` called from `gdpr_scanner.py` `__main__`. Hourly tick, applies at most once per 24 h when `config.json["auto_update"]` is true; skips (and retries next tick) while a scan runs.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the same logic; keep stash/ff-only/requirements behaviour in sync.
## Viewer mode — routes/viewer.py
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form → 403. Never skip this order.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}`, `{"user": [...], "display_name": "..."}`, or `{}` in `viewer_tokens.json`. Enforced server-side in `GET /api/db/flagged`. **Column name is `user_role`** — do not use `role`.
- **`session["viewer_scope"]`** — set at `/view` token validation. `GET /api/db/flagged` reads `session.get("viewer_scope", {})` — defaults to `{}` (unrestricted) for PIN-authenticated sessions.
- **`viewer_tokens.json` format** — `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Old bare-list format handled transparently. Do not write as bare list.
- **Rate-limit state** (`_pin_attempts` dict) — in-memory only, resets on server restart. Intentional.
- **User-scoped tokens** — `scope.user` always a list; legacy single-string coerced on read. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400.
- **Date-range scoping** — `valid_from`/`valid_to` (YYYY-MM-DD) in scope dict; filtered via lexicographic string comparison in `GET /api/db/flagged`. Server validates format and enforces `valid_from ≤ valid_to`.
- **`app.secret_key`** — derived from `machine_id` bytes so sessions survive restarts. Set once at startup; do not override.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs intentionally keep `127.0.0.1`.
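The token-scope rules in this section (role, user list, date range) can be sketched as a single predicate. `normalise_users` and `in_scope` are hypothetical helper names; the row keys `user_role`, `account_id`, and `modified` follow the filtering in `GET /api/db/flagged`.

```python
def normalise_users(raw_user) -> set:
    """Accept the current list form or a legacy single string."""
    if isinstance(raw_user, list):
        return {u.lower() for u in raw_user if u}
    return {raw_user.lower()} if raw_user else set()

def in_scope(row: dict, scope: dict) -> bool:
    """Return True if a flagged row is visible under the given viewer scope."""
    role = scope.get("role", "")
    users = normalise_users(scope.get("user", ""))
    date_from = scope.get("valid_from", "")
    date_to = scope.get("valid_to", "")
    modified = row.get("modified") or ""

    if role and row.get("user_role", "") != role:
        return False
    # File-scan rows have an empty account_id, so they never match a user scope.
    if users and (row.get("account_id") or "").lower() not in users:
        return False
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    if date_from and modified < date_from:
        return False
    if date_to and modified > date_to:
        return False
    return True
```

An empty scope (`{}`) passes every check, which matches the unrestricted behaviour of PIN-authenticated sessions.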
## Gotchas
- **`_load_settings()` return** — does NOT include `file_sources`. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to.


@ -72,6 +72,50 @@ def get_lang_json():
     return jsonify(state.LANG)
+@bp.route("/api/audit_log")
+def audit_log_list():
+    """Return recent compliance audit log entries."""
+    try:
+        from gdpr_db import get_db as _get_db
+        limit = min(int(request.args.get("limit", 200)), 1000)
+        action = request.args.get("action") or None
+        return jsonify(_get_db().get_audit_log(limit=limit, action=action))
+    except Exception as e:
+        return jsonify({"error": str(e)}), 500
+@bp.route("/api/settings/claude", methods=["GET", "POST"])
+def claude_settings():
+    from app_config import get_claude_config, save_claude_config
+    if request.method == "GET":
+        return jsonify(get_claude_config())
+    data = request.get_json(silent=True) or {}
+    api_key = data.get("api_key")  # None = keep existing key
+    if api_key == "":
+        api_key = None  # empty string = don't change
+    save_claude_config(bool(data.get("enabled", False)), api_key)
+    return jsonify({"ok": True})
+@bp.route("/api/settings/claude/test", methods=["POST"])
+def claude_test():
+    from app_config import get_claude_api_key
+    api_key = get_claude_api_key()
+    if not api_key:
+        return jsonify({"ok": False, "error": "No API key saved"}), 400
+    try:
+        import anthropic
+        client = anthropic.Anthropic(api_key=api_key)
+        client.messages.create(
+            model="claude-haiku-4-5-20251001",
+            max_tokens=8,
+            messages=[{"role": "user", "content": "Hi"}],
+        )
+        return jsonify({"ok": True})
+    except Exception as e:
+        return jsonify({"ok": False, "error": str(e)}), 400
 @bp.route("/manual")
 def manual():
     """Serve the user manual as a styled, printable HTML page.


@ -11,11 +11,12 @@ from checkpoint import _clear_checkpoint, _DELTA_PATH
 from cpr_detector import _extract_exif, _html_esc, _placeholder_svg
 try:
-    from gdpr_db import get_db as _get_db
+    from gdpr_db import get_db as _get_db, log_audit_event as _audit
     DB_OK = True
 except ImportError:
     DB_OK = False
     def _get_db(*a, **kw): return None  # type: ignore[misc]
+    def _audit(*a, **kw): pass  # type: ignore[misc]
 try:
     import document_scanner as _ds  # noqa: F401
@ -70,6 +71,13 @@ def db_scans():
     return jsonify(_get_db().scans_list())
+@bp.route("/api/db/sessions")
+def db_sessions():
+    """List scan sessions (grouped concurrent scans), newest first."""
+    if not DB_OK: return jsonify([])
+    return jsonify(_get_db().get_sessions())
 @bp.route("/api/db/subject", methods=["POST"])
 def db_subject_lookup():
     """Find all items containing a given CPR number.
@ -133,9 +141,35 @@ def db_set_disposition():
notes = data.get("notes", ""),
reviewed_by = data.get("reviewed_by", ""),
)
+    _audit("disposition",
+           f"item_id={item_id!r} status={data.get('status','')!r}",
+           ip=request.remote_addr or "")
     return jsonify({"status": "saved"})
+@bp.route("/api/db/disposition/bulk", methods=["POST"])
+def db_set_disposition_bulk():
+    """Set the same disposition on multiple items at once.
+    Body: {item_ids: [...], status, legal_basis?, notes?, reviewed_by?}
+    """
+    if not DB_OK: return jsonify({"error": "database not available"}), 503
+    data = request.get_json() or {}
+    item_ids = data.get("item_ids", [])
+    status = data.get("status", "")
+    if not item_ids or not status:
+        return jsonify({"error": "item_ids and status required"}), 400
+    db = _get_db()
+    for iid in item_ids:
+        db.set_disposition(iid, status,
+                           legal_basis=data.get("legal_basis", ""),
+                           notes=data.get("notes", ""),
+                           reviewed_by=data.get("reviewed_by", ""))
+    _audit("disposition_bulk",
+           f"count={len(item_ids)} status={status!r}",
+           ip=request.remote_addr or "")
+    return jsonify({"saved": len(item_ids)})
 @bp.route("/api/db/disposition/<item_id>")
 def db_get_disposition(item_id):
     """Get the current disposition for an item."""
@ -146,15 +180,62 @@ def db_get_disposition(item_id):
@bp.route("/api/db/flagged")
def db_flagged_items():
"""Return flagged items from the most recent completed scan session.
"""Return flagged items for the results grid.
With ?ref=N, returns the items from that specific past scan session (history
mode). Without ref, returns every item still awaiting action across all
scans (the default landing view) not just the latest session window.
Used by the read-only viewer to load results without an active SSE connection.
Respects viewer_scope.role stored in the session for scoped tokens.
"""
if not DB_OK: return jsonify([])
items = _get_db().get_session_items()
from flask import session as _session
scope = _session.get("viewer_scope", {})
role_filt = scope.get("role", "") if isinstance(scope, dict) else ""
date_from = scope.get("valid_from", "") if isinstance(scope, dict) else ""
date_to = scope.get("valid_to", "") if isinstance(scope, dict) else ""
# user may be a list of emails (current) or a legacy single string
raw_user = scope.get("user", "") if isinstance(scope, dict) else ""
if isinstance(raw_user, list):
user_filt = set(e.lower() for e in raw_user if e)
else:
user_filt = {raw_user.lower()} if raw_user else set()
ref_scan_id = request.args.get("ref", type=int)
if ref_scan_id:
# History mode — a specific past session was requested.
items = _get_db().get_session_items(ref_scan_id=ref_scan_id)
else:
# Default landing / viewer — show every item still awaiting action,
# across all scans, not just the latest session window.
items = _get_db().get_open_items()
# Normalise JSON-encoded columns the same way scan_engine does for SSE cards
import json as _json
out = []
for row in items:
if role_filt and row.get("user_role", "") != role_filt:
continue
if user_filt and (row.get("account_id", "") or "").lower() not in user_filt:
continue
if date_from and (row.get("modified") or "") < date_from:
continue
if date_to and (row.get("modified") or "") > date_to:
continue
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None)
out.append(row)
return jsonify(out)
@bp.route("/api/db/related/<item_id>")
def db_related_items(item_id):
"""Return flagged items from the same session sharing at least one CPR hash."""
if not DB_OK:
return jsonify([])
ref = request.args.get("ref", type=int)
import json as _json
out = []
for row in _get_db().get_related_items(item_id, ref_scan_id=ref):
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None)
@ -217,10 +298,13 @@ def admin_pin_set():
     new_pin = data.get("new_pin", "").strip()
     if not new_pin:
         return jsonify({"error": "new_pin required"}), 400
-    if _admin_pin_is_set():
+    had_pin = _admin_pin_is_set()
+    if had_pin:
         if not _verify_admin_pin(data.get("current_pin", "")):
             return jsonify({"error": "incorrect_pin"}), 403
     _set_admin_pin(new_pin)
+    _audit("admin_pin_change" if had_pin else "admin_pin_set", "",
+           ip=request.remote_addr or "")
     return jsonify({"ok": True})
@ -286,6 +370,29 @@ def db_import():
return jsonify({"error": str(e)}), 500
def _excerpt_page(excerpt: str, item_meta: dict) -> str:
"""Minimal HTML page showing a stored body excerpt as a preview fallback."""
import html as _html
subject = _html.escape(item_meta.get("name", ""))
modified = item_meta.get("modified", "")
account = _html.escape(item_meta.get("account_name", ""))
body = "<pre style='white-space:pre-wrap;font-family:sans-serif;margin:0'>" + _html.escape(excerpt) + "</pre>"
note = "<p style='font-size:11px;color:#888;margin-top:12px'>Stored excerpt — connect to reload the full message.</p>"
return (
"<!DOCTYPE html><html><head><meta charset='utf-8'>"
"<style>body{font-family:-apple-system,sans-serif;font-size:13px;"
"padding:12px 16px;background:#fff;color:#111;word-break:break-word}"
".hdr{border-bottom:1px solid #eee;margin-bottom:12px;padding-bottom:10px}"
".hdr-row{color:#555;font-size:12px;margin-bottom:3px}"
".hdr-row b{color:#111}</style></head><body>"
f"<div class='hdr'>"
+ (f"<div class='hdr-row'><b>From:</b> {account}</div>" if account else "")
+ (f"<div class='hdr-row'><b>Date:</b> {_html.escape(modified)}</div>" if modified else "")
+ (f"<div class='hdr-row'><b>Subject:</b> {subject}</div>" if subject else "")
+ f"</div>{body}{note}</body></html>"
)
@bp.route("/api/preview/<item_id>")
def get_preview(item_id):
"""Return a preview URL or HTML for a flagged item."""
@ -478,14 +585,17 @@ def get_preview(item_id):
except Exception as e:
return jsonify({"error": str(e)})
-    if not state.connector:
-        return jsonify({"error": "not authenticated"}), 401
     item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), {})
     drive_id = item_meta.get("drive_id", "")
     try:
         if source_type == "email":
+            excerpt = item_meta.get("body_excerpt", "")
+            if not state.connector:
+                if excerpt:
+                    import html as _html
+                    return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
+                return jsonify({"error": "not authenticated"}), 401
             uid = account_id
             try:
                 msg = state.connector._get(
@ -493,6 +603,8 @@ def get_preview(item_id):
                     {"$select": "subject,from,receivedDateTime,body"}
                 )
             except Exception as e:
+                if excerpt:
+                    return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
                 return jsonify({"error": f"Could not load email: {e}"})
             sender = msg.get("from", {}).get("emailAddress", {})
@ -550,8 +662,51 @@ def get_preview(item_id):
</body></html>"""
return jsonify({"type": "html", "html": page})
elif source_type in ("gmail", "gdrive"):
item_url = item_meta.get("url", "")
name = item_meta.get("name", "")
if source_type == "gdrive" and item_url:
# Extract Drive file ID and use the embeddable /preview URL
import re as _re
m = _re.search(r"/file/d/([^/]+)", item_url)
if m:
fid = m.group(1)
return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"})
# Fallback: generic Drive embed
return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")})
# Gmail — not embeddable; show link card + stored body excerpt if available
icon = "✉️" if source_type == "gmail" else "☁️"
label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
excerpt = item_meta.get("body_excerpt", "")
link_html = (
f'<a href="{_html_esc(item_url)}" target="_blank" '
f'style="display:inline-block;margin-top:12px;padding:8px 16px;'
f'background:#3b7dd8;color:#fff;border-radius:6px;text-decoration:none;font-size:12px">'
f'{label}</a>'
) if item_url else ""
if excerpt and source_type == "gmail":
html_out = _excerpt_page(excerpt, item_meta)
if item_url:
# Inject the "Open in Gmail" link before </body>
html_out = html_out.replace(
"</body>",
f'<div style="margin-top:12px">{link_html}</div></body>'
)
else:
html_out = (
f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
f'<div style="font-size:40px">{icon}</div>'
f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
f'{link_html}'
f'</div>'
)
return jsonify({"type": "html", "html": html_out})
else:
# OneDrive / SharePoint / Teams — use Graph's embed preview API
if not state.connector:
return jsonify({"error": "not authenticated"}), 401
preview_url = None
errors = []


@ -5,6 +5,10 @@ from __future__ import annotations
 from flask import Blueprint, jsonify, request
 from routes import state
 from app_config import _load_smtp_config, _save_smtp_config
+try:
+    from gdpr_db import log_audit_event as _audit
+except ImportError:
+    def _audit(*a, **kw): pass  # type: ignore[misc]
 from routes.export import _build_excel_bytes
 bp = Blueprint("email", __name__)
@ -119,6 +123,7 @@ def smtp_config_save():
     if not data.get("password") and existing.get("password"):
         data["password"] = existing["password"]
     _save_smtp_config(data)
+    _audit("smtp_save", f"host={data.get('host','')!r}", ip=request.remote_addr or "")
     return jsonify({"status": "saved"})
@ -143,12 +148,15 @@ def smtp_test():
"</body></html>"
     )
-    # Try Graph API first
-    if state.connector and state.connector.is_authenticated():
+    # Try Graph API first — unless the user opted to always use SMTP. Graph
+    # returns 202 (queued) even for recipients Exchange later silently drops
+    # (e.g. a Google-hosted subdomain of the O365 domain), so SMTP is the only
+    # reliable path for those; prefer_smtp forces it.
+    prefer_smtp = bool(saved.get("prefer_smtp"))
+    if state.connector and state.connector.is_authenticated() and not prefer_smtp:
         try:
             _send_email_graph(subject, body_html, recipients)
-            return jsonify({"ok": True,
-                            "message": f"Test email sent via Microsoft Graph to {', '.join(recipients)}"})
+            return jsonify({"ok": True, "method": "graph", "recipients": recipients})
         except Exception as graph_err:
             graph_error_str = str(graph_err)
     else:
@ -164,6 +172,12 @@ def smtp_test():
     use_tls = bool(saved.get("use_tls", True)) and not use_ssl
     if not host:
+        if graph_error_str:
+            return jsonify({"error": (
+                f"Microsoft Graph email failed: {graph_error_str}\n\n"
+                "Make sure Mail.Send is added to your Azure app registration and admin consent has been granted:\n"
+                "Azure AD → App registrations → [your app] → API permissions → Add → Microsoft Graph → Mail.Send → Grant admin consent."
+            )}), 400
         return jsonify({"error": "No SMTP host configured. To send via Microsoft 365 Graph (no SMTP needed), add Mail.Send to your Azure app registration."}), 400
     try:
@ -187,8 +201,8 @@ def smtp_test():
if username and password:
server.login(username, password)
server.sendmail(from_addr, recipients, msg.as_string())
suffix = " (⚠ Graph also failed — Mail.Send permission not granted)" if graph_error_str else ""
return jsonify({"ok": True, "message": f"Test email sent via SMTP to {', '.join(recipients)}{suffix}"})
return jsonify({"ok": True, "method": "smtp", "recipients": recipients,
"graph_also_failed": bool(graph_error_str)})
except Exception as smtp_err:
err_str = str(smtp_err)
_h = host.lower()
@ -210,9 +224,31 @@ def smtp_test():
"(Users → Active users → [user] → Mail → Manage email apps → Authenticated SMTP), "
"or add Mail.Send to your Azure app to use Graph instead.")
elif (_personal_ms or _gmail_host) and _auth_err:
provider = "Microsoft" if _personal_ms else "Google"
url = "account.microsoft.com/security" if _personal_ms else "myaccount.google.com → Security → 2-Step Verification"
err_str = (f"Authentication failed — {provider} blocks regular passwords for SMTP when MFA is enabled.\n\n"
if _gmail_host:
_gws_account = "@gmail.com" not in username.lower() and "@googlemail.com" not in username.lower()
if _gws_account:
err_str = ("Google Workspace SMTP authentication failed.\n\n"
"Your account uses a custom domain via Google Workspace. "
"SMTP access is controlled by your organisation's Google Workspace admin, not your personal account settings.\n\n"
"Ask your Google Workspace admin to:\n"
" • Enable 2-Step Verification for your account (required for App Passwords)\n"
" • Allow users to manage their own App Passwords (Admin console → Security → 2-Step Verification)\n"
" • Or configure SMTP relay: Admin console → Apps → Google Workspace → Gmail → Routing → SMTP relay service\n\n"
"If App Passwords are available for your account, generate one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords "
"and use it instead of your normal password.")
else:
err_str = ("Gmail SMTP authentication failed.\n\n"
"Google requires an App Password for SMTP — your normal password will not work.\n\n"
"If you are already using an App Password, check:\n"
" • No spaces — the 16-character code must be entered without spaces\n"
" • The App Password has not been revoked — generate a new one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords\n"
" • The correct username (your full Gmail address, e.g. you@gmail.com)\n"
" • Port 587 with STARTTLS, or port 465 with SSL")
else:
url = "account.microsoft.com/security"
err_str = (f"Authentication failed — Microsoft blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url} → App passwords "
f"and use that instead of your normal password.")
elif graph_error_str:
@ -253,8 +289,8 @@ def send_report():
"</body></html>"
     )
-    # Try Graph API first
-    if state.connector and state.connector.is_authenticated():
+    # Try Graph API first — unless prefer_smtp is set (see smtp_test for why).
+    if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
         try:
             _send_email_graph(subject, body_html, recipients,
                               attachment_bytes=xl_bytes, attachment_name=fname)
@ -295,9 +331,32 @@ def send_report():
err = (f"{err}\n\nTip: Enable SMTP AUTH for this mailbox in the Microsoft 365 admin centre, "
"or connect to M365 first so the scanner can send via Microsoft Graph instead.")
elif (_personal_ms_2 or _gmail_2) and _auth_err_2:
provider2 = "Microsoft" if _personal_ms_2 else "Google"
url2 = "account.microsoft.com/security" if _personal_ms_2 else "myaccount.google.com → Security → 2-Step Verification"
err = (f"Authentication failed — {provider2} blocks regular passwords for SMTP when MFA is enabled.\n\n"
if _gmail_2:
_uname2 = smtp_cfg.get("username", "").lower()
_gws2 = "@gmail.com" not in _uname2 and "@googlemail.com" not in _uname2
if _gws2:
err = ("Google Workspace SMTP authentication failed.\n\n"
"Your account uses a custom domain via Google Workspace. "
"SMTP access is controlled by your organisation's Google Workspace admin, not your personal account settings.\n\n"
"Ask your Google Workspace admin to:\n"
" • Enable 2-Step Verification for your account (required for App Passwords)\n"
" • Allow users to manage their own App Passwords (Admin console → Security → 2-Step Verification)\n"
" • Or configure SMTP relay: Admin console → Apps → Google Workspace → Gmail → Routing → SMTP relay service\n\n"
"If App Passwords are available for your account, generate one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords "
"and use it instead of your normal password.")
else:
err = ("Gmail SMTP authentication failed.\n\n"
"Google requires an App Password for SMTP — your normal password will not work.\n\n"
"If you are already using an App Password, check:\n"
" • No spaces — the 16-character code must be entered without spaces\n"
" • The App Password has not been revoked — generate a new one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords\n"
" • The correct username (your full Gmail address, e.g. you@gmail.com)\n"
" • Port 587 with STARTTLS, or port 465 with SSL")
else:
url2 = "account.microsoft.com/security"
err = (f"Authentication failed — Microsoft blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url2} → App passwords "
f"and use that instead of your normal password.")
return jsonify({"error": err}), 500


@ -9,11 +9,12 @@ from routes import state
 from app_config import _GUID_RE, _resolve_display_name
 try:
-    from gdpr_db import get_db as _get_db
+    from gdpr_db import get_db as _get_db, log_audit_event as _audit
     DB_OK = True
 except ImportError:
     DB_OK = False
     def _get_db(*a, **kw): return None  # type: ignore[misc]
+    def _audit(*a, **kw): pass  # type: ignore[misc]
 try:
     from m365_connector import M365PermissionError
try:
from m365_connector import M365PermissionError
@ -24,9 +25,10 @@ bp = Blueprint("export", __name__)
 logger = logging.getLogger(__name__)
-def _build_excel_bytes() -> tuple[bytes, str]:
+def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
     """Build the M365 scan Excel workbook and return (bytes, filename).
-    Raises on error. Used by export_excel() and send_report()."""
+    Raises on error. Used by export_excel() and send_report().
+    role: '' = all, 'student' = students only, 'staff' = staff + other."""
     from openpyxl import Workbook
     from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
     from openpyxl.utils import get_column_letter
@ -43,6 +45,7 @@ def _build_excel_bytes() -> tuple[bytes, str]:
"gdrive": ("💾 Google Drive", "D5F5E3"),
"local": ("📁 Local", "E6F7E6"),
"smb": ("🌐 Network", "E0F0FA"),
"sftp": ("🔒 SFTP", "EDE9F7"),
}
COLS = [
("Name / Subject", 45),
@ -131,11 +134,20 @@ def _build_excel_bytes() -> tuple[bytes, str]:
ws.auto_filter.ref = f"A1:{get_column_letter(len(COLS))}1"
+    # Apply role filter — '' means all roles
+    if role == "student":
+        _items = [i for i in state.flagged_items if i.get("user_role") == "student"]
+    elif role == "staff":
+        _items = [i for i in state.flagged_items if i.get("user_role") != "student"]
+    else:
+        _items = list(state.flagged_items)
     wb = Workbook()
     ws_sum = wb.active
     ws_sum.title = "Summary"
     ws_sum.sheet_properties.tabColor = "1F3864"
-    ws_sum["A1"] = "GDPRScanner — Export"
+    _role_label = {"student": " — Elever", "staff": " — Ansatte"}.get(role, "")
+    ws_sum["A1"] = f"GDPRScanner — Export{_role_label}"
     ws_sum["A1"].font = Font(name="Arial", bold=True, size=14, color=HEADER_FG)
     ws_sum["A1"].fill = _fill(HEADER_BG)
     ws_sum.merge_cells("A1:D1")
@ -146,8 +158,8 @@ def _build_excel_bytes() -> tuple[bytes, str]:
ws_sum["A2"] = "Generated:"
ws_sum["B2"] = _dt.datetime.now().strftime("%Y-%m-%d %H:%M")
ws_sum["A3"] = "Total flagged items:"
ws_sum["B3"] = len(state.flagged_items)
gps_count = sum(1 for i in state.flagged_items if (i.get("exif") or {}).get("gps"))
ws_sum["B3"] = len(_items)
gps_count = sum(1 for i in _items if (i.get("exif") or {}).get("gps"))
if gps_count:
ws_sum["A4"] = "Items with GPS data:"
ws_sum["B4"] = gps_count
@ -168,14 +180,26 @@ def _build_excel_bytes() -> tuple[bytes, str]:
ws_sum.column_dimensions["C"].width = 16
by_source: dict = {}
for item in state.flagged_items:
for item in _items:
by_source.setdefault(item.get("source_type", "other"), []).append(item)
# Determine which sources were actually scanned (even if they found nothing)
scanned_sources: set = set()
if DB_OK:
try:
_db_tmp = _get_db()
if _db_tmp:
scanned_sources = _db_tmp.get_session_sources()
except Exception:
pass
# Fall back: treat any source that has items as scanned
scanned_sources |= set(by_source.keys())
sum_row = 7
for src_key, (label, tab_bg) in SOURCE_MAP.items():
items = by_source.get(src_key, [])
if not items:
if src_key not in scanned_sources:
continue
items = by_source.get(src_key, [])
ws_sum.cell(row=sum_row, column=1, value=label).font = Font(name="Arial", size=10)
ws_sum.cell(row=sum_row, column=2, value=len(items)).font = Font(name="Arial", size=10)
ws_sum.cell(row=sum_row, column=3, value=sum(i.get("cpr_count", 0) for i in items)).font = Font(name="Arial", size=10)
@ -192,7 +216,7 @@ def _build_excel_bytes() -> tuple[bytes, str]:
_write_sheet(wb.create_sheet(title=clean_label), items, tab_bg)
# GPS items sheet
gps_items = [i for i in state.flagged_items if (i.get("exif") or {}).get("gps")]
gps_items = [i for i in _items if (i.get("exif") or {}).get("gps")]
if gps_items:
ws_gps = wb.create_sheet(title="GPS locations")
ws_gps.sheet_properties.tabColor = "1A7A6E"
@ -230,7 +254,7 @@ def _build_excel_bytes() -> tuple[bytes, str]:
ws_gps.auto_filter.ref = f"A1:{get_column_letter(len(GPS_COLS))}1"
# External transfers sheet
ext_items = [i for i in state.flagged_items
ext_items = [i for i in _items
if i.get("transfer_risk") in ("external-recipient", "external-share", "shared")]
if ext_items:
ws_ext = wb.create_sheet(title="External transfers")
@ -246,8 +270,11 @@ def _build_excel_bytes() -> tuple[bytes, str]:
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)
fname = f"m365_scan_{_dt.datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
_role_suffix = {"student": "_elever", "staff": "_ansatte"}.get(role, "")
fname = f"m365_scan{_role_suffix}_{_dt.datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
return buf.read(), fname
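The role filter that this hunk threads through `_build_excel_bytes` can be read in isolation. The sketch below is a minimal standalone version, assuming items are plain dicts carrying a `user_role` key as in the diff; the helper names `filter_by_role` and `export_filename` are hypothetical and do not exist in the repository.

```python
from datetime import datetime

# Filename suffixes as they appear in the diff (Danish: elever = students, ansatte = staff).
ROLE_SUFFIX = {"student": "_elever", "staff": "_ansatte"}


def filter_by_role(items, role=""):
    """Return the items that belong to the requested role ('' means all)."""
    if role == "student":
        return [i for i in items if i.get("user_role") == "student"]
    if role == "staff":
        # "staff" is the complement of "student", so items without a role land here too.
        return [i for i in items if i.get("user_role") != "student"]
    return list(items)


def export_filename(role="", now=None):
    """Build the workbook filename with an optional role suffix."""
    now = now or datetime.now()
    return f"m365_scan{ROLE_SUFFIX.get(role, '')}_{now.strftime('%Y%m%d_%H%M%S')}.xlsx"
```

Note that an unrecognised `role` value falls through to the "all" case in both helpers, which matches the behaviour of the `else` branch and the `.get(role, "")` lookup in the diff.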
@bp.route("/api/export_excel")
def export_excel():
"""Export flagged items as an Excel workbook with per-source tabs."""
@ -263,8 +290,9 @@ def export_excel():
state.flagged_items[:] = db_items
except Exception:
pass
role = request.args.get("role", "")
try:
xl_bytes, fname = _build_excel_bytes()
xl_bytes, fname = _build_excel_bytes(role=role)
return Response(
xl_bytes,
mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@ -280,9 +308,10 @@ def export_excel():
# ── Article 30 report ─────────────────────────────────────────────────────────
def _build_article30_docx() -> tuple[bytes, str]:
def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
"""Generate a GDPR Article 30 Register of Processing Activities as .docx.
Returns (bytes, filename). Strings are translated using the active state.LANG dict."""
Returns (bytes, filename). Strings are translated using the active state.LANG dict.
role: '' = all, 'student' = students only, 'staff' = staff + other."""
try:
from docx import Document as _Document
from docx.shared import Pt, RGBColor, Inches, Cm
@ -302,6 +331,10 @@ def _build_article30_docx() -> tuple[bytes, str]:
db = _get_db() if DB_OK else None
stats = db.get_stats() if db else {}
items = db.get_session_items() if db else list(state.flagged_items)
if role == "student":
items = [i for i in items if i.get("user_role") == "student"]
elif role == "staff":
items = [i for i in items if i.get("user_role") != "student"]
trend = db.get_trend(10) if db else []
overdue = db.get_overdue_items(5) if db else []
@ -345,7 +378,8 @@ def _build_article30_docx() -> tuple[bytes, str]:
now_str = _dt.datetime.now().strftime("%Y-%m-%d %H:%M")
date_str = _dt.datetime.now().strftime("%Y-%m-%d")
fname = f"article30_{date_str}.docx"
_role_suffix = {"student": "_elever", "staff": "_ansatte"}.get(role, "")
fname = f"article30{_role_suffix}_{date_str}.docx"
# Aggregate by source
by_source: dict = {}
@ -353,6 +387,15 @@ def _build_article30_docx() -> tuple[bytes, str]:
st = item.get("source_type", "other")
by_source.setdefault(st, []).append(item)
# Determine which sources were actually scanned (may be empty-hit)
scanned_sources: set = set()
if db:
try:
scanned_sources = db.get_session_sources()
except Exception:
pass
scanned_sources |= set(by_source.keys())
SOURCE_LABELS = {
"email": "Exchange (Outlook)",
"onedrive": "OneDrive",
@ -362,6 +405,7 @@ def _build_article30_docx() -> tuple[bytes, str]:
"gdrive": "Google Drive",
"local": "Local files",
"smb": "Network / SMB",
"sftp": "SFTP",
}
# ── Colour palette ────────────────────────────────────────────────────────
@ -556,10 +600,10 @@ def _build_article30_docx() -> tuple[bytes, str]:
r = p.add_run(txt); r.bold = True
r.font.size = Pt(10); r.font.color.rgb = WHITE
for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb"):
src_items = by_source.get(src_key, [])
if not src_items:
for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb", "sftp"):
if src_key not in scanned_sources:
continue
src_items = by_source.get(src_key, [])
row = src_tbl.add_row().cells
n_ov = sum(1 for i in src_items if i.get("id") in overdue_ids)
n_cpr = sum(i.get("cpr_count", 0) for i in src_items)
@ -1100,7 +1144,8 @@ def export_article30():
if not state.flagged_items:
return jsonify({"error": "No results to export — run a scan first"}), 400
try:
docx_bytes, fname = _build_article30_docx()
role = request.args.get("role", "")
docx_bytes, fname = _build_article30_docx(role=role)
return Response(
docx_bytes,
mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -1114,6 +1159,7 @@ def export_article30():
return jsonify({"error": str(e)}), 500
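Both report builders now list a source when it was scanned, even if it produced no hits, instead of listing only sources that have items. A minimal sketch of that selection logic follows, assuming `session_sources` stands in for whatever `get_session_sources()` returns (its exact return type is not shown in the diff); the function name `summary_rows` is hypothetical.

```python
# Source order taken from the Article 30 loop in the diff.
SOURCE_ORDER = ("email", "onedrive", "sharepoint", "teams",
                "gmail", "gdrive", "local", "smb", "sftp")


def summary_rows(items, session_sources=()):
    """Return (source, item_count, cpr_total) rows for every scanned source."""
    by_source = {}
    for item in items:
        by_source.setdefault(item.get("source_type", "other"), []).append(item)

    # Fallback from the diff: any source that has items counts as scanned.
    scanned = set(session_sources) | set(by_source)

    rows = []
    for key in SOURCE_ORDER:
        if key not in scanned:
            continue
        src_items = by_source.get(key, [])  # may be empty for a zero-hit source
        rows.append((key, len(src_items),
                     sum(i.get("cpr_count", 0) for i in src_items)))
    return rows
```

With this shape, a source that was scanned but found nothing still yields a row with zero counts, which appears to be the point of the change.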
@bp.route("/api/delete_item", methods=["POST"])
def delete_item():
"""Delete a single flagged item. Returns {ok, error}."""
if not state.connector:
@ -1146,6 +1192,9 @@ def delete_item():
reason="manual")
_db.delete_item_record(item_id)
except Exception: pass
_audit("item_delete",
f"id={item_id!r} name={item_meta.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True})
return jsonify({"ok": False, "error": "Delete returned unexpected result"})
except M365PermissionError:
@ -1156,6 +1205,502 @@ def delete_item():
return jsonify({"ok": False, "error": str(e)})
_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt", ".pdf"}
_M365_CLOUD_TYPES = {"onedrive", "sharepoint", "teams"}
_GDRIVE_MIME_MAP = {
".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
".pdf": "application/pdf",
}
_ALL_REDACTABLE_TYPES = {"local", "smb", "sftp", "gdrive"} | _M365_CLOUD_TYPES
@bp.route("/api/redact_item", methods=["POST"])
def redact_item():
"""Redact CPR numbers in-place in a local, SMB, SFTP, M365, or Google Drive file."""
from pathlib import Path as _Path
import tempfile as _tempfile
import shutil as _shutil
data = request.get_json() or {}
item_id = data.get("id", "")
if not item_id:
return jsonify({"ok": False, "error": "id required"}), 400
# Resolve item meta: in-memory first (active scan), then DB (history)
item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None)
if item_meta is None:
_db = _get_db() if DB_OK else None
if _db:
row = _db._connect().execute(
"SELECT * FROM flagged_items WHERE id=? LIMIT 1", (item_id,)
).fetchone()
item_meta = dict(row) if row else {}
else:
item_meta = {}
source_type = item_meta.get("source_type", "")
is_m365_cloud = source_type in _M365_CLOUD_TYPES
if source_type not in _ALL_REDACTABLE_TYPES:
return jsonify({"ok": False, "error": "Redaction is only supported for local, SMB, SFTP, M365, and Google Drive files"}), 400
# --- local path branch ---
if source_type == "local":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400
path = _Path(full_path).expanduser()
if not path.exists():
return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404
ext = path.suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
tmp_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp:
tmp_path = _Path(tmp.name)
if ext == ".docx":
results = scan_docx(path)
redacted = redact_docx(path, tmp_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(path)
redacted = redact_xlsx(path, tmp_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(path, tmp_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(path)
redacted = redact_pdf_secure(path, tmp_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
else: # .txt
text = path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
tmp_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_shutil.move(str(tmp_path), str(path))
tmp_path = None
except Exception as exc:
if tmp_path and tmp_path.exists():
try:
tmp_path.unlink()
except Exception:
pass
logger.exception("[redact] local file error")
return jsonify({"ok": False, "error": str(exc)}), 500
# --- M365 cloud branch (OneDrive / SharePoint / Teams) ---
elif is_m365_cloud:
conn = state.connector
if conn is None:
return jsonify({"ok": False, "error": "M365 not connected — cannot redact cloud files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _REDACT_EXTS - {".csv", ".txt"}:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} cloud files. Supported: DOCX, XLSX, PDF"}), 400
drive_id = item_meta.get("drive_id") or item_meta.get("_drive_id", "")
account_id = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = None
try:
# Download
if drive_id:
raw = conn.download_sharepoint_item(drive_id, item_id)
elif account_id and account_id != "me":
raw = conn.download_drive_item_for(account_id, item_id)
else:
raw = conn.download_drive_item(item_id)
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
# Upload redacted bytes back
redacted_bytes = out_path.read_bytes()
conn.put_drive_item_content(drive_id, item_id, redacted_bytes, user_id=account_id)
del redacted_bytes
except Exception as exc:
logger.exception("[redact] cloud file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for p in ("tmp_path", "out_path"):
_p = locals().get(p)
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- Google Drive branch ---
elif source_type == "gdrive":
gconn = state.google_connector
if gconn is None:
return jsonify({"ok": False, "error": "Google not connected — cannot redact Drive files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _GDRIVE_MIME_MAP:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} Drive files. Supported: DOCX, XLSX, PDF"}), 400
# item_id is "gdrive:{file_id}"
gfile_id = item_id[len("gdrive:"):] if item_id.startswith("gdrive:") else item_id
user_email = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = out_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
from google_connector import GoogleError as _GoogleError
# Refuse Google-native formats (Docs/Sheets exported as DOCX)
try:
mime = gconn.get_drive_file_mime(user_email, gfile_id)
except Exception as exc:
return jsonify({"ok": False, "error": f"Could not read Drive file info: {exc}"}), 500
if mime.startswith("application/vnd.google-apps."):
return jsonify({"ok": False, "error": (
"Cannot redact a Google Docs/Sheets/Slides file in-place. "
"Export it as DOCX/XLSX/PDF first, then redact the exported copy."
)}), 400
raw = gconn.download_drive_file_by_id(user_email, gfile_id)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
redacted_bytes = out_path.read_bytes()
gconn.update_drive_file(user_email, gfile_id, redacted_bytes, _GDRIVE_MIME_MAP[ext])
del redacted_bytes
except Exception as exc:
logger.exception("[redact] gdrive file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SFTP branch ---
elif source_type == "sftp":
full_path = item_meta.get("full_path", "")
source_uri = item_meta.get("account_name", "") # sftp://user@host/root_path
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SFTP redaction"}), 400
if not source_uri:
return jsonify({"ok": False, "error": "SFTP source info not in memory — rescan and redact in the same session"}), 400
ext = _Path(full_path).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse sftp://user@host/root to find matching source config
try:
from urllib.parse import urlparse as _urlparse
_u = _urlparse(source_uri)
_sftp_host = _u.hostname or ""
_sftp_user = _u.username or ""
except Exception:
_sftp_host = _sftp_user = ""
from app_config import _load_file_sources, _resolve_sftp_credentials
_sftp_source = next(
(s for s in _load_file_sources()
if s.get("source_type") == "sftp"
and s.get("sftp_host", "") == _sftp_host
and s.get("sftp_user", "") == _sftp_user),
None,
)
if _sftp_source is None:
return jsonify({"ok": False, "error": f"SFTP source config not found for {_sftp_host} — rescan to enable redaction"}), 400
_sftp_source = _resolve_sftp_credentials(_sftp_source)
tmp_path = out_path = None
try:
from sftp_connector import SFTPScanner as _SFTPScanner
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
_sftp = _SFTPScanner(
host=_sftp_source.get("sftp_host", ""),
root_path=_sftp_source.get("path", "/"),
username=_sftp_source.get("sftp_user", ""),
port=int(_sftp_source.get("sftp_port", 22)),
auth_type=_sftp_source.get("sftp_auth", "password"),
password=_sftp_source.get("sftp_password") or None,
key_path=_sftp_source.get("sftp_key_path") or None,
passphrase=_sftp_source.get("sftp_passphrase") or None,
)
raw = _sftp.read_file(full_path)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_sftp.write_file(full_path, out_path.read_bytes())
except Exception as exc:
logger.exception("[redact] sftp file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SMB branch ---
elif source_type == "smb":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SMB redaction"}), 400
ext = _Path(full_path.replace("\\", "/").split("/")[-1]).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse //host/share/... to find matching source config
_norm = full_path.replace("\\", "/").lstrip("/")
_parts = _norm.split("/", 2)
_smb_host_fp = _parts[0] if len(_parts) > 0 else ""
from app_config import _load_file_sources
from file_scanner import get_smb_password as _get_smb_pw
_smb_source = next(
(s for s in _load_file_sources()
if s.get("source_type", "smb") in ("smb", "")
and (s.get("smb_host", "") == _smb_host_fp
or s.get("path", "").replace("\\", "/").lstrip("/").split("/")[0] == _smb_host_fp)),
None,
)
if _smb_source is None:
return jsonify({"ok": False, "error": f"SMB source config not found for {_smb_host_fp}"}), 400
_smb_user = _smb_source.get("smb_user", "")
_smb_domain = _smb_source.get("smb_domain", "")
_smb_kc = _smb_source.get("keychain_key") or None
_smb_pw = _smb_source.get("smb_password") or _get_smb_pw(_smb_host_fp, _smb_user, _smb_kc) or ""
tmp_path = out_path = None
try:
from file_scanner import write_smb_file as _write_smb
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
# Download current content
from file_scanner import _smb_read_file as _smb_read, SMB_OK as _SMB_OK
if not _SMB_OK:
raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
import uuid as _uuid
from smbprotocol.connection import Connection as _SmbConn
from smbprotocol.session import Session as _SmbSession
from smbprotocol.tree import TreeConnect as _SmbTree
_norm2 = full_path.replace("\\", "/").lstrip("/")
_fp = _norm2.split("/", 2)
_fhost = _fp[0]; _fshare = _fp[1] if len(_fp) > 1 else ""
_frel = (_fp[2].replace("/", "\\")) if len(_fp) > 2 else ""
_smb_conn = _SmbConn(_uuid.uuid4(), _fhost, 445)
_smb_conn.connect(timeout=30)
try:
_smb_sess = _SmbSession(_smb_conn,
username=f"{_smb_domain}\\{_smb_user}" if _smb_domain else _smb_user,
password=_smb_pw, require_encryption=False)
_smb_sess.connect()
try:
_smb_tree = _SmbTree(_smb_sess, f"\\\\{_fhost}\\{_fshare}")
_smb_tree.connect()
try:
raw = _smb_read(_smb_tree, _frel)
finally:
_smb_tree.disconnect()
finally:
_smb_sess.disconnect()
finally:
_smb_conn.disconnect()
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_write_smb(full_path, out_path.read_bytes(), _smb_user, _smb_pw, _smb_domain)
except Exception as exc:
logger.exception("[redact] smb file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- shared: remove from grid + DB ---
state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id]
_db = _get_db() if DB_OK else None
if _db:
try:
_db.log_deletion(item_meta, reason="redacted")
_db.delete_item_record(item_id)
except Exception:
pass
_audit("item_redact",
f"id={item_id!r} name={item_meta.get('name','')!r} spans={redacted}",
ip=request.remote_addr or "")
logger.info("[redact] %s: %d CPR span(s) redacted", item_meta.get('name', item_id), redacted)
return jsonify({"ok": True, "redacted": redacted})
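The `.txt` branches of `redact_item` all apply the same span-replacement step. The sketch below isolates it, assuming spans are `(start, end, label)` tuples as yielded by `find_pii_spans_in_text` in the diff; the helper name `redact_spans` and the `mask` parameter are hypothetical additions for illustration. With the default empty mask it reproduces what the extracted code does, which is to drop the matched characters.

```python
def redact_spans(text, spans, mask=""):
    """Replace every character inside each (start, end, label) span with `mask`."""
    chars = list(text)
    # Walking from the last span to the first mirrors the original code.
    # Each slot is replaced one-for-one, so offsets stay valid either way.
    for start, end, _label in sorted(spans, reverse=True):
        chars[start:end] = [mask] * (end - start)
    return "".join(chars)
```

Passing a visible mask such as `"#"` keeps the redacted text the same length as the input, which may be preferable when the layout of the file matters.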
@bp.route("/api/delete_bulk", methods=["POST"])
def delete_bulk():
"""Delete multiple items matching criteria. Streams progress as SSE."""
@ -1215,6 +1760,7 @@ def delete_bulk():
return jsonify({
"ok": True,
"deleted": len(deleted_ids),
"deleted_ids": deleted_ids, # so the grid can mark exactly these
"failed": len(failed_items),
"errors": failed_items[:10], # cap error list
})


@ -140,6 +140,16 @@ def _run_google_scan(options: dict):
max_file_mb = float(scan_opts.get("max_file_mb", 50.0))
scan_body = bool(scan_opts.get("scan_body", True))
scan_att = bool(scan_opts.get("scan_attachments", True))
delta_enabled = bool(scan_opts.get("delta", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
from checkpoint import (_load_delta_tokens, _save_delta_tokens,
_save_checkpoint, _load_checkpoint, _clear_checkpoint)
_drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
_new_drive_tokens: dict = {}
# Resolve users: explicit list → Admin SDK → fall back to SA email itself
_user_role_map: dict = {} # email → role
@ -188,14 +198,45 @@ def _run_google_scan(options: dict):
except Exception as e:
logger.error("[google_scan] begin_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted Google scan ────────────
import hashlib as _hl, json as _js
_gck_prefix = "google"
_gck_key = _hl.sha256(_js.dumps({
"emails": sorted(user_emails),
"sources": sorted(sources),
"older_than_days": scan_opts.get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
_gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
_g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
_google_flagged: list = [] # items found by this Google scan (for checkpoint)
_gck_resumed = len(_g_scanned_ids)
if _gck:
from scan_engine import _with_disposition as _wd_ck
_google_flagged = list(_gck.get("flagged", []))
flagged_items.extend(_google_flagged)
broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
for _card in _google_flagged:
broadcast("scan_file_flagged", _wd_ck(_card, _db))
_GCHECKPOINT_SAVE_EVERY = 25
_g_items_since_save = 0
total_flagged = 0
total_scanned = 0
t_start = _time.monotonic()
def _check_abort():
from gdpr_scanner import _scan_abort as _sa
if _sa.is_set():
broadcast("scan_cancelled", {"completed": total_scanned})
if _scan_abort.is_set():
# Emit google_scan_done (not scan_cancelled) so that the frontend
# google_scan_done handler can decide whether to close the SSE based
# on whether other scan types (M365, file) are still running.
# scan_cancelled would unconditionally close the SSE connection,
# dropping events from a concurrently running new scan.
broadcast("google_scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
"elapsed_seconds": round(_time.monotonic() - t_start, 1),
"cancelled": True,
})
return True
return False
@ -207,6 +248,8 @@ def _run_google_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -223,8 +266,10 @@ def _run_google_scan(options: dict):
"special_category": [],
"face_count": 0,
"exif": {},
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
flagged_items.append(card)
_google_flagged.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db))
total_flagged += 1
if _db and _db_scan_id:
@ -256,6 +301,10 @@ def _run_google_scan(options: dict):
):
if _check_abort():
return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", {
@ -267,14 +316,33 @@ def _run_google_scan(options: dict):
})
try:
meta["_account"] = _display_name
result = _scan_bytes(data, meta.get("name", "msg.txt"))
meta["_source_type"] = "gmail"
# Extract a plain-text excerpt before scanning (body is discarded after)
try:
import re as _re
_raw = data[:3000].decode("utf-8", errors="replace")
_plain = _re.sub(r"<[^>]+>", " ", _raw)
meta["_body_excerpt"] = " ".join(_plain.split())[:500]
except Exception:
meta["_body_excerpt"] = ""
result = _scan_bytes(data, meta.get("name", "msg.txt"), lang=ocr_lang)
except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
if cprs or (pii_counts and any(pii_counts.values())):
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e:
broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
except Exception as e:
@ -283,14 +351,45 @@ def _run_google_scan(options: dict):
# ── Google Drive ──────────────────────────────────────────────────────
if "gdrive" in sources:
try:
delta_key = f"gdrive:{user_email}"
saved_token = _drive_delta_tokens.get(delta_key) if delta_enabled else None
if delta_enabled and saved_token:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta)"})
try:
drive_items, new_token = conn.get_drive_changes(
user_email, saved_token,
max_files=max_files, max_file_mb=max_file_mb,
)
_new_drive_tokens[delta_key] = new_token
except Exception as delta_err:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
# Record start token BEFORE iterating so the next delta starts from here
try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception:
pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
else:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
for meta, data in conn.iter_drive_files(
user_email,
max_files=max_files,
max_file_mb=max_file_mb,
):
# Record start token BEFORE iterating so the next delta starts from here
if delta_enabled:
try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception:
pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
for meta, data in drive_items:
if _check_abort():
return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", {
@ -302,27 +401,50 @@ def _run_google_scan(options: dict):
})
try:
meta["_account"] = _display_name
result = _scan_bytes(data, meta.get("name", "file"))
meta["_source_type"] = "gdrive"
result = _scan_bytes(data, meta.get("name", "file"), lang=ocr_lang)
except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
if cprs or (pii_counts and any(pii_counts.values())):
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
except Exception as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
if delta_enabled and _new_drive_tokens:
try:
current_tokens = _load_delta_tokens()
_save_delta_tokens({**current_tokens, **_new_drive_tokens})
except Exception as e:
logger.warning("[gdrive delta] token save failed: %s", e)
if not _scan_abort.is_set():
_clear_checkpoint(prefix=_gck_prefix)
elapsed = _time.monotonic() - t_start
broadcast("scan_done", {
broadcast("google_scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
"elapsed_seconds": round(elapsed, 1),
"delta": delta_enabled and bool(_new_drive_tokens),
"delta_sources": len(_new_drive_tokens),
})
if _db and _db_scan_id:
try:
_db.end_scan(_db_scan_id, total_scanned, total_flagged)
_db.finish_scan(_db_scan_id, total_scanned)
except Exception:
pass


@ -4,6 +4,10 @@ Scan profiles
from __future__ import annotations
from flask import Blueprint, jsonify, request
from app_config import _profiles_load, _profile_save, _profile_delete, _profile_get
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("profiles", __name__)
@ -21,6 +25,8 @@ def profiles_save():
if not profile.get("name"):
return jsonify({"error": "name required"}), 400
saved = _profile_save(profile)
_audit("profile_save", f"name={profile.get('name')!r}",
ip=request.remote_addr or "")
return jsonify({"status": "saved", "profile": saved})
@ -32,6 +38,8 @@ def profiles_delete():
if not key:
return jsonify({"error": "name or id required"}), 400
ok = _profile_delete(key)
if ok:
_audit("profile_delete", f"key={key!r}", ip=request.remote_addr or "")
return jsonify({"status": "deleted" if ok else "not_found"})
@ -43,5 +51,3 @@ def profiles_get():
if not p:
return jsonify({"error": "not found"}), 404
return jsonify({"profile": p})


@ -3,18 +3,70 @@ Scan stream, start/stop, checkpoint, settings, delta
"""
from __future__ import annotations
import threading
import logging
from flask import Blueprint, jsonify, request
from routes import state
from app_config import (
_save_settings, _load_settings,
_load_src_toggles, _save_src_toggles,
_load_smtp_config,
)
from checkpoint import (
_checkpoint_key, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _DELTA_PATH,
_load_delta_tokens, _DELTA_PATH, _cp_path,
)
bp = Blueprint("scan", __name__)
_log = logging.getLogger(__name__)
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
def _maybe_send_auto_email():
"""Send the scan report email after a manual scan if auto_email_manual is enabled."""
try:
smtp_cfg = _load_smtp_config()
if not smtp_cfg.get("auto_email_manual"):
return
if not state.flagged_items:
return
recipients = smtp_cfg.get("recipients", [])
if isinstance(recipients, str):
recipients = [r.strip() for r in recipients.replace(";", ",").split(",") if r.strip()]
if not recipients:
return
from routes.export import _build_excel_bytes
from routes.email import _send_report_email, _send_email_graph
import datetime as _dt
xl_bytes, fname = _build_excel_bytes()
subject = f"GDPR Scanner — scan report {_dt.datetime.now().strftime('%Y-%m-%d')}"
body_html = (
"<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>"
"<h2 style='color:#1F3864'>☁️ GDPR Scanner — scan report</h2>"
f"<p>Please find the latest scan report attached ({fname}).</p>"
f"<p style='color:#888;font-size:12px'>Generated: {_dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}<br>"
f"Items flagged: {len(state.flagged_items)}</p>"
"</body></html>"
)
if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
try:
_send_email_graph(subject, body_html, recipients,
attachment_bytes=xl_bytes, attachment_name=fname)
_log.info("[auto-email] report sent via Graph to %s", recipients)
return
except Exception as e:
_log.warning("[auto-email] Graph failed, trying SMTP: %s", e)
_send_report_email(xl_bytes, fname, smtp_cfg, recipients)
_log.info("[auto-email] report sent via SMTP to %s", recipients)
except Exception as e:
_log.error("[auto-email] failed: %s", e)
@bp.route("/api/scan/status")
@ -24,8 +76,12 @@ def scan_status():
acquired = state._scan_lock.acquire(blocking=False)
if acquired:
state._scan_lock.release()
g_acquired = state._google_scan_lock.acquire(blocking=False)
if g_acquired:
state._google_scan_lock.release()
return jsonify({
"running": not acquired,
"running": not acquired, # M365 + file scan lock
"google_running": not g_acquired, # Google scan lock (separate)
"scan_id": _sse_mod._current_scan_id or None,
})
@ -57,15 +113,21 @@ def scan_start():
from scan_engine import run_scan
try:
run_scan(options)
_maybe_send_auto_email()
finally:
state._scan_lock.release()
threading.Thread(target=_run, daemon=True).start()
_audit("scan_start",
f"sources={options.get('sources',[])} profile_id={profile_id!r}",
ip=request.remote_addr or "")
return jsonify({"status": "started"})
@bp.route("/api/scan/stop", methods=["POST"])
def scan_stop():
state._scan_abort.set()
state._google_scan_abort.set()
_audit("scan_stop", "", ip=request.remote_addr or "")
return jsonify({"status": "stopping"})
@ -73,28 +135,80 @@ def scan_stop():
def scan_checkpoint_info():
"""Return info about any saved checkpoint for the given scan options.
If check_only=true, just reports whether a scan is currently running."""
import hashlib, json as _json
options = request.get_json() or {}
if options.get("check_only"):
acquired = state._scan_lock.acquire(blocking=False)
if acquired:
state._scan_lock.release()
return jsonify({"running": not acquired})
engines = {}
# M365
if options.get("sources"):
key = _checkpoint_key(options)
cp = _load_checkpoint(key)
if not cp:
return jsonify({"exists": False})
return jsonify({
cp = _load_checkpoint(key, prefix="m365")
if cp:
engines["m365"] = {
"exists": True,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
}
# Google
google_emails = options.get("googleUserEmails", [])
google_sources = options.get("googleSources", [])
if google_emails and google_sources:
gkey = hashlib.sha256(_json.dumps({
"emails": sorted(google_emails),
"sources": sorted(google_sources),
"older_than_days": options.get("options", {}).get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
cp = _load_checkpoint(gkey, prefix="google")
if cp:
engines["google"] = {
"exists": True,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
}
# File sources (one checkpoint per source ID)
for src_id in options.get("fileSources", []):
fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
if cp:
fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
fe["scanned_count"] += len(cp.get("scanned_ids", []))
fe["flagged_count"] += len(cp.get("flagged", []))
if not fe["started_at"]:
fe["started_at"] = cp.get("meta", {}).get("started_at")
if not engines:
return jsonify({"exists": False})
started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
return jsonify({
"exists": True,
"scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
"flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
"started_at": min(started_ats) if started_ats else None,
"engines": engines,
})
@bp.route("/api/scan/clear_checkpoint", methods=["POST"])
def scan_clear_checkpoint():
"""Discard any saved checkpoint so the next scan starts fresh."""
_clear_checkpoint()
"""Discard all saved checkpoints so the next scan starts fresh."""
from pathlib import Path
data_dir = Path.home() / ".gdprscanner"
for f in data_dir.glob("checkpoint_*.json"):
try:
f.unlink()
except Exception:
pass
return jsonify({"status": "cleared"})
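
The Google checkpoint lookup in scan_checkpoint_info() derives its key by hashing the scan options. A minimal standalone sketch of that derivation follows; the function name google_checkpoint_key is hypothetical and used only for illustration, and the sketch assumes the scan engine builds its key from the same three fields, which this diff does not show.

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0):
    """Derive a short, stable key from Google scan options (illustrative helper)."""
    payload = {
        # Sorting makes the key independent of the order the UI sends the lists in.
        "emails": sorted(emails),
        "sources": sorted(sources),
        "older_than_days": older_than_days,
    }
    # sort_keys=True keeps the JSON text stable regardless of dict ordering.
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return digest[:16]


key_a = google_checkpoint_key(["b@example.dk", "a@example.dk"], ["gmail", "gdrive"])
key_b = google_checkpoint_key(["a@example.dk", "b@example.dk"], ["gdrive", "gmail"])
```

Because both lists are sorted before hashing, key_a and key_b are identical, so a checkpoint is found again even if the selection order changes between runs. Changing older_than_days produces a different key.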


@ -4,6 +4,10 @@ Scheduler API routes — multi-job CRUD, status, history, run-now.
from __future__ import annotations
from flask import Blueprint, jsonify, request
import sys, os, threading
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("scheduler", __name__)
@ -52,6 +56,9 @@ def scheduler_jobs_save():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_save",
f"id={job_id!r} name={jobs[i].get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": jobs[i]})
# New job
job = sm._new_job(data)
@ -61,6 +68,9 @@ def scheduler_jobs_save():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_save",
f"id={job.get('id','')!r} name={job.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": job})
except Exception as e:
import traceback
@ -81,6 +91,7 @@ def scheduler_jobs_delete():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_delete", f"id={job_id!r}", ip=request.remote_addr or "")
return jsonify({"ok": True})
except Exception as e:
import traceback


@ -3,9 +3,15 @@ File sources and file scan
"""
from __future__ import annotations
import threading
import uuid as _uuid
from pathlib import Path
from flask import Blueprint, jsonify, request
from routes import state
from app_config import _load_file_sources, _save_file_sources
from app_config import _load_file_sources, _save_file_sources, _SFTP_KEYS_DIR
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
try:
from file_scanner import store_smb_password, SMB_OK as _SMB_OK
@ -15,6 +21,12 @@ except ImportError:
_SMB_OK = False
def store_smb_password(*a, **kw): return False # type: ignore[misc]
try:
from sftp_connector import store_sftp_password, SFTP_OK as _SFTP_OK
except ImportError:
_SFTP_OK = False
def store_sftp_password(*a, **kw): return False # type: ignore[misc]
bp = Blueprint("sources", __name__)
@ -25,6 +37,7 @@ def file_sources_list():
return jsonify({
"sources": sources,
"smb_available": _SMB_OK,
"sftp_available": _SFTP_OK,
"scanner_ok": _FILE_SCANNER_OK,
})
@ -32,61 +45,156 @@ def file_sources_list():
@bp.route("/api/file_sources/save", methods=["POST"])
def file_sources_save():
"""Add or update a file source. Assigns a UUID if id is missing."""
import uuid as _uuid
data = request.get_json() or {}
path = data.get("path", "").strip()
if not path:
source_type = data.get("source_type", "")
# Validate required fields per source type
if source_type == "sftp":
if not data.get("sftp_host", "").strip():
return jsonify({"error": "sftp_host required"}), 400
if not data.get("sftp_user", "").strip():
return jsonify({"error": "sftp_user required"}), 400
if not data.get("path", "").strip():
data["path"] = "/"
else:
if not data.get("path", "").strip():
return jsonify({"error": "path required"}), 400
sources = _load_file_sources()
uid = data.get("id") or ""
for i, s in enumerate(sources):
if s.get("id") == uid:
sources[i] = {**s, **data}
_save_file_sources(sources)
_audit("source_update",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": sources[i]})
data["id"] = data.get("id") or str(_uuid.uuid4())
sources.append(data)
_save_file_sources(sources)
_audit("source_add",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": data})
@bp.route("/api/file_sources/delete", methods=["POST"])
def file_sources_delete():
"""Remove a file source by id."""
"""Remove a file source by id. Also deletes any associated SFTP key file."""
uid = (request.get_json() or {}).get("id", "")
if not uid:
return jsonify({"error": "id required"}), 400
sources = [s for s in _load_file_sources() if s.get("id") != uid]
sources = _load_file_sources()
deleted = next((s for s in sources if s.get("id") == uid), None)
sources = [s for s in sources if s.get("id") != uid]
_save_file_sources(sources)
if deleted:
_audit("source_delete",
f"name={deleted.get('name','')!r} type={deleted.get('source_type','local')!r}",
ip=request.remote_addr or "")
# Clean up key file if this was an SFTP key-auth source
if deleted and deleted.get("sftp_key_path"):
key_file = Path(deleted["sftp_key_path"])
if key_file.parent == _SFTP_KEYS_DIR and key_file.exists():
try:
key_file.unlink()
except OSError:
pass
return jsonify({"ok": True})
@bp.route("/api/file_sources/store_creds", methods=["POST"])
def file_sources_store_creds():
"""Store SMB password in the OS keychain."""
"""Store SMB or SFTP password/passphrase in the OS keychain."""
data = request.get_json() or {}
source_type = data.get("source_type", "smb")
password = data.get("password", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
host = data.get("sftp_host", "")
user = data.get("sftp_user", "")
if not user or not password:
return jsonify({"error": "sftp_user and password required"}), 400
key = data.get("keychain_key") or f"sftp:{user}@{host}"
ok = store_sftp_password(host, user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
else:
if not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503
data = request.get_json() or {}
smb_host = data.get("smb_host", "")
smb_user = data.get("smb_user", "")
password = data.get("password", "")
key = data.get("keychain_key") or smb_user
if not smb_user or not password:
return jsonify({"error": "smb_user and password required"}), 400
key = data.get("keychain_key") or smb_user
ok = store_smb_password(smb_host, smb_user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
@bp.route("/api/file_sources/upload_key", methods=["POST"])
def file_sources_upload_key():
"""Accept an SSH private key file upload and store it in the SFTP keys directory.
Validates the file is a recognised private key format before saving.
Returns {"key_id": uuid, "key_path": absolute_path}.
"""
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
if "key_file" not in request.files:
return jsonify({"error": "key_file required"}), 400
file = request.files["key_file"]
raw = file.read(65536) # 64 KB is more than enough for any private key
# Validate before saving — try loading the key material with paramiko
import io
import paramiko
loaded = False
for cls in (paramiko.RSAKey, paramiko.Ed25519Key, paramiko.ECDSAKey, paramiko.DSSKey):
try:
cls.from_private_key(io.BytesIO(raw))
loaded = True
break
except (paramiko.ssh_exception.SSHException, Exception):
continue
if not loaded:
# Might be passphrase-protected — still accept it; validation will happen at connect time
if b"-----BEGIN" not in raw and b"OPENSSH PRIVATE KEY" not in raw:
return jsonify({"error": "File does not appear to be a private key"}), 400
key_id = str(_uuid.uuid4())
key_path = _SFTP_KEYS_DIR / key_id
key_path.write_bytes(raw)
key_path.chmod(0o600)
return jsonify({"ok": True, "key_id": key_id, "key_path": str(key_path)})
@bp.route("/api/file_scan/start", methods=["POST"])
def file_scan_start():
"""Start a file system scan for a single file source."""
if not _FILE_SCANNER_OK:
"""Start a file system scan for a single file source (local, SMB, or SFTP)."""
source = request.get_json() or {}
source_type = source.get("source_type", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
elif not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503
if not state._scan_lock.acquire(blocking=False):
return jsonify({"error": "scan already running"}), 409
source = request.get_json() or {}
state._scan_abort.clear()
def _run():

routes/updates.py Normal file

@ -0,0 +1,216 @@
"""
Software update routes: check origin for new commits, apply the update,
and an optional auto-update background thread.
Only available when running from a git checkout; the frozen desktop
build (PyInstaller) reports supported=False and the UI hides the group.
Applying an update fast-forwards to origin/<branch>, reinstalls
dependencies if requirements.txt changed, then re-execs the process so
the new code is loaded. Local edits are stashed (kept), never discarded.
"""
from __future__ import annotations
import os
import subprocess
import sys
import threading
import time
from pathlib import Path
from flask import Blueprint, jsonify, request
from routes import state
from app_config import get_update_config, save_update_config
bp = Blueprint("updates", __name__)
_REPO_DIR = Path(__file__).parent.parent
_GIT_TIMEOUT = 30
_AUTO_CHECK_INTERVAL = 24 * 3600 # auto-update checks once per day
_last_auto_check = [0.0]
def _supported() -> bool:
return (not getattr(sys, "frozen", False)) and (_REPO_DIR / ".git").exists()
def _git(*args: str, timeout: int = _GIT_TIMEOUT) -> subprocess.CompletedProcess:
return subprocess.run(
["git", *args], cwd=_REPO_DIR,
capture_output=True, text=True, timeout=timeout,
)
def _scan_running() -> bool:
return state._scan_lock.locked() or state._google_scan_lock.locked()
def check_for_update() -> dict:
"""Fetch origin and compare HEAD against the tracked branch."""
if not _supported():
return {"supported": False}
try:
branch = _git("rev-parse", "--abbrev-ref", "HEAD").stdout.strip() or "main"
fetch = _git("fetch", "origin", branch, timeout=60)
if fetch.returncode != 0:
return {"supported": True, "error": fetch.stderr.strip()[:300] or "git fetch failed"}
local = _git("rev-parse", "HEAD").stdout.strip()
remote = _git("rev-parse", f"origin/{branch}").stdout.strip()
except (subprocess.TimeoutExpired, OSError) as e:
return {"supported": True, "error": str(e)[:300]}
info = {
"supported": True, "branch": branch,
"current": local[:7], "latest": remote[:7],
"up_to_date": local == remote, "commits": [],
}
if local != remote:
lg = _git("log", "--oneline", f"HEAD..origin/{branch}")
info["commits"] = lg.stdout.strip().splitlines()[:20]
return info
def apply_update() -> dict:
"""Fast-forward to origin/<branch>; returns {"ok", "updated", ...}.
Does NOT restart the process; callers decide (the route schedules a
re-exec, the auto-update thread restarts directly).
"""
chk = check_for_update()
if not chk.get("supported"):
return {"ok": False, "code": "unsupported",
"error": "Updates require running from a git checkout."}
if chk.get("error"):
return {"ok": False, "code": "check_failed", "error": chk["error"]}
if chk.get("up_to_date"):
return {"ok": True, "updated": False, "current": chk["current"]}
if _scan_running():
return {"ok": False, "code": "scan_running",
"error": "Cannot update while a scan is running."}
branch = chk["branch"]
try:
if _git("diff-index", "--quiet", "HEAD", "--").returncode != 0:
_git("stash", "push", "-m",
"auto-stash before update " + time.strftime("%Y-%m-%d %H:%M:%S"))
reqs_changed = _git(
"diff", "--quiet", f"HEAD..origin/{branch}", "--", "requirements.txt"
).returncode != 0
merge = _git("merge", "--ff-only", f"origin/{branch}")
if merge.returncode != 0:
return {"ok": False, "code": "merge_failed",
"error": (merge.stderr.strip() or "git merge failed")[:300]}
if reqs_changed:
subprocess.run(
[sys.executable, "-m", "pip", "install", "-q", "-r",
str(_REPO_DIR / "requirements.txt")],
cwd=_REPO_DIR, capture_output=True, timeout=600,
)
except (subprocess.TimeoutExpired, OSError) as e:
return {"ok": False, "code": "apply_failed", "error": str(e)[:300]}
try:
from gdpr_db import log_audit_event as _audit
_audit("app_update", f"{chk['current']} -> {chk['latest']}",
ip=(request.remote_addr if request else ""))
except Exception:
pass
return {"ok": True, "updated": True,
"from": chk["current"], "to": chk["latest"]}
def _mark_fds_cloexec() -> None:
"""Mark every fd above stderr close-on-exec.
Werkzeug calls ``srv.socket.set_inheritable(True)`` unconditionally
(for its debug reloader), so without this the listening socket leaks
into the exec'd process: it sits on the port as a zombie listener no
one accepts from, the port probe sees the port as busy, and the new
server hops to port+1 while clients hang against the dead socket.
"""
try:
fds = [int(f) for f in os.listdir("/proc/self/fd")] # Linux
except (OSError, ValueError):
fds = list(range(3, 4096))
for fd in fds:
if fd > 2:
try:
os.set_inheritable(fd, False)
except OSError:
pass
def _restart_self() -> None:
"""Re-exec the current process so the updated code is loaded.
Keeps the same PID, so it works both under systemd and when launched
manually via start_gdpr.sh.
"""
_mark_fds_cloexec()
try:
os.execv(sys.executable, [sys.executable] + sys.argv)
except OSError:
# Last resort: exit and rely on a supervisor (systemd Restart=) to
# bring the app back up.
os._exit(0)
def _schedule_restart(delay: float = 1.5) -> None:
def _later():
time.sleep(delay)
_restart_self()
threading.Thread(target=_later, daemon=True, name="update-restart").start()
# ── Routes ────────────────────────────────────────────────────────────────────
@bp.route("/api/update/check")
def update_check():
return jsonify(check_for_update())
@bp.route("/api/update/apply", methods=["POST"])
def update_apply():
res = apply_update()
if res.get("updated"):
res["restarting"] = True
_schedule_restart()
return jsonify(res), (200 if res.get("ok") else 409)
@bp.route("/api/update/settings", methods=["GET", "POST"])
def update_settings():
if request.method == "GET":
return jsonify({"supported": _supported(), **get_update_config()})
data = request.get_json(silent=True) or {}
save_update_config(bool(data.get("auto_update", False)))
return jsonify({"ok": True})
# ── Auto-update background thread ─────────────────────────────────────────────
def _auto_update_loop() -> None:
while True:
time.sleep(3600)
try:
if not get_update_config().get("auto_update"):
continue
if time.time() - _last_auto_check[0] < _AUTO_CHECK_INTERVAL:
continue
_last_auto_check[0] = time.time()
if _scan_running():
_last_auto_check[0] = 0.0 # retry on the next hourly tick
continue
res = apply_update()
if res.get("updated"):
print(f" Auto-update: {res['from']} -> {res['to']} — restarting")
_restart_self()
except Exception:
pass
def start_auto_update_thread() -> bool:
"""Called once at startup from gdpr_scanner.py. No-op for frozen builds."""
if not _supported():
return False
threading.Thread(target=_auto_update_loop, daemon=True, name="auto-update").start()
return True
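
The _mark_fds_cloexec() docstring above describes clearing the inheritable flag so the listening socket does not leak across os.execv(). The following is a small sketch of that effect on a single socket, assuming a POSIX system; mark_non_inheritable is a hypothetical stand-in that takes an explicit fd list instead of enumerating /proc/self/fd.

```python
import os
import socket


def mark_non_inheritable(fds):
    """Clear the inheritable flag on every fd above stderr (illustrative helper)."""
    for fd in fds:
        if fd > 2:
            try:
                os.set_inheritable(fd, False)
            except OSError:
                pass  # fd is closed or otherwise invalid; nothing to do


sock = socket.socket()
sock.set_inheritable(True)             # mimic the inheritable listening socket
mark_non_inheritable([sock.fileno()])  # what the restart path does before exec
still_inherited = sock.get_inheritable()
sock.close()
```

After the call, still_inherited is False, so a process started via exec would no longer receive this socket.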


@ -14,7 +14,15 @@ from app_config import (
set_viewer_pin,
verify_viewer_pin,
clear_viewer_pin,
get_interface_pin_hash,
set_interface_pin,
verify_interface_pin,
clear_interface_pin,
)
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("viewer", __name__)
@ -52,6 +60,7 @@ def list_tokens():
"token_hint": t["token"][:8] + "…",
"token": t["token"],
"label": t.get("label", ""),
"scope": t.get("scope", {}),
"created_at": t.get("created_at"),
"expires_at": t.get("expires_at"),
"last_used_at": t.get("last_used_at"),
@ -73,7 +82,49 @@ def create_token():
return jsonify({"error": "expires_days must be a positive integer"}), 400
except (TypeError, ValueError):
return jsonify({"error": "expires_days must be a positive integer"}), 400
entry = create_viewer_token(label=label, expires_days=expires_days)
raw_scope = body.get("scope", {})
if not isinstance(raw_scope, dict):
return jsonify({"error": "scope must be an object"}), 400
role = str(raw_scope.get("role", "")).strip()
# user may be a single email string (legacy) or a list of email strings
raw_user = raw_scope.get("user", "")
if isinstance(raw_user, str):
user_emails = [raw_user.strip().lower()] if raw_user.strip() else []
elif isinstance(raw_user, list):
user_emails = [str(e).strip().lower() for e in raw_user if str(e).strip()]
else:
user_emails = []
display_name = str(raw_scope.get("display_name", "")).strip()
if role and user_emails:
return jsonify({"error": "scope.role and scope.user are mutually exclusive"}), 400
if role not in ("", "student", "staff"):
return jsonify({"error": "scope.role must be '', 'student', or 'staff'"}), 400
if user_emails and not all("@" in e for e in user_emails):
return jsonify({"error": "scope.user entries must be valid email addresses"}), 400
valid_from = str(raw_scope.get("valid_from", "")).strip()
valid_to = str(raw_scope.get("valid_to", "")).strip()
from datetime import datetime as _dt
for _d, _lbl in ((valid_from, "valid_from"), (valid_to, "valid_to")):
if _d:
try:
_dt.strptime(_d, "%Y-%m-%d")
except ValueError:
return jsonify({"error": f"scope.{_lbl} must be YYYY-MM-DD"}), 400
if valid_from and valid_to and valid_from > valid_to:
return jsonify({"error": "scope.valid_from must be ≤ scope.valid_to"}), 400
if user_emails:
scope = {"user": user_emails, "display_name": display_name or user_emails[0]}
elif role:
scope = {"role": role}
else:
scope = {}
if valid_from:
scope["valid_from"] = valid_from
if valid_to:
scope["valid_to"] = valid_to
entry = create_viewer_token(label=label, expires_days=expires_days, scope=scope)
_audit("token_create", f"label={label!r} scope={scope}",
ip=request.remote_addr or "")
return jsonify(entry), 201
@ -84,6 +135,7 @@ def delete_token(token: str):
removed = revoke_viewer_token(token)
if not removed:
return jsonify({"error": "token not found"}), 404
_audit("token_revoke", f"token={token[:8]}...", ip=request.remote_addr or "")
return jsonify({"ok": True})
@ -117,10 +169,13 @@ def pin_set():
return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400
if get_viewer_pin_hash():
had_pin = bool(get_viewer_pin_hash())
if had_pin:
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
set_viewer_pin(new_pin)
_audit("viewer_pin_change" if had_pin else "viewer_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True})
@ -132,6 +187,49 @@ def pin_clear():
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
clear_viewer_pin()
_audit("viewer_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True})
# ── Interface PIN management endpoints ───────────────────────────────────────
@bp.route("/api/interface/pin", methods=["GET"])
def interface_pin_status():
"""Return whether an interface PIN is currently set."""
return jsonify({"pin_set": bool(get_interface_pin_hash())})
@bp.route("/api/interface/pin", methods=["POST"])
def interface_pin_set():
"""Set or change the interface PIN.
Body: {pin: "...", current_pin: "..."}
current_pin required only when a PIN is already set.
"""
body = request.get_json(silent=True) or {}
new_pin = str(body.get("pin", "")).strip()
if not new_pin:
return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400
had_ipin = bool(get_interface_pin_hash())
if had_ipin:
if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
set_interface_pin(new_pin)
_audit("interface_pin_change" if had_ipin else "interface_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True})
@bp.route("/api/interface/pin", methods=["DELETE"])
def interface_pin_clear():
"""Remove the interface PIN. Requires current PIN if one is set."""
body = request.get_json(silent=True) or {}
if get_interface_pin_hash():
if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
clear_interface_pin()
_audit("interface_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True})
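
create_token() validates scope.valid_from and scope.valid_to as YYYY-MM-DD and rejects a window whose start is after its end. A standalone sketch of that check follows, under one deliberate difference: it compares the parsed dates rather than the raw strings. check_date_window is a hypothetical name, and it returns the error text instead of a Flask response.

```python
from datetime import datetime


def check_date_window(valid_from, valid_to):
    """Return an error message, or None when the window is acceptable."""
    parsed = {}
    for value, label in ((valid_from, "valid_from"), (valid_to, "valid_to")):
        if value:
            try:
                parsed[label] = datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                return f"scope.{label} must be YYYY-MM-DD"
    # Compare parsed dates so the result does not depend on zero-padding.
    if len(parsed) == 2 and parsed["valid_from"] > parsed["valid_to"]:
        return "scope.valid_from must be <= scope.valid_to"
    return None


ok = check_date_window("2026-01-01", "2026-06-30")
reversed_window = check_date_window("2026-06-30", "2026-01-01")
bad_format = check_date_window("01/02/2026", "")
```

Comparing datetime objects is a design choice of this sketch: string comparison gives the same answer only when both values are zero-padded.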


@ -54,6 +54,7 @@ def _get_scan_meta():
try:
from m365_connector import (
M365Connector, M365Error, M365PermissionError, M365DeltaTokenExpired,
M365DriveNotFound,
MSAL_OK, REQUESTS_OK,
)
CONNECTOR_OK = True
@ -62,6 +63,7 @@ except ImportError:
M365Error = Exception
M365PermissionError = Exception
M365DeltaTokenExpired = Exception
M365DriveNotFound = Exception
MSAL_OK = False
REQUESTS_OK = False
CONNECTOR_OK = False
@ -73,6 +75,12 @@ except ImportError:
FileScanner = None # type: ignore[assignment,misc]
FILE_SCANNER_OK = False
try:
from sftp_connector import SFTPScanner, SFTP_OK as _SFTP_OK
except ImportError:
SFTPScanner = None # type: ignore[assignment,misc]
_SFTP_OK = False
try:
import document_scanner as ds
SCANNER_OK = True
@ -97,13 +105,17 @@ except ImportError:
# Stubs for standalone import — overwritten by gdpr_scanner.py injections
LANG: dict = {}
PHOTO_EXTS: set = set()
VIDEO_EXTS: set = set()
AUDIO_EXTS: set = set()
SUPPORTED_EXTS: set = set()
# cpr_detector helpers — injected by gdpr_scanner.py
def _scan_bytes(content, filename, poppler_path=None): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes_timeout(content, filename, timeout=60): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes(content, filename, poppler_path=None, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes_timeout(content, filename, timeout=60, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _detect_photo_faces(content, filename): return 0 # type: ignore[misc]
def _extract_exif(content, filename): return {} # type: ignore[misc]
def _extract_video_metadata(content, filename): return {} # type: ignore[misc]
def _extract_audio_metadata(content, filename): return {} # type: ignore[misc]
def _make_thumb(content, filename): return "" # type: ignore[misc]
def _placeholder_svg(ext, name): return "" # type: ignore[misc]
def _check_special_category(text, cprs): return [] # type: ignore[misc]
@ -113,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
# checkpoint helpers — injected by gdpr_scanner.py
def _checkpoint_key(opts): return "" # type: ignore[misc]
def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
def _load_checkpoint(key): return None # type: ignore[misc]
def _clear_checkpoint(): pass # type: ignore[misc]
def _load_checkpoint(key, **kw): return None # type: ignore[misc]
def _clear_checkpoint(**kw): pass # type: ignore[misc]
def _load_delta_tokens(): return {} # type: ignore[misc]
def _save_delta_tokens(t): pass # type: ignore[misc]
@ -145,18 +157,21 @@ def _with_disposition(card: dict, db) -> dict:
def run_file_scan(source: dict):
"""Scan a single local or SMB file source for CPR numbers and PII.
"""Scan a single local, SMB, or SFTP file source for CPR numbers and PII.
Reuses _scan_bytes, _broadcast_card, _check_special_category,
_detect_photo_faces and all other existing scan helpers.
Args:
source: file source dict with keys:
path, label, smb_host, smb_user, smb_domain, keychain_key,
source_type ("local"|"smb"|"sftp"), path, label,
smb_host, smb_user, smb_domain, keychain_key,
sftp_host, sftp_port, sftp_user, sftp_auth, sftp_key_path,
scan_photos (bool), max_file_mb (int)
"""
# state vars accessed via _state module
source_kind = source.get("source_type", "")
path = source.get("path", "")
label = source.get("label") or path
smb_host = source.get("smb_host") or None
@ -165,9 +180,19 @@ def run_file_scan(source: dict):
keychain_key= source.get("keychain_key") or None
smb_password= source.get("smb_password") or None
scan_photos = bool(source.get("scan_photos", False))
skip_gps_images = bool(source.get("skip_gps_images", False))
min_cpr_count = max(1, int(source.get("min_cpr_count", 1)))
scan_emails = bool(source.get("scan_emails", False))
scan_phones = bool(source.get("scan_phones", False))
cpr_only = bool(source.get("cpr_only", False))
ocr_lang = str(source.get("ocr_lang", "dan+eng")) or "dan+eng"
max_mb = int(source.get("max_file_mb", 50))
if not FILE_SCANNER_OK:
if source_kind == "sftp":
if not _SFTP_OK:
broadcast("scan_error", {"file": label, "error": "paramiko not installed — run: pip install paramiko"})
return
elif not FILE_SCANNER_OK:
broadcast("scan_error", {"file": label, "error": "file_scanner.py not found"})
return
@ -178,21 +203,52 @@ def run_file_scan(source: dict):
_db_scan_id: int | None = None
if _db:
try:
_db_scan_id = _db.begin_scan(
sources=[source.get("source_type", "local")],
user_count=0,
options=source,
)
_db_scan_id = _db.begin_scan({
"sources": [source.get("source_type", "local")],
"user_ids": [],
"options": source,
})
except Exception as e:
logger.error("[db] start_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted file scan ──
_ck_prefix = f"file_{source.get('id', 'local')}"
_ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
_ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
_file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
_file_flagged: list = [] # items found by this file scan run (for checkpoint)
_ck_resumed = len(_file_scanned_ids)
if _ck:
_file_flagged = list(_ck.get("flagged", []))
for card in _file_flagged:
_state.flagged_items.append(card)
broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
for card in _file_flagged:
broadcast("scan_file_flagged", _with_disposition(card, _db))
_CHECKPOINT_SAVE_EVERY_FILE = 25
_file_items_since_save = 0
total_scanned = 0
total_flagged = 0
broadcast("scan_start", {"sources": [label]})
broadcast("scan_phase", {"phase": f"Files \u2014 {label}"})
try:
if source_kind == "sftp":
fs = SFTPScanner(
host=source.get("sftp_host", ""),
root_path=path,
username=source.get("sftp_user", ""),
port=int(source.get("sftp_port", 22)),
auth_type=source.get("sftp_auth", "password"),
password=source.get("sftp_password") or None,
key_path=source.get("sftp_key_path") or None,
passphrase=source.get("sftp_passphrase") or None,
keychain_key=keychain_key,
max_file_bytes=max_mb * 1_048_576,
label=label,
)
else:
fs = FileScanner(
path=path,
smb_host=smb_host,
@ -210,6 +266,10 @@ def run_file_scan(source: dict):
if _state._scan_abort.is_set():
break
if rel_path in _file_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
@ -224,26 +284,41 @@ def run_file_scan(source: dict):
ext = Path(rel_path).suffix.lower()
# CPR scan — skip for images (no text layer; EXIF/face detection handles them)
# CPR scan — skip for images, video and audio (no text layer)
result: dict = {"cprs": [], "dates": []}
if ext not in PHOTO_EXTS:
if ext not in PHOTO_EXTS and ext not in VIDEO_EXTS and ext not in AUDIO_EXTS:
try:
result = _scan_bytes_timeout(content, rel_path)
result = _scan_bytes_timeout(content, rel_path, lang=ocr_lang)
except Exception as e:
broadcast("scan_error", {"file": rel_path, "error": str(e)})
continue
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# Photo / biometric scan + EXIF extraction
# Photo / biometric scan + EXIF/video/audio metadata extraction
_face_count = 0
_exif = {}
if ext in PHOTO_EXTS:
if scan_photos:
_face_count = _detect_photo_faces(content, rel_path)
_exif = _extract_exif(content, rel_path)
elif ext in VIDEO_EXTS:
_exif = _extract_video_metadata(content, rel_path)
elif ext in AUDIO_EXTS:
_exif = _extract_audio_metadata(content, rel_path)
if not cprs and _face_count == 0 and not _exif.get("has_pii"):
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
if not (_cpr_qualifies and cprs) and (cpr_only or (not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii)):
continue
# Build card metadata
@ -256,9 +331,9 @@ def run_file_scan(source: dict):
_sc = _check_special_category(_file_text, cprs)
if _face_count > 0 and "biometric" not in _sc:
_sc = sorted(_sc + ["biometric"])
if _exif.get("gps") and "gps_location" not in _sc:
if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
_sc = sorted(_sc + ["gps_location"])
if _exif.get("has_pii") and "exif_pii" not in _sc:
if _exif_has_pii and "exif_pii" not in _sc:
_sc = sorted(_sc + ["exif_pii"])
# Thumbnail for images
@ -279,6 +354,8 @@ def run_file_scan(source: dict):
"source": label,
"source_type": source_type,
"cpr_count": len(cprs),
"email_count": len(_distinct_emails),
"phone_count": len(_distinct_phones),
"url": "",
"size_kb": meta["size_kb"],
"modified": meta["modified"],
@ -299,6 +376,7 @@ def run_file_scan(source: dict):
}
_state.flagged_items.append(card)
_file_flagged.append(card)
total_flagged += 1
broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -308,10 +386,19 @@ def run_file_scan(source: dict):
except Exception as e:
logger.error("[db] save_item failed: %s", e)
_file_scanned_ids.add(rel_path)
_file_items_since_save += 1
if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
_save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
_file_items_since_save = 0
except Exception as e:
import traceback
broadcast("scan_error", {"file": label, "error": str(e)})
logger.error("[file_scan] error:\n%s", traceback.format_exc())
else:
if not _state._scan_abort.is_set():
_clear_checkpoint(prefix=_ck_prefix)
finally:
if _db and _db_scan_id:
try:
@ -389,6 +476,12 @@ def run_scan(options: dict):
max_emails = int(scan_opts.get("max_emails", 2000))
delta_enabled = bool(scan_opts.get("delta", False))
scan_photos = bool(scan_opts.get("scan_photos", False)) # biometric photo scan (#9)
skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1)))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
# Delta token state — loaded once, updated per-source, saved on completion
delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@ -442,6 +535,8 @@ def run_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("webUrl", "") or item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -458,6 +553,7 @@ def run_scan(options: dict):
"special_category": item_meta.get("_special_category", []),
"face_count": item_meta.get("_face_count", 0),
"exif": item_meta.get("_exif", {}),
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
_state.flagged_items.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -757,6 +853,10 @@ def run_scan(options: dict):
work_items.append(("file", item, None))
except M365PermissionError:
broadcast("scan_error", {"file": f"OneDrive ({uname})", "error": _permission_msg("OneDrive", uname)})
except M365DriveNotFound:
# OneDrive not provisioned for this user (no licence, service plan
# disabled, or drive never initialised). Not a scan error — skip silently.
broadcast("scan_phase", {"phase": f"OneDrive ({uname}): not provisioned — skipped"})
except Exception as e:
broadcast("scan_error", {"file": f"OneDrive ({uname})", "error": str(e)})
else:
@ -978,6 +1078,14 @@ def run_scan(options: dict):
if _check_abort():
# Save checkpoint so scan can be resumed later
_save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
# Finalise the DB scan record so items found before the stop stay
# visible — this early return otherwise skips finish_scan below,
# stranding them (invisible to get_session_items / get_open_items).
if _db and _db_scan_id:
try:
_db.finish_scan(_db_scan_id, resumed_count + idx + 1)
except Exception as _e:
logger.error("[db] finish_scan (aborted) failed: %s", _e)
return
idx += 1
kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately
@ -1005,11 +1113,17 @@ def run_scan(options: dict):
# Scan body — use pre-extracted text (body HTML was stripped at
# collection time to keep work_items memory footprint small)
all_cprs = []
all_emails = []
all_phones = []
body_text = ""
if scan_email_body:
body_text = meta.pop("_precomputed_body", "")
body_result = _scan_text_direct(body_text)
all_cprs = list(body_result.get("cprs", []))
if scan_emails:
all_emails = list(body_result.get("emails", []))
if scan_phones:
all_phones = list(body_result.get("phones", []))
# Scan attachments
uid = meta.get("_account_id", "me")
@ -1029,21 +1143,31 @@ def run_scan(options: dict):
try:
att_bytes = (conn.download_attachment_for(uid, msg_id, att["id"])
if uid != "me" else conn.download_attachment(msg_id, att["id"]))
att_result = _scan_bytes(att_bytes, att_name)
att_result = _scan_bytes(att_bytes, att_name, lang=ocr_lang)
att_cprs = att_result.get("cprs", [])
all_cprs.extend(att_cprs)
if scan_emails:
all_emails.extend(att_result.get("emails", []))
if scan_phones:
all_phones.extend(att_result.get("phones", []))
att_results.append({"name": att_name, "cpr_count": len(att_cprs)})
except Exception as att_err:
broadcast("scan_error", {"file": att_name, "error": str(att_err)})
if all_cprs:
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in all_emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in all_phones))
if all_cprs or (not cpr_only and (_distinct_emails or _distinct_phones)):
meta["_thumb"] = _placeholder_svg(".eml", subject)
meta["_thumb_is_jpeg"] = False
meta["_attachments"] = att_results
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_email_pii = _get_pii_counts(body_text) if scan_email_body else {}
meta["_transfer_risk"] = _check_transfer_risk(meta)
meta["_special_category"] = _check_special_category(
body_text if scan_email_body else "", all_cprs)
# Store a short excerpt so preview still works if Graph is unavailable
meta["_body_excerpt"] = body_text[:500].strip() if body_text else ""
_broadcast_card(meta, all_cprs, pii_counts=_email_pii)
del body_text # free email text — may be large for HTML-rich emails
@ -1068,19 +1192,37 @@ def run_scan(options: dict):
content = conn.download_drive_item_for(uid, item_id)
else:
content = conn.download_item(meta)
result = _scan_bytes(content, name)
cprs = result.get("cprs", [])
# ── Biometric photo scan (#9) + EXIF (#18) ───────────────
# CPR/email/phone scan — skip for video and audio (metadata-only; no text layer)
_media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS
result = {"cprs": [], "dates": [], "emails": [], "phones": []} if _media_only else _scan_bytes(content, name, lang=ocr_lang)
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─
_face_count = 0
_exif = {}
if ext in PHOTO_EXTS:
if scan_photos:
_face_count = _detect_photo_faces(content, name)
_exif = _extract_exif(content, name)
elif ext in VIDEO_EXTS:
_exif = _extract_video_metadata(content, name)
elif ext in AUDIO_EXTS:
_exif = _extract_audio_metadata(content, name)
# Flag item if CPRs found, faces detected, or EXIF PII found
if cprs or _face_count > 0 or _exif.get("has_pii"):
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
# Flag item if CPRs/emails/phones found, faces detected, or EXIF PII found
if (_cpr_qualifies and cprs) or (not cpr_only and (_distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii)):
# Make thumbnail
if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
thumb = _make_thumb(content, name)
@ -1109,13 +1251,15 @@ def run_scan(options: dict):
# the category even when no CPR is present in the file.
if _face_count > 0 and "biometric" not in _sc:
_sc = sorted(_sc + ["biometric"])
if _exif.get("gps") and "gps_location" not in _sc:
if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
_sc = sorted(_sc + ["gps_location"])
if _exif.get("has_pii") and "exif_pii" not in _sc:
if _exif_has_pii and "exif_pii" not in _sc:
_sc = sorted(_sc + ["exif_pii"])
meta["_special_category"] = _sc
meta["_face_count"] = _face_count
meta["_exif"] = _exif
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_broadcast_card(meta, cprs, pii_counts=_file_pii)
else:
del content # no hits — free raw bytes immediately
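
The flagging rule that both scan paths above now apply can be read as one predicate. The following is a minimal sketch, assuming each hit is a dict with a `formatted` key as in the diff; `distinct` and `should_flag` are hypothetical names for illustration and do not exist in scan_engine.py.

```python
def distinct(hits):
    """Order-preserving de-duplication on each hit's 'formatted' value."""
    return list(dict.fromkeys(h["formatted"] for h in hits))


def should_flag(cprs, emails, phones, face_count=0, exif_has_pii=False,
                min_cpr_count=1, cpr_only=False):
    """Hypothetical helper mirroring the condition used in the diff."""
    # CPR hits count only when enough *distinct* numbers were found.
    cpr_qualifies = len(distinct(cprs)) >= min_cpr_count
    if cpr_qualifies and cprs:
        return True
    # In CPR-only mode nothing else can flag the item.
    if cpr_only:
        return False
    return bool(distinct(emails) or distinct(phones)
                or face_count > 0 or exif_has_pii)
```

The file-scan path writes the same rule in negated form (`if not ...: continue`), which is equivalent by De Morgan's laws.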


@ -43,6 +43,7 @@ _DEFAULT_JOB: dict[str, Any] = {
"profile_id": "",
"auto_email": False,
"auto_retention": False,
"report_only": False,
"retention_years": None,
"fiscal_year_end": None,
}
@ -270,6 +271,35 @@ class ScanScheduler:
})
from routes import state
# ── Report-only path: skip scan, email latest DB results ──────────
if job_cfg.get("report_only"):
if not _m.flagged_items and _m.DB_OK:
try:
_db_inst = _m._get_db()
_db_rows = _db_inst.get_session_items() if _db_inst else []
if _db_rows:
_m.flagged_items[:] = _db_rows
except Exception:
pass
if not _m.flagged_items:
raise RuntimeError(
"No scan results available — run a scan first")
run["flagged"] = len(_m.flagged_items)
run["scanned"] = 0
run["status"] = "completed"
try:
self._send_email_report(job_cfg)
run["emailed"] = 1
except Exception as _re:
run["status"] = "failed"
run["error"] = f"Email failed: {_re}"
_m.broadcast("scheduler_done", {
"flagged": run["flagged"], "scanned": 0,
"emailed": run["emailed"], "job_name": job_cfg.get("name", ""),
})
return
# If connector not set, attempt to restore from saved config
if not state.connector or not state.connector.is_authenticated():
try:
@ -310,6 +340,16 @@ class ScanScheduler:
# Fire file scan for each file source in the profile
# file_sources may be IDs (strings) or full dicts — resolve either
_all_file_sources = {s["id"]: s for s in (_m._load_file_sources() or []) if isinstance(s, dict)}
# Merge per-scan options from the profile so the file scan honours
# cpr_only/ocr_lang/scan_photos/etc. (the browser does this in
# startScan(); the scheduler must mirror it).
_profile_opts = options.get("options", {}) or {}
_FS_OPT_KEYS = (
"scan_photos", "skip_gps_images", "min_cpr_count",
"scan_emails", "scan_phones", "cpr_only", "ocr_lang",
"max_file_mb",
)
_fs_extra = {k: _profile_opts[k] for k in _FS_OPT_KEYS if k in _profile_opts}
for fs in options.get("file_sources", []):
# Resolve string IDs to full source dicts
if isinstance(fs, str):
@ -317,6 +357,7 @@ class ScanScheduler:
if not isinstance(fs, dict) or not fs.get("path"):
logger.warning("[scheduler] skipping invalid file source: %r", fs)
continue
fs = {**fs, **_fs_extra}
try:
_m.run_file_scan(fs)
except Exception as _fse:
@ -432,7 +473,7 @@ class ScanScheduler:
logger.info("[scheduler] Profile '%s': sources=%s, users=%d",
p.get("name", pid), opts["sources"], len(opts.get("user_ids", [])))
_m.broadcast("scheduler_debug", {
"msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get("user_ids",[]))}"})
"msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get('user_ids',[]))}"})
return opts
logger.info("[scheduler] Profile '%s' not found — using saved settings", pid)
_m.broadcast("scheduler_debug", {"msg": f"Profile id '{pid}' not found — falling back to saved settings"})
@ -455,11 +496,15 @@ class ScanScheduler:
raise RuntimeError("No email recipients configured")
job_name = job_cfg.get("name", "scheduled scan")
subject = f"GDPR Scanner — {job_name} {datetime.now().strftime('%Y-%m-%d %H:%M')}"
if job_cfg.get("report_only"):
scan_line = f"Report on latest scan results. {len(_m.flagged_items)} item(s) flagged."
else:
scan_line = f"Scan completed. {len(_m.flagged_items)} item(s) flagged."
body = (
"<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>"
"<h2 style='color:#1F3864'>&#128336; GDPR Scanner — scheduled scan report</h2>"
f"<p>Job: <strong>{job_name}</strong></p>"
f"<p>Scan completed. {len(_m.flagged_items)} item(s) flagged.</p>"
f"<p>{scan_line}</p>"
f"<p>Report attached: {fname}</p></body></html>")
from routes.email import _send_email_graph
from routes import state

sftp_connector.py Normal file

@ -0,0 +1,292 @@
"""
sftp_connector.py: SFTP file iterator for GDPR Scanner.
Provides SFTPScanner.iter_files() which yields (relative_path, bytes, metadata)
for files on an SFTP/SSH server, using the same interface as FileScanner so that
run_file_scan() in scan_engine.py works identically for all three source types.
Optional dependency:
paramiko>=3.4: SSH/SFTP client (pip install paramiko)
If paramiko is not installed, SFTP_OK is False and callers must check before use.
"""
from __future__ import annotations
import stat
import time
from pathlib import PurePosixPath
from typing import Iterator
from file_scanner import SKIP_DIRS, MAX_FILE_BYTES, _skip, _error, KEYCHAIN_SERVICE
# ── Optional dependency ───────────────────────────────────────────────────────
try:
import paramiko
SFTP_OK = True
except ImportError:
SFTP_OK = False
try:
import keyring as _keyring
_KEYRING_OK = True
except ImportError:
_KEYRING_OK = False
# ── Credential helpers ────────────────────────────────────────────────────────
def get_sftp_password(host: str, user: str, keychain_key: str | None = None) -> str | None:
"""Return SFTP password or key passphrase from OS keychain."""
if not _KEYRING_OK:
return None
account = keychain_key or f"sftp:{user}@{host}"
try:
return _keyring.get_password(KEYCHAIN_SERVICE, account) or None
except Exception:
return None
def store_sftp_password(host: str, user: str, password: str,
keychain_key: str | None = None) -> bool:
"""Store SFTP password or passphrase in the OS keychain. Returns True on success."""
if not _KEYRING_OK:
return False
account = keychain_key or f"sftp:{user}@{host}"
try:
_keyring.set_password(KEYCHAIN_SERVICE, account, password)
return True
except Exception:
return False
# ── SFTPScanner ───────────────────────────────────────────────────────────────
class SFTPScanner:
"""SFTP file iterator — identical iter_files() interface to FileScanner."""
def __init__(
self,
host: str,
root_path: str,
username: str,
port: int = 22,
auth_type: str = "password", # "password" | "key"
password: str | None = None,
key_path: str | None = None,
passphrase: str | None = None,
keychain_key: str | None = None,
max_file_bytes: int = MAX_FILE_BYTES,
label: str = "",
):
self.host = host
self.port = port
self.root_path = root_path.rstrip("/") or "/"
self.username = username
self.auth_type = auth_type
self.key_path = key_path
self.keychain_key = keychain_key
self.max_file_bytes = max_file_bytes
self.label = label or f"{username}@{host}"
# Resolve credentials from keychain if not provided directly
self._password = password
self._passphrase = passphrase
if not self._password and auth_type == "password":
self._password = get_sftp_password(host, username, keychain_key)
if not self._passphrase and auth_type == "key" and key_path:
self._passphrase = get_sftp_password(host, username, keychain_key)
@staticmethod
def sftp_available() -> bool:
return SFTP_OK
@property
def source_type(self) -> str:
return "sftp"
# ── Public ────────────────────────────────────────────────────────────────
def iter_files(
self,
extensions: set[str] | None = None,
progress_cb=None,
) -> Iterator[tuple[str, bytes | None, dict]]:
"""Yield (relative_path, content_bytes, metadata) for every scannable file.
Same contract as FileScanner.iter_files(): oversized and unreadable files
yield a sentinel with content=None and meta['skipped']=True.
"""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
exts = extensions or DEFAULT_EXTENSIONS
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
connect_kwargs: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
pkey = _load_pkey(self.key_path, self._passphrase)
connect_kwargs["pkey"] = pkey
else:
connect_kwargs["password"] = self._password or ""
# Disable agent and key lookup when using password so paramiko doesn't
# prompt interactively when the server advertises pubkey auth.
connect_kwargs["look_for_keys"] = False
connect_kwargs["allow_agent"] = False
ssh.connect(**connect_kwargs)
try:
sftp = ssh.open_sftp()
try:
yield from self._walk(sftp, self.root_path, exts, progress_cb)
finally:
sftp.close()
finally:
ssh.close()
def _ssh_connect(self):
"""Return a connected paramiko SSHClient. Caller must call .close()."""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
kw: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
kw["pkey"] = _load_pkey(self.key_path, self._passphrase)
else:
kw["password"] = self._password or ""
kw["look_for_keys"] = False
kw["allow_agent"] = False
ssh.connect(**kw)
return ssh
def read_file(self, remote_path: str) -> bytes:
"""Download and return the raw bytes of a single remote file."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "rb") as fh:
return fh.read()
finally:
sftp.close()
finally:
ssh.close()
def write_file(self, remote_path: str, content: bytes) -> None:
"""Write content to remote_path on the SFTP server, overwriting if it exists."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "wb") as fh:
fh.write(content)
finally:
sftp.close()
finally:
ssh.close()
# ── Private walker ────────────────────────────────────────────────────────
def _walk(
self,
sftp,
directory: str,
exts: set[str],
progress_cb,
) -> Iterator[tuple[str, bytes | None, dict]]:
source_root = f"sftp://{self.username}@{self.host}{self.root_path}"
try:
entries = sftp.listdir_attr(directory)
except OSError as e:
rel = _rel(directory, self.root_path) or "."
yield _error(rel, str(e), "sftp", source_root)
return
for attr in entries:
name = attr.filename
if name.startswith("."):
continue
if name.lower() in SKIP_DIRS:
continue
full_remote = f"{directory}/{name}".replace("//", "/")
rel = _rel(full_remote, self.root_path)
if attr.st_mode is not None and stat.S_ISDIR(attr.st_mode):
yield from self._walk(sftp, full_remote, exts, progress_cb)
continue
ext = PurePosixPath(name).suffix.lower()
if ext not in exts:
continue
size = attr.st_size or 0
if size > self.max_file_bytes:
yield _skip(rel, size, "sftp", source_root)
continue
if progress_cb:
progress_cb(rel)
modified = (
time.strftime("%Y-%m-%d", time.gmtime(attr.st_mtime))
if attr.st_mtime else ""
)
meta = {
"size_kb": round(size / 1024, 1),
"modified": modified,
"source_type": "sftp",
"source_root": source_root,
"full_path": full_remote,
"skipped": False,
}
try:
with sftp.open(full_remote, "rb") as fh:
content = fh.read(self.max_file_bytes)
yield rel, content, meta
except OSError as e:
yield _error(rel, str(e), "sftp", source_root)
# ── Helpers ───────────────────────────────────────────────────────────────────
def _rel(full_path: str, root: str) -> str:
"""Return path relative to root, stripping leading slash."""
if full_path.startswith(root):
return full_path[len(root):].lstrip("/")
return full_path.lstrip("/")
def _load_pkey(key_path: str, passphrase: str | None):
"""Load a private key from disk, trying RSA → Ed25519 → ECDSA → DSS."""
for cls in (
paramiko.RSAKey,
paramiko.Ed25519Key,
paramiko.ECDSAKey,
paramiko.DSSKey,
):
try:
return cls.from_private_key_file(key_path, password=passphrase)
except paramiko.ssh_exception.SSHException:
continue
except FileNotFoundError:
raise
raise ValueError(f"Unrecognised private key format: {key_path}")


@ -22,8 +22,65 @@ Never revert to `!!window._googleConnected` / `_fileSources.length > 0` — thos
`_PHASE_SOURCE_MAP` ordering matters — `Google Workspace` must appear before `Gmail` in the map. The email regex uses `/iu` flags — do not drop the `i`.
## Profile startup race conditions — profiles.js + users.js
`loadProfiles()` (fast, local file) resolves before `loadUsers()` (slow, Graph API). The user can select a profile before `S._allUsers` or the sources panel is populated.
- **`user_ids = "all"` must be deferred** — if `S._allUsers` is empty when `_applyProfile()` runs, set `window._pendingProfileAllUsers = true` instead of calling `.forEach()` on an empty array. `loadUsers()` checks this flag after populating `S._allUsers` and selects everyone. Do not remove this — reverting will silently leave all accounts unchecked whenever a profile is chosen on a fast machine before the user list loads.
- **Source checkboxes may not exist yet**: `_applyProfile()` calls `renderSourcesPanel()` first if `#sourcesPanel` contains no `input[data-source-id]` nodes. The same guard is used in `loadUsers()`. Without it, `querySelectorAll` returns nothing and the profile's source selection is discarded; the next `renderSourcesPanel()` call re-renders all sources as checked (their default).
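
The deferred "select all" rule can be modelled outside the browser. This is a minimal Python sketch of the logic only, not the JavaScript in `profiles.js` or `users.js`; the `Selection` class and its attribute names are hypothetical stand-ins for `S._allUsers` and `window._pendingProfileAllUsers`.

```python
class Selection:
    """Hypothetical model of the profile/user-list startup race."""

    def __init__(self):
        self.all_users = []       # stands in for S._allUsers
        self.selected = set()
        self.pending_all = False  # stands in for _pendingProfileAllUsers

    def apply_profile(self, user_ids):
        if user_ids == "all":
            if not self.all_users:
                # User list not loaded yet: remember the intent instead
                # of iterating over an empty list.
                self.pending_all = True
                return
            self.selected = {u["id"] for u in self.all_users}
        else:
            self.selected = set(user_ids)

    def load_users(self, users):
        self.all_users = list(users)
        if self.pending_all:
            # Honour the deferred "all" selection now that users exist.
            self.selected = {u["id"] for u in self.all_users}
            self.pending_all = False
```

Calling `apply_profile("all")` before `load_users(...)` still ends with every account selected, which is the behaviour the first bullet protects.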
## SSE teardown — scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. Close `S.es` only when all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans**: `S._userStartedScan` is false for scheduler-triggered runs, so SSE is never closed and future scheduler events continue to arrive.
- **Two separate abort events**: `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort`.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — `scan_cancelled` unconditionally closes the SSE; `google_scan_done` checks whether other scans are still running before closing.
- **`scan_phase` replay sets running flags — handled by `sse_replay_done`** — the `scan_phase` handler sets running flags to `true` whenever all flags are `false` and a source keyword is found in the phase text. On page refresh this fires during SSE replay of a completed scan, temporarily making the scan appear running. The `sse_replay_done` handler retries `loadHistorySession(null)` if no scan is running and `S._historyRefScanId` is still `null` after replay. Do not remove either the flag-setting logic or the retry.
- **Google Drive uses a lazy generator, not `list()`**: `iter_drive_files()` is iterated directly so `_check_abort()` fires between items. Wrapping it in `list()` blocks the thread for the entire enumeration.
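
The close-only-when-everything-finished rule from the first two bullets can be stated as a small predicate. This is a minimal Python model of the logic, not the code in `scan.js`; `ScanFlags` and `should_close_stream` are hypothetical names, and the flags stand in for `S._m365ScanRunning`, `S._googleScanRunning`, and `S._fileScanRunning`.

```python
from dataclasses import dataclass


@dataclass
class ScanFlags:
    m365: bool = False
    google: bool = False
    file: bool = False


def should_close_stream(flags, finished, user_started=True):
    """Return True when the event stream may be closed.

    `finished` names the scan whose done event just arrived
    ("m365", "google" or "file"); its own flag is ignored.
    """
    if not user_started:
        # Scheduler-triggered runs keep the stream open.
        return False
    running = {"m365": flags.m365, "google": flags.google, "file": flags.file}
    running.pop(finished)
    return not any(running.values())
```

Each done handler asks the same question about the other two scans, which is why a Google abort emits `google_scan_done` rather than the unconditional `scan_cancelled`.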
## Scan history browser — history.js + results.js
- **`S._historyRefScanId`** — `null` = live/SSE mode **or** the default open-items view; positive int = viewing a past session. Set by `loadHistorySession()`; cleared by `exitHistoryMode()`.
- **`loadHistorySession(null)` → `loadOpenItems()`** — passing `null` no longer resolves to the latest session. It now loads **all open (unactioned) items across every scan** via `GET /api/db/flagged` (no `ref`), leaves `_historyRefScanId` null, and shows no history banner. The "Open items" banner button (`onclick="loadHistorySession(null)"`, key `history_btn_latest`) therefore returns to this open-items view. Specific sessions are still loaded with a positive `ref`, which keeps the re-scan resolved-diff. Do not revert `null` to "resolve latest ref" — that reintroduces the "only the last scan is shown" complaint.
- **Auto-load on page load**: `_sseWatchdog()` in `results.js` calls `window.loadHistorySession?.(null)` whenever `/api/scan/status` reports neither `running` (M365 + file lock) nor `google_running` (Google lock) **and** nothing is shown yet (`!S._historyRefScanId && !S.flaggedData.length`). This is **not one-shot** — it retries on every 4s poll until a session is restored, because (a) the replay buffer is empty after a server restart so `sse_replay_done` never fires, and (b) a completed scan's replayed `scan_phase` can leave a running flag set that would otherwise block the load forever. Because both locks are confirmed free, the watchdog clears the stale `_m365/_google/_fileScanRunning` flags before calling. Do not revert to a one-shot `_initialStatusChecked` gate — that reintroduces the "blank grid after refresh/restart" bug. `/api/scan/status` **must** report `google_running` separately; `running` alone misses live Google scans. The `sse_replay_done` handler in `scan.js` still retries for the non-empty-buffer (no-restart) case.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Do not hide/show from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` so the outside-click handler works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation**: `invalidateHistoryCache()` clears `_sessions` and `_latestRefScanId`. All three `*_done` SSE handlers call `window.invalidateHistoryCache?.()`.
- **Re-scan diff** — items present in the previous session but absent from the current one are tagged `_resolved: true`, rendered with `.card-resolved` and a green ✓ badge, and NOT added to `S.flaggedData` (grid-only, cannot be bulk-selected or exported).
- **Mode transitions**: `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid.
- **`renderGrid(files)` hides the landing cards** — whenever `files.length > 0` it hides `#emptyState` and `#lastScanSummary` and shows `#grid`. This is centralised here because the live `scan_file_flagged` handler (`scan.js`) shows the grid but does NOT clear those panels, so results would render *underneath* a still-visible landing/last-scan card until a manual refresh. Do not move this hiding back into individual callers — every render path (live SSE, `loadOpenItems`, history, filters) must clear the landing. The empty case (`files.length === 0`) is left untouched so callers still control the empty/landing state.
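
The re-scan diff bullet describes a set difference keyed on item id. This is a minimal Python sketch of that rule, not the code in `history.js`; `diff_sessions` is a hypothetical name and the `id` key is assumed to identify an item across sessions.

```python
def diff_sessions(previous, current):
    """Split a re-scan into selectable items and resolved ghost cards.

    Returns (flagged, resolved): `flagged` is what would go into the
    selectable data set, `resolved` is rendered in the grid only.
    """
    current_ids = {item["id"] for item in current}
    resolved = [
        {**item, "_resolved": True}   # copy, so the input is not mutated
        for item in previous
        if item["id"] not in current_ids
    ]
    return list(current), resolved
```

Keeping the resolved cards out of the first list is what makes them grid-only: they can be shown with a resolved badge without becoming candidates for bulk selection or export.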
## Card user/group badge — results.js
- **`_accountPill(f)`** builds the account/role pill for both card layouts (list + grid). The **group badge is driven by `f.user_role`** (`student`/`staff`) alone, so it renders even with no display name: items from scans saved before `account_name` was persisted (DB migration 11) have only `user_role` + `account_id`. The user label resolves best-effort: `f.account_name` → `S._allUsers` match (by `id` or `email`) → email-style `account_id` → omit. Do not re-nest the role badge inside an `account_name` check (the old bug), because that hides the group badge for legacy items. Both layouts call `_accountPill(f)`; keep them sharing the one helper.
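
The best-effort label chain above might look roughly like the following. This is a sketch under assumptions: `resolveUserLabel` and `roleBadge` are hypothetical names, and the user objects are assumed to carry `id`, `email` and `displayName` fields.

```javascript
// Hypothetical sketch of the label fallback chain:
// account_name -> S._allUsers match -> email-style account_id -> omit.
function resolveUserLabel(f, allUsers) {
  if (f.account_name) return f.account_name;
  const id = f.account_id || '';
  const match = (allUsers || []).find(u => u.id === id || u.email === id);
  if (match) return match.displayName || match.email || '';
  if (id.includes('@')) return id; // email-style account_id is readable as-is
  return '';                       // nothing usable: omit the label
}

// The group badge depends only on user_role, never on the label.
function roleBadge(f) {
  return f.user_role === 'student' || f.user_role === 'staff' ? f.user_role : '';
}
```

Keeping the two functions independent is what lets a legacy item with no name still show its group badge.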
## CPR cross-referencing — results.js
- **`_loadRelated(f)`** — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview`.
- **`window._openRelated(id, itemData)`** — looks up `id` in `S.flaggedData` first, falls back to `itemData` from the API response for items not yet in the grid.
## Sources panel resize — log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()`. Clears inline height, reads `scrollHeight`, then restores a saved preference from `localStorage` (`gdpr_sources_h`) or pins to `scrollHeight`.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. Captures `scrollHeight` as hard max on `pointerdown`; saves to `localStorage` on release.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered**: `scrollHeight` will be 0.
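
As an illustration of the height choice, here is a small pure function. It is a sketch only: `pickSourcesHeight` is a hypothetical name, and clamping the saved value to `scrollHeight` is an assumption drawn from the "hard max" note on the drag handler, not confirmed behaviour of `_fitSourcesPanel()`.

```javascript
// Hypothetical pure version of the height decision.
// savedValue: raw string from localStorage ('gdpr_sources_h') or null.
// scrollHeight: natural content height measured after clearing inline height.
function pickSourcesHeight(savedValue, scrollHeight) {
  if (!scrollHeight) return null; // panel not rendered yet: do nothing
  const saved = parseInt(savedValue, 10);
  if (!Number.isFinite(saved) || saved <= 0) return scrollHeight;
  return Math.min(saved, scrollHeight); // assumption: never taller than content
}
```

Returning `null` for a zero `scrollHeight` mirrors the rule above about not fitting the panel before it has rendered.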
## Viewer mode — viewer.js
- **`window.VIEWER_MODE`** — injected by Jinja2. `auth.js` adds `viewer-mode` class to `<body>`; all hide rules are CSS (`body.viewer-mode …`) except `delBtn` which is also guarded in JS.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. If `VIEWER_SCOPE.role` is set, `auth.js` pre-sets `#filterRole` and hides the dropdown.
- **Token onclick attributes** — Copy/Revoke buttons pass the token as a single-quoted JS string literal, never via `JSON.stringify` (which produces double-quoted strings that break `onclick="…"` attributes).
- **Share link base URL**: `_getShareBaseUrl()` uses `window.location.origin` whenever the page is served over HTTPS or from a non-localhost host (a reverse-proxied hostname or LAN IP is already routable, and rewriting it to `http://<LAN-IP>` would bypass the proxy's TLS). Only when browsing at `localhost`/`127.0.0.1` over HTTP does it fetch `/api/local_ip` (LAN IP via UDP probe to `8.8.8.8`) so copied links work from other machines. The result is cached in `_shareBaseUrl` so Copy buttons stay within the click gesture. Both `createShareLink` and `copyTokenLink` are `async`. Do not make it return bare `window.location.origin` unconditionally, because that reintroduces unusable `127.0.0.1` links.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`. `switchSettingsTab('security')` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`.
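
The share-link rule above reduces to a small decision. The sketch below is illustrative only: `needsLanLookup` and `buildShareBase` are hypothetical names, `loc` stands in for `window.location`, and reusing the current port for the LAN URL is an assumption.

```javascript
// Hypothetical: only plain-HTTP localhost needs the /api/local_ip lookup.
function needsLanLookup(loc) {
  const isLocal = loc.hostname === 'localhost' || loc.hostname === '127.0.0.1';
  return loc.protocol === 'http:' && isLocal;
}

// Hypothetical: build the base URL once the LAN IP (if any) is known.
// Falls back to the current origin when no rewrite is needed or no IP was found.
function buildShareBase(loc, lanIp) {
  if (!needsLanLookup(loc) || !lanIp) return loc.origin;
  return 'http://' + lanIp + (loc.port ? ':' + loc.port : '');
}
```

In the real helper the result would be cached after the first call so that later Copy clicks need no network round trip.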
## Gotchas
- **`navigator.clipboard` is `undefined` over plain HTTP** — the app is normally reached at `http://<LAN-IP>:5100`, a non-secure context where the Clipboard API does not exist, so calling `navigator.clipboard.writeText(...)` throws synchronously (a `.catch()` on it never runs). Always copy via `window._copyText(text, btn)` (defined in `viewer.js`) — it feature-detects the API and falls back to `document.execCommand('copy')`, then to a `prompt()`. Because `execCommand` needs a user gesture, don't `await` network calls between the click and the copy; `_getShareBaseUrl()` caches its result for this reason.
- **`scheduler.js` strings must use `t()`** — frequency labels, "Next", "Running...", "Disabled", empty-job text, and empty-history text all have translation keys. Do not hard-code English strings in `schedLoad()` or `schedRenderJobs()`.
- **Scheduler UI — `schedToggleReportOnly()`** — dims the Profile row, shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
- **Profile editor accounts** — default to unchecked. Only explicitly saved `user_ids` are checked.
- **Date presets** — stored as `years * 365` (integer days). Do not use `* 365.25`.
- **`copyTokenLink` is async**: called from `onclick` attributes as fire-and-forget (the Promise is unhandled, which is fine). It `await`s `_getShareBaseUrl()` before building the URL, so that links copied while browsing at `localhost` use the machine's LAN IP. Do not make it synchronous or build the URL from `window.location.origin` directly.
- **Escape scan-derived strings with `esc()`**: `results.js` defines `esc()` (escapes `& < > " '`). Every value that originates from scanned content (`f.name`, `f.account_name`, `f.folder`, `f.source`, `f.modified`, `label`, image `alt`, and the same fields on `item`/related rows) must pass through `esc()` before going into `innerHTML` or a `title=`/`alt=` attribute. These are attacker-influenceable (e.g. a file named with markup), so an unescaped interpolation is stored XSS, including in shared read-only viewer sessions. Numeric counts (`cpr_count`, `size_kb`) don't need it. When embedding an object in an `onclick` payload, also `.replace(/"/g,'&quot;')` the `JSON.stringify(...)`.
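
A minimal sketch of an escaper covering the five characters listed above, plus the `onclick` payload quoting. The real `esc()` in `results.js` may differ in detail; `onclickPayload` is a hypothetical helper name.

```javascript
// Escapes & < > " ' so the value is safe in innerHTML and quoted attributes.
// The ampersand must be replaced first, or later entities get double-escaped.
function esc(value) {
  return String(value == null ? '' : value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: JSON for use inside a double-quoted onclick="..." attribute.
function onclickPayload(obj) {
  return JSON.stringify(obj).replace(/"/g, '&quot;');
}
```

A file named `<img src=x onerror=...>` then renders as literal text instead of becoming an element.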


@ -159,6 +159,24 @@ if (window.VIEWER_MODE) {
document.body.classList.add('viewer-mode');
document.getElementById('authScreen').style.display = 'none';
document.getElementById('scannerScreen').style.display = 'flex';
// If this token is role-scoped, lock the filter to that role and hide the dropdown.
const _scopeRole = (window.VIEWER_SCOPE || {}).role || '';
if (_scopeRole) {
const _fr = document.getElementById('filterRole');
if (_fr) { _fr.value = _scopeRole; _fr.style.display = 'none'; }
}
// If this token is user-scoped, show a locked identity badge and hide irrelevant filters.
const _scopeUserRaw = (window.VIEWER_SCOPE || {}).user;
if (_scopeUserRaw && (Array.isArray(_scopeUserRaw) ? _scopeUserRaw.length : _scopeUserRaw)) {
const _fr = document.getElementById('filterRole');
if (_fr) _fr.style.display = 'none';
const _badge = document.getElementById('viewerIdentityBadge');
if (_badge) {
_badge.textContent = (window.VIEWER_SCOPE || {}).display_name
|| (Array.isArray(_scopeUserRaw) ? _scopeUserRaw[0] : _scopeUserRaw);
_badge.style.display = '';
}
}
try { loadTrend(); } catch(e) {}
} else {
(async function() {


@ -378,6 +378,19 @@ function getGoogleScanOptions() {
// ── File sources pane ─────────────────────────────────────────────────────────
function _srcIcon(s) {
if (s.source_type === 'sftp') return '\uD83D\uDD12';
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
return isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
}
function _srcSubtitle(s) {
if (s.source_type === 'sftp') {
return _esc((s.sftp_user||'')+'@'+(s.sftp_host||'')+(s.path||'/'));
}
return _esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'');
}
function srcFileRenderList() {
const list = document.getElementById('srcFileList');
if (!list) return;
@ -386,8 +399,7 @@ function srcFileRenderList() {
return;
}
list.innerHTML = S._fileSources.map(function(s) {
- const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
- const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
+ const icon = _srcIcon(s);
const sid = _esc(s.id||'');
const slabel = _esc(s.label||s.path||'');
return '<div class="fsrc-row">'
@ -398,11 +410,47 @@ function srcFileRenderList() {
+'<button class="btn-edit" onclick="srcFileEdit(\''+sid+'\')" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer">'+t('m365_fsrc_edit_btn','Edit')+'</button>'
+'<button class="btn-del" onclick="srcFileDelete(\''+sid+'\',\''+slabel+'\')">'+t('m365_profile_delete','Delete')+'</button>'
+'</div></div>'
+'<div class="fsrc-row-path">'+_esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'')+'</div>'
+'<div class="fsrc-row-path">'+_srcSubtitle(s)+'</div>'
+'</div>';
}).join('');
}
function srcFileTypeSelect(type) {
document.getElementById('srcFileSourceType').value = type;
var pathRow = document.getElementById('srcFilePathRow');
var smbFields = document.getElementById('srcFileSmbFields');
var sftpFields= document.getElementById('srcFileSftpFields');
if (pathRow) pathRow.style.display = type === 'sftp' ? 'none' : '';
if (smbFields) smbFields.style.display = type === 'smb' ? 'flex' : 'none';
if (sftpFields)sftpFields.style.display= type === 'sftp' ? 'flex' : 'none';
['srcTypeLocal','srcTypeSmb','srcTypeSftp'].forEach(function(id) {
var btn = document.getElementById(id);
if (!btn) return;
var active = (id === 'srcType' + type.charAt(0).toUpperCase() + type.slice(1));
btn.style.background = active ? 'var(--accent)' : 'none';
btn.style.color = active ? '#fff' : 'var(--muted)';
});
}
function srcFileAutoNameSftp() {
var labelEl = document.getElementById('srcFileLabel');
if (labelEl && labelEl._userEdited) return;
var host = (document.getElementById('srcFileSftpHost')||{}).value || '';
if (labelEl && host) labelEl.value = host;
}
function srcFileSftpAuthSelect(authType) {
document.getElementById('srcFileSftpAuth').value = authType;
var pwFields = document.getElementById('srcSftpPwFields');
var keyFields = document.getElementById('srcSftpKeyFields');
var btnPw = document.getElementById('srcSftpAuthPw');
var btnKey = document.getElementById('srcSftpAuthKey');
if (pwFields) pwFields.style.display = authType === 'password' ? '' : 'none';
if (keyFields) keyFields.style.display = authType === 'key' ? 'flex' : 'none';
if (btnPw) { btnPw.style.background = authType==='password'?'var(--accent)':'none'; btnPw.style.color = authType==='password'?'#fff':'var(--muted)'; }
if (btnKey) { btnKey.style.background = authType==='key'?'var(--accent)':'none'; btnKey.style.color = authType==='key'?'#fff':'var(--muted)'; }
}
function srcFileDetectSmb() {
const p = document.getElementById('srcFilePath').value;
const isSmb = p.startsWith('//') || p.startsWith('\\\\');
@ -428,29 +476,79 @@ function srcFileAutoName() {
async function srcFileAdd() {
const label = document.getElementById('srcFileLabel').value.trim();
const sourceType = (document.getElementById('srcFileSourceType')||{}).value || 'local';
const stat = document.getElementById('srcFileStatus');
const editIdEl = document.getElementById('srcFileEditId');
const existingId = editIdEl ? editIdEl.value : '';
if (!label) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_name_required','Name is required.'); document.getElementById('srcFileLabel').focus(); return; }
stat.style.color='var(--muted)'; stat.textContent=t('m365_fsrc_saving','Saving...');
var body = {label, source_type: sourceType};
if (existingId) body.id = existingId;
if (sourceType === 'sftp') {
const sftpHost = document.getElementById('srcFileSftpHost').value.trim();
const sftpUser = document.getElementById('srcFileSftpUser').value.trim();
const sftpPath = document.getElementById('srcFileSftpPath').value.trim() || '/';
const sftpPort = parseInt(document.getElementById('srcFileSftpPort').value) || 22;
const sftpAuth = document.getElementById('srcFileSftpAuth').value || 'password';
if (!sftpHost) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_host_required','SFTP host is required.'); return; }
if (!sftpUser) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_user_required','SFTP username is required.'); return; }
Object.assign(body, {sftp_host:sftpHost, sftp_port:sftpPort, sftp_user:sftpUser, sftp_auth:sftpAuth, path:sftpPath});
if (sftpAuth === 'password') {
const sftpPw = document.getElementById('srcFileSftpPw').value;
if (sftpPw) {
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:sftpPw})}); } catch(e){}
}
} else {
// Upload key file if one is selected
const keyFileEl = document.getElementById('srcFileSftpKeyFile');
const keyStatusEl = document.getElementById('srcFileSftpKeyStatus');
const keyPathEl = document.getElementById('srcFileSftpKeyPath');
if (keyFileEl && keyFileEl.files.length && !keyPathEl.value) {
try {
const fd = new FormData(); fd.append('key_file', keyFileEl.files[0]);
const kr = await fetch('/api/file_sources/upload_key',{method:'POST',body:fd});
const kd = await kr.json();
if (kd.error) { stat.style.color='var(--danger)'; stat.textContent=kd.error; return; }
keyPathEl.value = kd.key_path;
if (keyStatusEl) keyStatusEl.textContent = t('m365_fsrc_sftp_key_uploaded','Key uploaded');
} catch(e){ stat.style.color='var(--danger)'; stat.textContent=e.message; return; }
}
body.sftp_key_path = keyPathEl ? keyPathEl.value : '';
const passphrase = (document.getElementById('srcFileSftpPassphrase')||{}).value || '';
if (passphrase) {
const passphraseKey = sftpHost+':'+sftpUser+':passphrase';
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:passphrase,keychain_key:passphraseKey})}); } catch(e){}
body.keychain_key = passphraseKey;
}
}
} else {
const path = document.getElementById('srcFilePath').value.trim();
const smbHost = document.getElementById('srcFileSmbHost').value.trim();
const smbUser = document.getElementById('srcFileSmbUser').value.trim();
const smbPw = document.getElementById('srcFileSmbPw').value;
const stat = document.getElementById('srcFileStatus');
if (!label) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_name_required','Name is required.'); document.getElementById('srcFileLabel').focus(); return; }
if (!path) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_path_required','Path is required.'); return; }
stat.style.color='var(--muted)'; stat.textContent=t('m365_fsrc_saving','Saving...');
Object.assign(body, {path, smb_host:smbHost, smb_user:smbUser});
if (smbPw && smbUser) {
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'smb',smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
}
}
try {
const editId = document.getElementById('srcFileEditId');
const existingId = editId ? editId.value : '';
const body = {label, path, smb_host:smbHost, smb_user:smbUser};
if (existingId) body.id = existingId;
const r = await fetch('/api/file_sources/save',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)});
const d = await r.json();
if (d.error) { stat.style.color='var(--danger)'; stat.textContent=d.error; return; }
['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';el._userEdited=false;}});
if (editId) editId.value='';
// Reset form
['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw',
'srcFileSftpHost','srcFileSftpUser','srcFileSftpPw','srcFileSftpPassphrase','srcFileSftpKeyPath'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';if(el._userEdited!==undefined)el._userEdited=false;}});
var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value='22';
if (editIdEl) editIdEl.value='';
const addBtn=document.getElementById('srcFileAddBtn'); if(addBtn) addBtn.textContent=t('m365_fsrc_add_btn','Add');
document.getElementById('srcFileSmbFields').style.display='none';
srcFileTypeSelect('local');
stat.style.color='var(--accent)'; stat.textContent='\u2714 '+t('m365_fsrc_saved','Source saved');
await _loadFileSources();
srcFileRenderList();
@ -462,20 +560,28 @@ function srcFileEdit(id) {
const s = S._fileSources.find(function(x){return x.id===id;});
if (!s) return;
const labelEl = document.getElementById('srcFileLabel');
const pathEl = document.getElementById('srcFilePath');
const hostEl = document.getElementById('srcFileSmbHost');
const userEl = document.getElementById('srcFileSmbUser');
const pwEl = document.getElementById('srcFileSmbPw');
const editId = document.getElementById('srcFileEditId');
if (labelEl) { labelEl.value = s.label||''; labelEl._userEdited = true; }
if (pathEl) pathEl.value = s.path||'';
if (hostEl) hostEl.value = s.smb_host||'';
if (userEl) userEl.value = s.smb_user||'';
if (pwEl) pwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
if (editId) editId.value = id;
const isSmb = (s.path||'').startsWith('//') || (s.path||'').startsWith('\\\\');
const smbFields = document.getElementById('srcFileSmbFields');
if (smbFields) smbFields.style.display = isSmb ? 'flex' : 'none';
var sourceType = s.source_type || (((s.path||'').startsWith('//')||(s.path||'').startsWith('\\\\')) ? 'smb' : 'local');
srcFileTypeSelect(sourceType);
if (sourceType === 'sftp') {
var hostEl = document.getElementById('srcFileSftpHost'); if(hostEl) hostEl.value = s.sftp_host||'';
var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value = s.sftp_port||22;
var userEl = document.getElementById('srcFileSftpUser'); if(userEl) userEl.value = s.sftp_user||'';
var pathEl = document.getElementById('srcFileSftpPath'); if(pathEl) pathEl.value = s.path||'/';
var authEl = document.getElementById('srcFileSftpAuth'); if(authEl) authEl.value = s.sftp_auth||'password';
srcFileSftpAuthSelect(s.sftp_auth||'password');
if (s.sftp_key_path) { var kp = document.getElementById('srcFileSftpKeyPath'); if(kp) kp.value=s.sftp_key_path; }
} else {
var pathEl2 = document.getElementById('srcFilePath'); if(pathEl2) pathEl2.value = s.path||'';
var smbHostEl = document.getElementById('srcFileSmbHost'); if(smbHostEl) smbHostEl.value = s.smb_host||'';
var smbUserEl = document.getElementById('srcFileSmbUser'); if(smbUserEl) smbUserEl.value = s.smb_user||'';
var smbPwEl = document.getElementById('srcFileSmbPw'); if(smbPwEl) smbPwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
}
const btn = document.getElementById('srcFileAddBtn');
if (btn) btn.textContent = t('m365_fsrc_save_changes','Save changes');
const stat = document.getElementById('srcFileStatus');
@ -547,9 +653,7 @@ function _renderFileSources() {
return;
}
list.innerHTML = S._fileSources.map(function(s) {
- const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
- const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
- const userPart = s.smb_user ? ' \u00b7 \uD83D\uDC64 ' + _esc(s.smb_user) : '';
+ const icon = _srcIcon(s);
const sid = _esc(s.id || '');
const slabel = _esc(s.label || s.path || '');
return '<div class="fsrc-row">'
@ -559,7 +663,7 @@ function _renderFileSources() {
+ '<button class="btn-scan" onclick="fsrcScan(\'' + sid + '\')">&#9654; ' + t('m365_fsrc_scan_btn','Scan') + '</button>'
+ '<button class="btn-del" onclick="fsrcDelete(\'' + sid + '\',\'' + slabel + '\')">' + t('m365_profile_delete','Delete') + '</button>'
+ '</div></div>'
- + '<div class="fsrc-row-path">' + _esc(s.path || '') + userPart + '</div>'
+ + '<div class="fsrc-row-path">' + _srcSubtitle(s) + '</div>'
+ '</div>';
}).join('');
}
@ -667,6 +771,9 @@ window.getGoogleScanOptions = getGoogleScanOptions;
window.srcFileRenderList = srcFileRenderList;
window.srcFileDetectSmb = srcFileDetectSmb;
window.srcFileAutoName = srcFileAutoName;
window.srcFileAutoNameSftp = srcFileAutoNameSftp;
window.srcFileTypeSelect = srcFileTypeSelect;
window.srcFileSftpAuthSelect = srcFileSftpAuthSelect;
window.srcFileAdd = srcFileAdd;
window.srcFileEdit = srcFileEdit;
window.srcFileDelete = srcFileDelete;
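
The source-type fallback used in `srcFileEdit` (explicit `source_type`, otherwise inferred from the path prefix) can be isolated as below. This is a sketch: `inferSourceType` is a hypothetical name, and the real code inlines the expression.

```javascript
// Hypothetical helper mirroring the fallback in srcFileEdit():
// saved sources without source_type are classified by their path prefix.
function inferSourceType(s) {
  if (s.source_type) return s.source_type;
  const p = s.path || '';
  const looksLikeUnc = p.startsWith('//') || p.startsWith('\\\\');
  return looksLikeUnc ? 'smb' : 'local';
}
```

This keeps sources saved before SFTP support working, since they fall through to the path-based `smb`/`local` check.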

static/js/history.js Normal file

@ -0,0 +1,255 @@
// ── Scan history browser ──────────────────────────────────────────────────────
// Lets the user load and browse results from any past scan session without
// running a new scan. Sessions are groups of concurrent M365 + Google + File
// scans (same 300-second window used by get_session_items on the server).
import { S } from './state.js';
const _SRC_LABELS = {
email: 'Outlook',
onedrive: 'OneDrive',
sharepoint: 'SharePoint',
teams: 'Teams',
gmail: 'Gmail',
gdrive: 'Google Drive',
local: 'Lokal',
smb: 'SMB',
};
let _sessions = null; // cached list; null = stale
let _latestRefScanId = null; // ref_scan_id of the newest session
// ── Session cache ─────────────────────────────────────────────────────────────
async function _fetchSessions() {
try {
const r = await fetch('/api/db/sessions');
_sessions = await r.json();
} catch(e) {
_sessions = [];
}
_latestRefScanId = _sessions.length ? _sessions[0].ref_scan_id : null;
return _sessions;
}
function invalidateHistoryCache() {
_sessions = null;
_latestRefScanId = null;
}
// ── Load a session into the results grid ──────────────────────────────────────
// Default landing view: every flagged item still awaiting action, across all
// scans (not just the latest session). Leaves S._historyRefScanId null (live
// mode) and shows no history banner — this is "now", not a past session.
async function loadOpenItems() {
// Bail if a scan is running — live SSE owns the grid then.
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
try {
const r = await fetch('/api/db/flagged');
const items = await r.json();
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
closeHistoryPicker();
if (!Array.isArray(items) || items.length === 0) {
S._historyRefScanId = null;
_setHistoryBanner(false);
window.loadLastScanSummary?.();
return;
}
S._historyRefScanId = null;
S.flaggedData = items;
S.filteredData = [];
const grid = document.getElementById('grid');
const emptyState = document.getElementById('emptyState');
const lastScan = document.getElementById('lastScanSummary');
if (emptyState) emptyState.style.display = 'none';
if (lastScan) lastScan.style.display = 'none';
if (grid) { grid.innerHTML = ''; grid.style.display = 'grid'; }
window.renderGrid(items);
try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(false);
} catch(e) {
console.error('[history] failed to load open items:', e);
}
}
async function loadHistorySession(refScanId) {
// refScanId: null → all open (unreviewed) items across every scan,
// positive int → a specific past session
if (refScanId === null) return loadOpenItems();
const resolvedRef = refScanId;
try {
const r = await fetch('/api/db/flagged?ref=' + resolvedRef);
const items = await r.json();
// Bail if a scan started while we were fetching flagged items
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
closeHistoryPicker();
if (!Array.isArray(items) || items.length === 0) {
S._historyRefScanId = null;
_setHistoryBanner(false);
window.loadLastScanSummary?.();
return;
}
S._historyRefScanId = resolvedRef;
S.flaggedData = items;
S.filteredData = [];
const grid = document.getElementById('grid');
const emptyState = document.getElementById('emptyState');
const lastScan = document.getElementById('lastScanSummary');
if (emptyState) emptyState.style.display = 'none';
if (lastScan) lastScan.style.display = 'none';
if (grid) { grid.innerHTML = ''; grid.style.display = 'grid'; }
window.renderGrid(items);
try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(true, resolvedRef);
// ── Re-scan diff: append items from previous session no longer present ────
const allSessions = _sessions !== null ? _sessions : await _fetchSessions();
const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef);
if (idx !== -1 && idx + 1 < allSessions.length) {
const prevRef = allSessions[idx + 1].ref_scan_id;
try {
const pr = await fetch('/api/db/flagged?ref=' + prevRef);
const prevItems = await pr.json();
if (Array.isArray(prevItems) && prevItems.length) {
const currentIds = new Set(items.map(f => f.id));
const resolved = prevItems.filter(f => !currentIds.has(f.id));
if (resolved.length) {
const divider = document.createElement('div');
divider.className = 'resolved-divider';
divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present');
document.getElementById('grid')?.appendChild(divider);
resolved.forEach(f => { f._resolved = true; window.appendCard(f); });
_setHistoryBanner(true, resolvedRef, resolved.length);
}
}
} catch(e) {
console.warn('[history] diff failed:', e);
}
}
} catch(e) {
console.error('[history] failed to load session:', e);
}
}
// ── Banner ────────────────────────────────────────────────────────────────────
function _setHistoryBanner(visible, resolvedRef, resolvedCount) {
const banner = document.getElementById('historyBanner');
const bannerTxt = document.getElementById('historyBannerText');
const latestBtn = document.getElementById('historyLatestBtn');
if (!banner) return;
if (!visible) { banner.style.display = 'none'; return; }
const sess = (_sessions || []).find(s => s.ref_scan_id === resolvedRef);
let label = '';
if (sess) {
const date = new Date(sess.started_at * 1000).toLocaleDateString(undefined,
{day: 'numeric', month: 'short', year: 'numeric'});
const time = new Date(sess.started_at * 1000).toLocaleTimeString(undefined,
{hour: '2-digit', minute: '2-digit'});
const srcStr = (sess.sources || []).map(s => _SRC_LABELS[s] || s).join(' · ');
label = date + ' ' + time
+ (srcStr ? ' · ' + srcStr : '')
+ ' · ' + sess.flagged_count + ' ' + t('history_items', 'items');
if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved');
} else {
label = S.flaggedData.length + ' ' + t('history_items', 'items');
}
if (bannerTxt) bannerTxt.textContent = label;
if (latestBtn) latestBtn.style.display = (resolvedRef !== _latestRefScanId) ? '' : 'none';
banner.style.display = 'flex';
}
function exitHistoryMode() {
S._historyRefScanId = null;
const banner = document.getElementById('historyBanner');
if (banner) banner.style.display = 'none';
closeHistoryPicker();
}
// ── Session picker dropdown ───────────────────────────────────────────────────
async function openHistoryPicker() {
const drop = document.getElementById('historyDropdown');
if (!drop) return;
// Toggle
if (drop.style.display !== 'none') { drop.style.display = 'none'; return; }
drop.innerHTML = '<div style="padding:10px 12px;font-size:12px;color:var(--muted)">'
+ t('lbl_loading', 'Loading\u2026') + '</div>';
drop.style.display = '';
const sessions = _sessions !== null ? _sessions : await _fetchSessions();
if (!sessions.length) {
drop.innerHTML = '<div style="padding:12px;font-size:12px;color:var(--muted);text-align:center">'
+ t('history_picker_empty', 'No past scans') + '</div>';
return;
}
drop.innerHTML = '';
sessions.forEach((sess, i) => {
const date = new Date(sess.started_at * 1000).toLocaleDateString(undefined,
{day: 'numeric', month: 'short', year: 'numeric'});
const time = new Date(sess.started_at * 1000).toLocaleTimeString(undefined,
{hour: '2-digit', minute: '2-digit'});
const srcStr = (sess.sources || []).map(s => _SRC_LABELS[s] || s).join(' · ');
const isActive = sess.ref_scan_id === S._historyRefScanId;
const row = document.createElement('div');
row.style.cssText = 'padding:8px 12px;cursor:pointer'
+ (i < sessions.length - 1 ? ';border-bottom:1px solid var(--border)' : '')
+ (isActive ? ';background:var(--bg)' : '');
row.innerHTML =
'<div style="display:flex;align-items:center;gap:6px;margin-bottom:2px">' +
'<span style="font-size:12px;font-weight:500;color:var(--text)">' + date + '</span>' +
'<span style="font-size:10px;color:var(--muted)">' + time + '</span>' +
(sess.delta
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;font-weight:600">'
+ t('history_delta_badge', 'Delta') + '</span>'
: '') +
(i === 0
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;font-weight:600">'
+ t('history_latest_badge', 'Latest') + '</span>'
: '') +
'</div>' +
'<div style="font-size:10px;color:var(--muted)">' +
srcStr + ' &nbsp;\u00b7&nbsp; ' + sess.flagged_count + ' ' + t('history_items', 'items') +
'</div>';
row.addEventListener('mouseenter', () => { if (!isActive) row.style.background = 'var(--surface)'; });
row.addEventListener('mouseleave', () => { row.style.background = isActive ? 'var(--bg)' : ''; });
row.addEventListener('click', () => loadHistorySession(sess.ref_scan_id));
drop.appendChild(row);
});
}
function closeHistoryPicker() {
const drop = document.getElementById('historyDropdown');
if (drop) drop.style.display = 'none';
}
// Close picker when clicking outside its container
document.addEventListener('click', e => {
const wrap = document.getElementById('historyPickerBtn')?.closest('[data-history-wrap]');
if (wrap && !wrap.contains(e.target)) closeHistoryPicker();
}, true);
// ── Window exports ────────────────────────────────────────────────────────────
window.loadHistorySession = loadHistorySession;
window.openHistoryPicker = openHistoryPicker;
window.closeHistoryPicker = closeHistoryPicker;
window.exitHistoryMode = exitHistoryMode;
window.invalidateHistoryCache = invalidateHistoryCache;
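
The re-scan diff inside `loadHistorySession` boils down to two steps: find the session that precedes the selected one, then keep the previous items whose `id` is missing from the current set. The sketch below restates that logic as pure functions; `previousRef` and `findResolved` are hypothetical names, and sessions are assumed to be ordered newest first, as the `_sessions[0]` "latest" convention suggests.

```javascript
// Hypothetical: ref_scan_id of the session just before `ref`, or null.
// Assumes `sessions` is sorted newest first.
function previousRef(sessions, ref) {
  const idx = sessions.findIndex(s => s.ref_scan_id === ref);
  return (idx !== -1 && idx + 1 < sessions.length) ? sessions[idx + 1].ref_scan_id : null;
}

// Hypothetical: items present before but absent now, tagged as resolved.
// Returns copies, so the caller's previous-session array is not mutated.
function findResolved(currentItems, prevItems) {
  const currentIds = new Set(currentItems.map(f => f.id));
  return prevItems
    .filter(f => !currentIds.has(f.id))
    .map(f => Object.assign({}, f, { _resolved: true }));
}
```

The resolved items are meant for display only, which is why they are kept out of `S.flaggedData`.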


@ -161,10 +161,9 @@ function copyLog() {
document.querySelectorAll('#logPanel .log-line:not(#logLive)').forEach(function(d) {
lines.push(d.textContent);
});
- navigator.clipboard.writeText(lines.join('\n')).then(function() {
const btn = document.querySelector('.log-copy-btn');
- if (btn) { btn.textContent = '✓ Copied'; setTimeout(function(){ btn.textContent = '⎘ Copy'; }, 1500); }
- }).catch(function() {});
+ // _copyText (viewer.js) handles HTTP contexts where navigator.clipboard is undefined.
+ if (btn) window._copyText(lines.join('\n'), btn);
}
function _restoreLog() {


@ -69,6 +69,11 @@ function _applyProfile(profile) {
// File sources may not be rendered yet (they load async), so store their IDs
// in S._pendingProfileSources for renderSourcesPanel() to apply after re-render.
const profileSources = profile.sources || [];
// Ensure at least M365 source checkboxes are present before reading the DOM.
// renderSourcesPanel() is idempotent and fast — safe to call here.
if (!document.querySelector('#sourcesPanel input[data-source-id]') && typeof renderSourcesPanel === 'function') {
renderSourcesPanel();
}
document.querySelectorAll('#sourcesPanel input[data-source-id]').forEach(function(cb) {
cb.checked = profileSources.includes(cb.dataset.sourceId);
});
@ -122,6 +127,36 @@ function _applyProfile(profile) {
if (el) el.checked = opts.scan_photos;
}
if (opts.skip_gps_images !== undefined) {
const el = document.getElementById('optSkipGps');
if (el) el.checked = opts.skip_gps_images;
}
if (opts.min_cpr_count !== undefined) {
const el = document.getElementById('optMinCpr');
if (el) el.value = opts.min_cpr_count;
}
if (opts.ocr_lang !== undefined) {
const el = document.getElementById('optOcrLang');
if (el) el.value = opts.ocr_lang;
}
if (opts.cpr_only !== undefined) {
const el = document.getElementById('optCprOnly');
if (el) el.checked = opts.cpr_only;
}
if (opts.scan_emails !== undefined) {
const el = document.getElementById('optScanEmails');
if (el) el.checked = opts.scan_emails;
}
if (opts.scan_phones !== undefined) {
const el = document.getElementById('optScanPhones');
if (el) el.checked = opts.scan_phones;
}
// ── Date filter ───────────────────────────────────────────────────────────
const days = opts.older_than_days;
if (days !== undefined) {
@ -171,8 +206,13 @@ function _applyProfile(profile) {
// ── User selection ────────────────────────────────────────────────────────
if (profile.user_ids === 'all') {
if (S._allUsers.length) {
S._allUsers.forEach(u => { u.selected = true; });
if (S._allUsers.length) renderAccountList();
renderAccountList();
} else {
// Users not loaded yet — defer until loadUsers() resolves
window._pendingProfileAllUsers = true;
}
} else if (Array.isArray(profile.user_ids) && profile.user_ids.length) {
window._pendingProfileUserIds = profile.user_ids.map(u => u.id || u);
_applyPendingProfileUsers();
@ -335,7 +375,8 @@ function _openEditorForProfile(profile) {
: (u.platform || 'm365') === 'google' ? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#EAF3DE;color:#3B6D11;font-weight:500">GWS</span>'
: '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#E6F1FB;color:#185FA5;font-weight:500">M365</span>';
const roleBadge = u.userRole === 'student' ? t('role_student','Elev') : u.userRole === 'staff' ? t('role_staff','Ansat') : t('role_other','Anden');
return `<label class="pmgmt-acct-row" data-uid="${_esc(u.id)}"><input type="checkbox" ${checked} data-uid="${_esc(u.id)}"><span style="flex:1;color:var(--color-text-primary);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">${_esc(u.displayName)}</span>${platBadge}<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#D3D1C7;color:#444441">${roleBadge}</span></label>`;
const roleOverrideStyle = u.roleOverride ? 'color:var(--color-text-info);outline:1px solid var(--color-border-info);' : '';
return `<label class="pmgmt-acct-row" data-uid="${_esc(u.id)}" data-role="${_esc(u.userRole || 'other')}"><input type="checkbox" ${checked} data-uid="${_esc(u.id)}"><span style="flex:1;color:var(--color-text-primary);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">${_esc(u.displayName)}</span>${platBadge}<button type="button" class="pmgmt-role-badge" data-uid="${_esc(u.id)}" onclick="_pmgmtCycleRole(this.getAttribute('data-uid'),event)" style="font-size:9px;padding:1px 5px;border-radius:10px;background:#D3D1C7;border:none;cursor:pointer;${roleOverrideStyle}">${roleBadge}</button></label>`;
}).join('');
body.innerHTML = `
@ -394,6 +435,12 @@ function _openEditorForProfile(profile) {
<div class="pmgmt-opt-row"><span>${t('m365_opt_max_emails','Maks. e-mails pr. bruger')}</span><input type="number" id="peOptMaxEmails" value="${opts.max_emails || 2000}" min="10" max="50000" style="width:56px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_delta','Delta-scanning')}</span><label class="toggle"><input type="checkbox" id="peOptDelta" ${opts.delta ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_cpr_only','CPR-only mode')}</span><label class="toggle"><input type="checkbox" id="peOptCprOnly" ${opts.cpr_only ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_ocr_lang','OCR-sprog')}</span><select id="peOptOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px"><option value="dan+eng" ${(opts.ocr_lang||'dan+eng')==='dan+eng'?'selected':''}>dan+eng</option><option value="dan" ${opts.ocr_lang==='dan'?'selected':''}>dan</option><option value="eng" ${opts.ocr_lang==='eng'?'selected':''}>eng</option><option value="dan+eng+deu" ${opts.ocr_lang==='dan+eng+deu'?'selected':''}>dan+eng+deu</option><option value="dan+eng+swe" ${opts.ocr_lang==='dan+eng+swe'?'selected':''}>dan+eng+swe</option><option value="dan+eng+fra" ${opts.ocr_lang==='dan+eng+fra'?'selected':''}>dan+eng+fra</option></select></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_emails','Søg efter e-mailadresser')}</span><label class="toggle"><input type="checkbox" id="peOptEmails" ${opts.scan_emails ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_phones','Søg efter telefonnumre')}</span><label class="toggle"><input type="checkbox" id="peOptPhones" ${opts.scan_phones ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<hr style="border:none;border-top:1px solid var(--pmgmt-divider);margin:2px 0">
<div class="pmgmt-opt-row"><span>${t('m365_opt_retention','Opbevaringspolitik')}</span><label class="toggle"><input type="checkbox" id="peOptRetention" ${profile.retention_years ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div style="padding:7px 8px;background:var(--bg);border-radius:6px">
@ -503,6 +550,26 @@ function _pmgmtCloseEditor() {
closeProfileMgmt();
}
async function _pmgmtCycleRole(uid, event) {
event.stopPropagation();
if (typeof cycleUserRole !== 'function') return;
await cycleUserRole(uid);
// Refresh the badge inside the profile modal to reflect the new role
const u = S._allUsers.find(function(u){ return u.id === uid; });
if (!u) return;
const lbl = document.querySelector('#pmgmtAcctList label[data-uid="' + uid.replace(/"/g, '\\"') + '"]');
if (!lbl) return;
const badge = lbl.querySelector('.pmgmt-role-badge');
if (!badge) return;
const roleText = u.userRole === 'student' ? t('role_student','Elev')
: u.userRole === 'staff' ? t('role_staff','Ansat')
: t('role_other','Anden');
badge.textContent = roleText;
lbl.dataset.role = u.userRole || 'other';
badge.style.color = u.roleOverride ? 'var(--color-text-info)' : '';
badge.style.outline = u.roleOverride ? '1px solid var(--color-border-info)' : '';
}
function _pmgmtSelectAllAccounts(checked) {
document.querySelectorAll('#pmgmtAcctList label input[type=checkbox]').forEach(function(cb) {
if (cb.closest('label').style.display !== 'none') cb.checked = checked;
@ -542,9 +609,8 @@ function _pmgmtFilterAccounts(q) {
q = (q || '').toLowerCase();
document.querySelectorAll('#pmgmtAcctList label').forEach(function(lbl) {
var name = (lbl.querySelector('span') || {}).textContent || '';
var uid = lbl.querySelector('input')?.dataset?.uid || '';
var user = S._allUsers.find(u => u.id === uid);
var roleOk = !_pmgmtRoleActive || (user && user.userRole === _pmgmtRoleActive);
var role = lbl.dataset.role || 'other';
var roleOk = !_pmgmtRoleActive || role === _pmgmtRoleActive;
var nameOk = !q || name.toLowerCase().includes(q);
lbl.style.display = (roleOk && nameOk) ? '' : 'none';
});
@ -589,6 +655,12 @@ async function _pmgmtSaveFullEdit() {
max_emails: parseInt(document.getElementById('peOptMaxEmails')?.value) || 2000,
delta: document.getElementById('peOptDelta')?.checked ?? false,
scan_photos: document.getElementById('peOptPhotos')?.checked ?? false,
skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false,
min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1,
ocr_lang: document.getElementById('peOptOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('peOptCprOnly')?.checked ?? false,
scan_emails: document.getElementById('peOptEmails')?.checked ?? false,
scan_phones: document.getElementById('peOptPhones')?.checked ?? false,
},
retention_years: document.getElementById('peOptRetention')?.checked ? (parseInt(document.getElementById('peOptRetYears')?.value) || 5) : null,
fiscal_year_end: document.getElementById('peOptRetention')?.checked ? (document.getElementById('peOptFiscalYearEnd')?.value || '') : '',
@ -601,6 +673,7 @@ async function _pmgmtSaveFullEdit() {
const d = await r.json();
if (d.error) { alert(d.error); return; }
await loadProfiles();
_renderProfileMgmt();
window._pmgmtNewDraft = null;
log(t('m365_profile_saved','Profile saved') + ': ' + name);
// Show inline saved feedback without closing the modal
@ -614,7 +687,10 @@ async function _pmgmtSaveFullEdit() {
}
// Re-open the editor for the saved profile so it reflects the saved state
const saved = S._profiles.find(function(p) { return p.name === name; });
if (saved) { window._pmgmtEditId = saved.id; }
if (saved) {
window._pmgmtEditId = saved.id;
document.querySelectorAll('.pmgmt-row').forEach(r => r.classList.toggle('active', r.dataset.id === saved.id));
}
} catch(e) { alert('Save failed: ' + e.message); }
}
@ -698,6 +774,7 @@ window._peSetYear = _peSetYear;
window._renderEditorSources = _renderEditorSources;
window._pmgmtNewProfile = _pmgmtNewProfile;
window._pmgmtCloseEditor = _pmgmtCloseEditor;
window._pmgmtCycleRole = _pmgmtCycleRole;
window._pmgmtSelectAllAccounts = _pmgmtSelectAllAccounts;
window._pmgmtRoleFilter = _pmgmtRoleFilter;
window._pmgmtAddManual = _pmgmtAddManual;


@ -1,4 +1,18 @@
import { S } from './state.js';
// Escape untrusted strings (filenames, account/display names, folders) before
// embedding them in innerHTML / title attributes. Scan-derived values can come
// from attacker-controlled content (e.g. a OneDrive file named with markup),
// so every such field must pass through esc() to prevent stored XSS.
function esc(s) {
return String(s == null ? '' : s)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;');
}
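A minimal sketch of what the escaping above does to a hostile file name. The `esc` body is copied from this hunk; the `cardNameHtml` helper and the sample name are hypothetical and only illustrate how a scan-derived value would be neutralised before it reaches `innerHTML`.

```javascript
// Copy of esc() from the hunk above.
function esc(s) {
  return String(s == null ? '' : s)
    .replace(/&/g, '&amp;')   // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: builds card-name markup the way appendCard does,
// using the value both as an attribute and as element text.
function cardNameHtml(name) {
  return '<div class="card-name" title="' + esc(name) + '">' + esc(name) + '</div>';
}

// Hypothetical attacker-controlled file name.
const hostile = '"><img src=x onerror=alert(1)>.docx';
console.log(cardNameHtml(hostile));
```

Because `&` is replaced before the other characters, the entities produced by the later replacements are left intact.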
// ── Cards ─────────────────────────────────────────────────────────────────────
const SOURCE_BADGES = {
email: ['📧', 'badge-email', 'Outlook'],
@ -11,6 +25,31 @@ const SOURCE_BADGES = {
smb: ['🌐', 'badge-smb', 'Network'],
};
// Build the user/group pill for a card. The group (role) badge is driven by
// user_role alone so it shows even when no display name is available — e.g.
// items from earlier scans saved before account_name was persisted. For those
// the user label is resolved best-effort from the loaded user list (by id or
// email), falling back to an email-style account_id. Returns '' when there is
// neither a label nor a role to show.
function _accountPill(f) {
const roleBadge =
f.user_role === 'student' ? '<span class="role-badge">' + t('role_student', 'Elev') + '</span>' :
f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff', 'Ansat') + '</span>' : '';
let label = f.account_name || '';
if (!label && f.account_id) {
const aid = String(f.account_id);
const u = (S._allUsers || []).find(function(u) {
return u.id === f.account_id ||
(u.email && u.email.toLowerCase() === aid.toLowerCase());
});
if (u) label = u.displayName || '';
else if (aid.includes('@')) label = aid; // an email is already human-readable
}
if (!label && !roleBadge) return '';
const title = label || f.user_role || '';
return '<span class="account-pill" title="' + esc(title) + '">' + roleBadge + (label ? esc(label) : '') + '</span>';
}
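The label fallback described in the comment above can be sketched as a pure function. This is an illustration only: `resolveAccountLabel` and the sample user list are hypothetical, and `users` stands in for `S._allUsers`.

```javascript
// Hypothetical pure version of the label lookup inside _accountPill.
function resolveAccountLabel(f, users) {
  if (f.account_name) return f.account_name;      // persisted name wins
  if (!f.account_id) return '';
  const aid = String(f.account_id);
  const u = (users || []).find(function (u) {
    return u.id === f.account_id ||
      (u.email && u.email.toLowerCase() === aid.toLowerCase());
  });
  if (u) return u.displayName || '';              // matched by id or email
  return aid.includes('@') ? aid : '';            // an email is readable as-is
}

// Hypothetical sample data.
const users = [{ id: 'u1', email: 'Anna@Example.dk', displayName: 'Anna Jensen' }];
console.log(resolveAccountLabel({ account_id: 'anna@example.dk' }, users)); // Anna Jensen
```

An opaque id that matches no loaded user yields an empty label, so in that case the pill is rendered only if a role badge exists.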
function appendCard(f) {
const search = document.getElementById('filterSearch').value.trim().toLowerCase();
const srcVal = document.getElementById('filterSource').value;
@ -24,36 +63,57 @@ function appendCard(f) {
: '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type);
const card = document.createElement('div');
card.className = 'card' + (S.isListView ? ' list-view' : '');
card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + ((f._resolved || f._redacted || f._deleted) ? ' card-resolved' : '');
card.dataset.id = f.id;
card.onclick = () => openPreview(f);
card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } };
const delBtn = window.VIEWER_MODE ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
const cb = document.createElement('input');
cb.type = 'checkbox';
cb.className = 'card-cb';
cb.checked = S._selectedIds.has(f.id);
cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); };
card.appendChild(cb);
const delBtn = (window.VIEWER_MODE || f._resolved || f._redacted || f._deleted) ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const _cloudRedactExts = new Set(['.docx', '.xlsx', '.pdf']);
const _m365Types = new Set(['onedrive', 'sharepoint', 'teams']);
const _fileExt = (f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase();
const _redactable = !window.VIEWER_MODE && !f._resolved && !f._redacted && !f._deleted && f.cpr_count > 0 && (
f.source_type === 'local' ? _redactExts.has(_fileExt) :
_m365Types.has(f.source_type) ? _cloudRedactExts.has(_fileExt) :
f.source_type === 'gdrive' ? _cloudRedactExts.has(_fileExt) :
(f.source_type === 'smb' || f.source_type === 'sftp') ? _redactExts.has(_fileExt) : false
);
const redactBtn = _redactable ? `<button class="card-redact-btn" title="${t('redact_btn','Redact CPR')}" onclick="event.stopPropagation();redactItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">✏</button>` : '';
const acctPill = _accountPill(f);
if (S.isListView) {
card.innerHTML = `
<div style="font-size:24px; flex-shrink:0">${icon}</div>
<div class="card-info list-info">
<div class="card-name" title="${f.name}">${f.name}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}${f.folder ? ' · 📂 ' + f.folder : ''}</div>
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span> ${f.source || ''}${f.account_name ? ' · <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === 'student' ? '<span class="role-badge">' + t('role_student','Elev') + '</span>' : f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff','Ansat') + '</span>' : '') + f.account_name + '</span>' : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}${f.folder ? ' · 📂 ' + esc(f.folder) : ''}</div>
<div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span> ${esc(f.source || '')}${acctPill ? ' · ' + acctPill : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>
${f.email_count > 0 ? '<span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span> ' : ''}
${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
${delBtn}`;
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f._deleted ? '<span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span> ' : ''}${f._redacted ? '<span class="resolved-badge">✏ ' + t('redact_badge', 'Redacted') + '</span> ' : ''}${f._resolved ? '<span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
${delBtn}${redactBtn}`;
} else {
card.innerHTML = `
<div class="thumb-wrap"><img src="${src}" alt="${f.name}" loading="lazy"></div>
<div class="thumb-wrap"><img src="${src}" alt="${esc(f.name)}" loading="lazy"></div>
<div class="card-info">
<div class="card-name" title="${f.name}">${f.name}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
<div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${esc(f.folder)}">📂 ${esc(f.folder)}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span>${acctPill ? ' ' + acctPill : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f._deleted ? ' <span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span>' : ''}${f._redacted ? ' <span class="resolved-badge"> ' + t('redact_badge', 'Redacted') + '</span>' : ''}${f._resolved ? ' <span class="resolved-badge"> ' + t('history_resolved_badge', 'Resolved') + '</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
</div>
${delBtn}`;
${delBtn}${redactBtn}`;
}
grid.appendChild(card);
}
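The `_redactable` expression in `appendCard` can be read as a small decision table. The sketch below restates it as a standalone function; `isRedactable` and its `viewerMode` parameter are hypothetical names (the parameter stands in for `window.VIEWER_MODE`), while the extension and source-type sets are taken from the hunk.

```javascript
const REDACT_EXTS = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const CLOUD_REDACT_EXTS = new Set(['.docx', '.xlsx', '.pdf']);
const FILE_TYPES = new Set(['local', 'smb', 'sftp']);
const CLOUD_TYPES = new Set(['onedrive', 'sharepoint', 'teams', 'gdrive']);

// Hypothetical standalone version of the _redactable check.
function isRedactable(f, viewerMode) {
  // Read-only viewers and already-handled items never get a redact button.
  if (viewerMode || f._resolved || f._redacted || f._deleted) return false;
  if (!(f.cpr_count > 0)) return false;
  const name = f.name || '';
  const ext = name.substring(name.lastIndexOf('.')).toLowerCase();
  if (FILE_TYPES.has(f.source_type)) return REDACT_EXTS.has(ext);
  if (CLOUD_TYPES.has(f.source_type)) return CLOUD_REDACT_EXTS.has(ext);
  return false; // e.g. email items are not redactable
}

console.log(isRedactable({ name: 'list.TXT', source_type: 'local', cpr_count: 2 }, false));
```

Plain-text formats (`.txt`, `.csv`) are offered only for local, SMB and SFTP sources; cloud sources are limited to `.docx`, `.xlsx` and `.pdf`.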
@ -62,6 +122,19 @@ function renderGrid(files) {
const grid = document.getElementById('grid');
grid.innerHTML = '';
files.forEach(f => appendCard(f));
// Whenever results are rendered, the landing/last-scan cards must be hidden —
// the live scan_file_flagged path shows the grid but does not clear them, so
// results would otherwise appear underneath the still-visible landing page
// until a manual refresh. Centralised here so every render path is covered.
if (files && files.length) {
const es = document.getElementById('emptyState');
if (es) es.style.display = 'none';
const ls = document.getElementById('lastScanSummary');
if (ls) ls.style.display = 'none';
if (grid) grid.style.display = S.isListView ? 'block' : 'grid';
}
_updateBulkBar();
updateDispositionStats();
}
// ── Preview panel ─────────────────────────────────────────────────────────────
@ -82,22 +155,30 @@ async function openPreview(f) {
panel.classList.remove('hidden');
const _savedW = sessionStorage.getItem('gdpr_preview_width');
if (_savedW) panel.style.width = _savedW + 'px';
// Opening the panel narrows .grid-area and reflows the grid to fewer columns,
// moving the selected card to a new row. Defer the scroll by two frames so it
// runs against the settled layout, and centre the card so it stays visible.
if (cardEl) requestAnimationFrame(() => requestAnimationFrame(() =>
cardEl.scrollIntoView({ behavior: 'smooth', block: 'center' })));
title.textContent = f.name;
frame.style.display = 'none';
loading.style.display = 'flex';
loading.textContent = 'Loading preview…';
meta.innerHTML = [
f.account_name ? `<span style="font-weight:500">👤 ${f.account_name}</span>` : '',
f.source ? `<span>${f.source}</span>` : '',
f.account_name ? `<span style="font-weight:500">👤 ${esc(f.account_name)}</span>` : '',
f.source ? `<span>${esc(f.source)}</span>` : '',
f.size_kb ? `<span>${f.size_kb} KB</span>` : '',
f.modified ? `<span>${f.modified}</span>` : '',
f.modified ? `<span>${esc(f.modified)}</span>` : '',
f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '',
f.email_count ? `<span style="color:#7ec8f0">${f.email_count} ${t('m365_badge_emails','e-mail')}</span>` : '',
f.phone_count ? `<span style="color:#7eeac0">${f.phone_count} ${t('m365_badge_phones','tlf.')}</span>` : '',
f.url ? `<button class="preview-open-btn" onclick="window.open('${f.url}','_blank')">${t("m365_preview_open","Open in M365 ↗")}</button>` : '',
].filter(Boolean).join('');
_previewItemId = f.id;
loadDisposition(f.id); // load disposition for this item (#6)
loadDisposition(f.id);
_loadRelated(f);
try {
const r = await fetch('/api/preview/' + encodeURIComponent(f.id)
@ -163,6 +244,44 @@ async function openPreview(f) {
}
}
// ── Related documents (CPR cross-reference) ───────────────────────────────────
async function _loadRelated(f) {
const el = document.getElementById('previewRelated');
if (!el) return;
if (!f.cpr_count) { el.style.display = 'none'; return; }
const ref = S._historyRefScanId ? `&ref=${S._historyRefScanId}` : '';
try {
const r = await fetch(`/api/db/related/${encodeURIComponent(f.id)}?${ref}`);
const items = await r.json();
if (f.id !== _previewItemId) return; // stale
if (!items.length) { el.style.display = 'none'; return; }
const rows = items.map(item => {
const shared = item.shared_cprs ?? '';
const badge = shared ? `<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--danger);color:#fff;font-weight:500;flex-shrink:0">${shared} CPR</span>` : '';
const src = item.source ? `<span style="color:var(--muted);font-size:10px;flex-shrink:0">${esc(item.source)}</span>` : '';
return `<div onclick="window._openRelated('${item.id.replace(/'/g,"\\'")}',${JSON.stringify(item).replace(/"/g,'&quot;')})"
style="display:flex;align-items:center;gap:6px;padding:4px 0;cursor:pointer;border-radius:4px"
onmouseover="this.style.background='var(--surface)'" onmouseout="this.style.background=''">
<span style="flex:1;font-size:11px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(item.name)}">${esc(item.name)}</span>
${src}${badge}
</div>`;
}).join('');
el.innerHTML = `<div style="font-size:10px;font-weight:600;color:var(--muted);margin-bottom:4px;text-transform:uppercase;letter-spacing:.04em">${t('m365_related_docs','Related documents')} <span style="font-weight:400">(${items.length})</span></div>${rows}`;
el.style.display = 'block';
} catch(e) {
el.style.display = 'none';
}
}
window._openRelated = function(id, itemData) {
const cached = (S.flaggedData || []).find(x => x.id === id);
openPreview(cached || itemData);
};
// ── Retention policy (#1) ────────────────────────────────────────────────────
function toggleRetentionPanel() {
@ -287,9 +406,9 @@ async function runSubjectLookup() {
_dsubItems = d.items;
resultsEl.innerHTML = d.items.map(item => `
<div class="dsub-result-row">
<div class="dsub-result-name" title="${item.name}">${item.name}</div>
<div class="dsub-result-meta">${item.source_type || ""}</div>
<div class="dsub-result-meta">${item.modified || ""}</div>
<div class="dsub-result-name" title="${esc(item.name)}">${esc(item.name)}</div>
<div class="dsub-result-meta">${esc(item.source_type || "")}</div>
<div class="dsub-result-meta">${esc(item.modified || "")}</div>
<div class="dsub-result-meta" style="color:var(--danger)">${item.cpr_count} CPR</div>
</div>
`).join("");
@ -317,10 +436,13 @@ async function deleteSubjectItems() {
document.getElementById("dsubDeleteBtn").style.display = "none";
document.getElementById("dsubResults").innerHTML = "";
_dsubItems = [];
// Refresh grid
S.flaggedData = S.flaggedData.filter(f => !ids.includes(f.id));
S.filteredData = S.filteredData.filter(f => !ids.includes(f.id));
renderGrid();
// Keep the deleted items in the grid (marked, greyed, buttons hidden)
// until the next scan run — only those the server actually deleted.
const deletedSet = new Set(d.deleted_ids || ids);
const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
} catch(e) {
statusEl.textContent = "Delete failed: " + e.message;
@ -367,6 +489,7 @@ async function saveDisposition() {
// Update cached value on the S.flaggedData item
const item = S.flaggedData.find(f => f.id === _dispositionItemId);
if (item) item.disposition = status;
updateDispositionStats();
// Refresh card badge if a disposition filter is active
const dispFilter = document.getElementById("filterDisposition")?.value;
if (dispFilter) applyFilters();
@ -375,6 +498,133 @@ async function saveDisposition() {
}
}
// ── Disposition stats ─────────────────────────────────────────────────────────
function updateDispositionStats() {
const el = document.getElementById('dispStats');
if (!el) return;
const data = S.flaggedData;
if (!data.length) { el.style.display = 'none'; return; }
let unreviewed = 0, retain = 0, del = 0, other = 0;
for (const f of data) {
const d = f.disposition || 'unreviewed';
if (d === 'unreviewed') unreviewed++;
else if (d.startsWith('retain')) retain++;
else if (d.startsWith('delete') || d === 'deleted') del++;
else other++;
}
const reviewed = data.length - unreviewed;
const pct = data.length ? Math.round(reviewed / data.length * 100) : 0;
el.style.display = 'flex';
el.innerHTML =
`<span>${data.length} ${t('disp_stats_total','total')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span class="${unreviewed ? 'disp-stat-warn' : 'disp-stat-ok'}">${unreviewed} ${t('disp_stats_unreviewed','unreviewed')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span>${retain} ${t('disp_stats_retain','retain')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span>${del} ${t('disp_stats_delete','delete')}</span>` +
(other ? `<span class="disp-stat-sep"></span><span>${other} ${t('disp_stats_other','other')}</span>` : '') +
`<span class="disp-stat-sep" style="margin-left:auto"></span>` +
`<span style="font-weight:600;color:var(--accent)">${pct}% ${t('disp_stats_reviewed','reviewed')}</span>`;
}
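The bucketing in `updateDispositionStats` can be checked in isolation. The sketch below pulls the counting out into a pure function; `countDispositions` is a hypothetical name, and the sample status strings are illustrative values rather than the application's actual vocabulary.

```javascript
// Hypothetical pure version of the counting done in updateDispositionStats.
function countDispositions(items) {
  const c = { total: items.length, unreviewed: 0, retain: 0, del: 0, other: 0 };
  for (const f of items) {
    const d = f.disposition || 'unreviewed';      // missing value counts as unreviewed
    if (d === 'unreviewed') c.unreviewed++;
    else if (d.startsWith('retain')) c.retain++;
    else if (d.startsWith('delete') || d === 'deleted') c.del++;
    else c.other++;
  }
  // Share of items that have any disposition other than "unreviewed".
  c.pct = c.total ? Math.round((c.total - c.unreviewed) / c.total * 100) : 0;
  return c;
}

// Illustrative data only.
const sample = [
  {},
  { disposition: 'retain-legal' },
  { disposition: 'delete-scheduled' },
  { disposition: 'deleted' },
  { disposition: 'private-use' },
];
console.log(countDispositions(sample));
```

Note that `'deleted'.startsWith('delete')` is already true, so the extra `d === 'deleted'` comparison does not change the result.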
// ── Bulk disposition tagging ──────────────────────────────────────────────────
function toggleSelectMode() {
S._selectMode = !S._selectMode;
document.body.classList.toggle('select-mode', S._selectMode);
const btn = document.getElementById('selectModeBtn');
if (btn) {
btn.style.background = S._selectMode ? 'var(--accent)' : 'none';
btn.style.color = S._selectMode ? '#fff' : 'var(--muted)';
btn.style.borderColor = S._selectMode ? 'var(--accent)' : 'var(--border)';
}
if (!S._selectMode) {
S._selectedIds.clear();
_updateBulkBar();
} else {
closePreview();
}
// Re-render so card onclick handlers respect new mode
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
}
function toggleCardSelect(id, ev) {
if (ev) ev.stopPropagation();
if (S._selectedIds.has(id)) S._selectedIds.delete(id);
else S._selectedIds.add(id);
const cb = document.querySelector(`.card[data-id="${CSS.escape(id)}"] .card-cb`);
if (cb) cb.checked = S._selectedIds.has(id);
const card = document.querySelector(`.card[data-id="${CSS.escape(id)}"]`);
if (card) card.classList.toggle('card-selected-bulk', S._selectedIds.has(id));
_updateBulkBar();
}
function selectAllVisible() {
const allChecked = S.filteredData.every(f => S._selectedIds.has(f.id));
if (allChecked) {
S.filteredData.forEach(f => { S._selectedIds.delete(f.id); });
} else {
S.filteredData.forEach(f => { S._selectedIds.add(f.id); });
}
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
_updateBulkBar();
}
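`selectAllVisible` acts as a toggle: if every visible item is already selected it deselects them, otherwise it selects them all. A minimal sketch of that rule without the DOM, where `toggleSelectAll` is a hypothetical name and the arguments stand in for `S.filteredData` and `S._selectedIds`:

```javascript
// Hypothetical DOM-free version of the toggle in selectAllVisible.
function toggleSelectAll(visible, selected) {
  const allChecked = visible.every(f => selected.has(f.id));
  for (const f of visible) {
    if (allChecked) selected.delete(f.id);
    else selected.add(f.id);
  }
  return selected;
}

// Illustrative data: 'z' is selected but currently filtered out of view.
const visible = [{ id: 'a' }, { id: 'b' }];
const selected = new Set(['a', 'z']);
toggleSelectAll(visible, selected);   // selects the missing 'b'
console.log([...selected]);
```

Selections that are not in the visible set are left untouched in both directions, so a filter change does not silently drop them.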
function _updateBulkBar() {
const bar = document.getElementById('bulkTagBar');
const cnt = document.getElementById('bulkTagCount');
const saEl = document.getElementById('bulkSelectAll');
if (!bar) return;
const n = S._selectedIds.size;
bar.style.display = (S._selectMode && n > 0) ? 'flex' : 'none';
if (cnt) cnt.textContent = n + ' ' + t('bulk_selected', 'selected');
if (saEl) {
const allVis = S.filteredData.length > 0 && S.filteredData.every(f => S._selectedIds.has(f.id));
saEl.textContent = allVis
? t('bulk_deselect_all', 'Deselect all')
: t('bulk_select_all', 'Select all visible');
}
}
async function applyBulkDisposition() {
const status = document.getElementById('bulkDispSelect')?.value;
if (!status || S._selectedIds.size === 0) return;
const ids = [...S._selectedIds];
const btn = document.getElementById('bulkTagApplyBtn');
const statusEl = document.getElementById('bulkTagStatus');
if (btn) btn.disabled = true;
if (statusEl) statusEl.textContent = '';
try {
const r = await fetch('/api/db/disposition/bulk', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({item_ids: ids, status}),
});
const d = await r.json();
if (d.error) throw new Error(d.error);
// Update in-memory items
for (const f of S.flaggedData) {
if (S._selectedIds.has(f.id)) f.disposition = status;
}
if (statusEl) {
statusEl.textContent = '✓ ' + d.saved + ' ' + t('bulk_applied', 'updated');
setTimeout(() => { if (statusEl) statusEl.textContent = ''; }, 2000);
}
S._selectedIds.clear();
_updateBulkBar();
// Refresh filter if disposition filter is active
const dispFilter = document.getElementById('filterDisposition')?.value;
if (dispFilter) applyFilters();
else renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateDispositionStats();
} catch(e) {
if (statusEl) statusEl.textContent = e.message;
} finally {
if (btn) btn.disabled = false;
}
}
function closePreview() {
const panel = document.getElementById('previewPanel');
panel.style.width = ''; // clear inline width so CSS .hidden { width:0 } takes effect
@ -399,9 +649,13 @@ async function deleteItem(f, cardEl) {
});
const d = await r.json();
if (d.ok) {
S.flaggedData = S.flaggedData.filter(x => x.id !== f.id);
S.filteredData = S.filteredData.filter(x => x.id !== f.id);
if (cardEl) cardEl.remove();
// Keep the deleted item in the grid (marked, greyed, action buttons
// hidden) until the next scan run, so the operator can see what was
// handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
log(t('m365_log_deleted', 'Deleted:') + ' ' + f.name, 'ok');
if (_previewItemId === f.id) closePreview();
@ -413,6 +667,36 @@ async function deleteItem(f, cardEl) {
}
}
async function redactItem(f, cardEl) {
if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. This cannot be undone.'))) return;
if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; }
try {
const r = await fetch('/api/redact_item', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({id: f.id, source_type: f.source_type})
});
const d = await r.json();
if (d.ok) {
// Keep the redacted item in the grid (marked, greyed, action buttons
// hidden) until the next scan run, so the operator can see what was
// handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._redacted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok');
if (_previewItemId === f.id) closePreview();
} else {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err');
}
} catch(e) {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err');
}
}
// ── Bulk delete modal ─────────────────────────────────────────────────────────
function openBulkDelete() {
@ -436,6 +720,7 @@ function _bdFilters() {
function _bdMatches() {
const f = _bdFilters();
return S.flaggedData.filter(x => {
if (x._deleted || x._redacted) return false; // already handled this session
if (f.source_type && x.source_type !== f.source_type) return false;
if (x.cpr_count < f.min_cpr) return false;
if (f.older_than_date && x.modified > f.older_than_date) return false;
@ -488,25 +773,34 @@ function _ensureSSE() {
function _sseWatchdog() {
fetch('/api/scan/status').then(function(r) { return r.json(); }).then(function(status) {
if (status.running) {
var anyRunning = status.running || status.google_running;
if (anyRunning) {
// A scan is in progress — make sure SSE is connected and progress UI is visible
_ensureSSE();
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
if (status.running && !S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = true;
document.getElementById('stopBtn').style.display = 'inline-block';
// /api/scan/status checks the M365 lock — if running=true it's an M365 scan
// status.running reflects the M365 + file lock; treat as an M365 reconnect
S._m365ScanRunning = true; _renderProgressSegments();
document.getElementById('progressFile').textContent = t('m365_sse_reconnecting', 'Reconnecting to running scan…');
log(t('m365_sse_reconnecting', 'Reconnecting to running scan…'));
}
} else if (!S._historyRefScanId && !(S.flaggedData && S.flaggedData.length)) {
// No scan of any kind is running (authoritative, both locks free) and
// nothing is shown yet — restore the last saved session from the DB.
// Retried on every poll, not one-shot: the initial attempt can be blocked
// by running flags that SSE replay of a *completed* scan set but never
// cleared, and sse_replay_done only fires for a non-empty buffer (so it
// never retries after a server restart clears the replay buffer).
// Both locks are confirmed free, so clear any stale flags first.
S._m365ScanRunning = false;
S._googleScanRunning = false;
S._fileScanRunning = false;
window.loadHistorySession?.(null);
}
if (!_initialStatusChecked) {
_initialStatusChecked = true;
if (!status.running) loadLastScanSummary();
}
// When no scan is running, we still keep polling — the SSE connection
// may have died and we need to detect the *next* scheduled scan.
// The SSE itself is only opened/reopened when a scan is detected.
// Keep polling even when idle — the SSE connection may have died and we
// need to detect the next scheduled scan (SSE is only opened on demand).
}).catch(function(err) {
// Status endpoint unavailable — server might be restarting
console.warn('[SSE] status poll failed:', err);
@ -641,9 +935,12 @@ async function executeBulkDelete() {
});
const d = await r.json();
if (d.ok) {
const deletedSet = new Set(matches.map(x => x.id));
S.flaggedData = S.flaggedData.filter(x => !deletedSet.has(x.id));
S.filteredData = S.filteredData.filter(x => !deletedSet.has(x.id));
// Keep the deleted items in the grid (marked, greyed, buttons hidden)
// until the next scan run — only those the server actually deleted.
const deletedSet = new Set(d.deleted_ids || matches.map(x => x.id));
const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
prog.innerHTML = `<span style="color:var(--ok,#4c4)">✓ ${d.deleted} ${t('m365_bulk_deleted', 'deleted')}</span>` +
@ -669,6 +966,7 @@ function applyFilters() {
const dispVal = document.getElementById('filterDisposition')?.value || '';
const transferVal = document.getElementById('filterTransfer')?.value || '';
const specialVal = document.getElementById('filterSpecial')?.value || '';
const roleVal = document.getElementById('filterRole')?.value || '';
S.filteredData = S.flaggedData.filter(f => {
if (search && !f.name.toLowerCase().includes(search)) return false;
if (srcVal && f.source_type !== srcVal) return false;
@ -676,6 +974,8 @@ function applyFilters() {
if (transferVal && (f.transfer_risk || '') !== transferVal) return false;
if (specialVal === '1' && !(f.special_category && f.special_category.length)) return false;
if (specialVal === 'photo' && !(f.face_count > 0)) return false;
if (roleVal === 'student' && f.user_role !== 'student') return false;
if (roleVal === 'staff' && f.user_role === 'student') return false;
return true;
});
const grid = document.getElementById('grid');
@ -721,7 +1021,8 @@ async function exportExcel() {
return;
}
// Browser / localhost fallback: fetch as blob and trigger download
const r = await fetch('/api/export_excel');
const _roleParam = document.getElementById('filterRole')?.value || '';
const r = await fetch('/api/export_excel' + (_roleParam ? '?role=' + encodeURIComponent(_roleParam) : ''));
if (!r.ok) {
const err = await r.json().catch(() => ({error: 'Export failed'}));
log('Export error: ' + (err.error || r.status), 'err');
@ -762,7 +1063,8 @@ async function exportArticle30() {
const btn = document.getElementById('exportA30Btn');
if (btn) { btn.disabled = true; btn.textContent = '⏳'; }
try {
const r = await fetch('/api/export_article30');
const _roleParam30 = document.getElementById('filterRole')?.value || '';
const r = await fetch('/api/export_article30' + (_roleParam30 ? '?role=' + encodeURIComponent(_roleParam30) : ''));
if (!r.ok) {
const err = await r.json().catch(() => ({error: 'Export failed'}));
log('Article 30 export error: ' + (err.error || r.status), 'err');
@ -796,6 +1098,8 @@ function clearFilters() {
if (ft) ft.value = '';
const fs = document.getElementById('filterSpecial');
if (fs) fs.value = '';
const fr = document.getElementById('filterRole');
if (fr) fr.value = '';
applyFilters();
}
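The role filter added in these hunks treats `staff` as "anything that is not a student", and both exports forward the selected value as a `role` query parameter. The sketch below pulls those two rules out of the DOM code so they can be run in isolation. `matchesRole` and `exportUrl` are hypothetical names, not functions from the repository; only the `user_role` field, the `student`/`staff` values and the `?role=` parameter come from the diff.

```javascript
// Hypothetical helpers illustrating the role rules in applyFilters() and the
// export functions. Not the repository's own code.
function matchesRole(item, roleVal) {
  if (roleVal === 'student') return item.user_role === 'student';
  if (roleVal === 'staff') return item.user_role !== 'student';
  return true; // empty filter value matches everything
}

function exportUrl(base, roleVal) {
  // Only append the parameter when a role is actually selected.
  return base + (roleVal ? '?role=' + encodeURIComponent(roleVal) : '');
}

const items = [
  { name: 'a.docx', user_role: 'student' },
  { name: 'b.docx', user_role: 'staff' },
  { name: 'c.docx' }, // role unknown
];
const staffNames = items.filter(f => matchesRole(f, 'staff')).map(f => f.name);
const excelUrl = exportUrl('/api/export_excel', 'staff');
const a30Url = exportUrl('/api/export_article30', '');
console.log(staffNames, excelUrl, a30Url);
```

One consequence of the `!== 'student'` rule is that items with no recorded role show up under `staff`. Whether the server-side export applies the same rule is not visible in this diff.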
@ -861,6 +1165,7 @@ window.loadDisposition = loadDisposition;
window.saveDisposition = saveDisposition;
window.closePreview = closePreview;
window.deleteItem = deleteItem;
window.redactItem = redactItem;
window.openBulkDelete = openBulkDelete;
window.closeBulkDelete = closeBulkDelete;
window._bdFilters = _bdFilters;
@ -872,6 +1177,10 @@ window._autoConnectSSEIfRunning = _autoConnectSSEIfRunning;
window._loadViewerResults = _loadViewerResults;
window.executeBulkDelete = executeBulkDelete;
window.applyFilters = applyFilters;
window.toggleSelectMode = toggleSelectMode;
window.toggleCardSelect = toggleCardSelect;
window.selectAllVisible = selectAllVisible;
window.applyBulkDisposition = applyBulkDisposition;
window.exportExcel = exportExcel;
window.exportArticle30 = exportArticle30;
window.clearFilters = clearFilters;
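The `deleteItem`, `redactItem` and `executeBulkDelete` hunks above switch from filtering handled items out of the arrays to marking them in place, and `_bdMatches` then skips anything already marked. A minimal sketch of that pattern follows. `markHandled` and `bulkCandidates` are hypothetical names used for illustration; the `_deleted` and `_redacted` flags and the filter fields are taken from the diff.

```javascript
// Hypothetical helpers; the real code marks S.flaggedData / S.filteredData
// directly and then calls renderGrid().
function markHandled(items, ids, flag) {
  const idSet = new Set(ids);
  let marked = 0;
  for (const item of items) {
    if (idSet.has(item.id)) {
      item[flag] = true; // keep the item, just flag it
      marked++;
    }
  }
  return marked;
}

function bulkCandidates(items, filters) {
  return items.filter(x => {
    if (x._deleted || x._redacted) return false; // already handled this session
    if (filters.source_type && x.source_type !== filters.source_type) return false;
    if (x.cpr_count < filters.min_cpr) return false;
    return true;
  });
}

const flagged = [
  { id: 'a', source_type: 'email', cpr_count: 3 },
  { id: 'b', source_type: 'file', cpr_count: 1 },
  { id: 'c', source_type: 'email', cpr_count: 2 },
];
const markedCount = markHandled(flagged, ['a'], '_deleted');
const remaining = bulkCandidates(flagged, { source_type: 'email', min_cpr: 1 });
console.log(markedCount, remaining.map(x => x.id));
```

Because the array length no longer shrinks, any count derived from `flaggedData.length` would include handled items unless it checks the flags as well.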


@ -67,7 +67,7 @@ async function doImportDB() {
}
if (mode === 'replace') {
if (!confirm(t('m365_db_import_replace_confirm',
'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\n\nProceed?'))) return;
'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\n\nProceed?'))) return;
}
btn.disabled = true;
stat.style.color = 'var(--muted)';
@ -125,6 +125,12 @@ function buildScanPayload() {
max_emails: parseInt(document.getElementById('optMaxEmails').value) || 200,
delta: document.getElementById('optDelta') ? document.getElementById('optDelta').checked : false,
scan_photos: document.getElementById('optScanPhotos') ? document.getElementById('optScanPhotos').checked : false,
skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false,
min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1,
ocr_lang: document.getElementById('optOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('optCprOnly') ? document.getElementById('optCprOnly').checked : false,
scan_emails: document.getElementById('optScanEmails') ? document.getElementById('optScanEmails').checked : false,
scan_phones: document.getElementById('optScanPhones') ? document.getElementById('optScanPhones').checked : false,
retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false,
retention_years: parseInt(document.getElementById('optRetentionYears')?.value) || 5,
fiscal_year_end: document.getElementById('optFiscalYearEnd')?.value || '',
@ -132,26 +138,39 @@ function buildScanPayload() {
return { sources, fileSources, allSources, googleSources, user_ids, options };
}
async function checkCheckpoint() {
async function checkCheckpoint(onNoCheckpoint) {
const payload = buildScanPayload();
if (!payload.sources.length && !payload.fileSources.length) return;
if (payload.sources.length && !payload.user_ids.length) return;
const banner = document.getElementById('resumeBanner');
const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
if (!hasSources) {
if (banner) banner.style.display = 'none';
onNoCheckpoint?.(); return;
}
// M365 sources without users — scan button will handle the alert
if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
if (banner) banner.style.display = 'none';
onNoCheckpoint?.(); return;
}
// Collect Google user emails for server-side checkpoint key computation
const googleUserEmails = payload.googleSources.length > 0
? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
: [];
try {
const r = await fetch('/api/scan/checkpoint', {
method: 'POST', headers: {'Content-Type':'application/json'},
body: JSON.stringify(payload)
body: JSON.stringify({...payload, googleUserEmails})
});
const d = await r.json();
const banner = document.getElementById('resumeBanner');
if (d.exists) {
const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
document.getElementById('resumeBannerText').textContent =
t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
banner.style.display = 'flex';
if (banner) banner.style.display = 'flex';
} else {
banner.style.display = 'none';
if (banner) banner.style.display = 'none';
onNoCheckpoint?.();
}
} catch(e) { /* ignore */ }
} catch(e) { onNoCheckpoint?.(); }
}
async function clearCheckpointAndScan() {
@ -169,8 +188,7 @@ async function checkDeltaStatus() {
const row = document.getElementById('deltaStatusRow');
const txt = document.getElementById('deltaStatusText');
if (d.exists) {
const src = d.count === 1 ? '1 source' : `${d.count} sources`;
txt.textContent = t('m365_delta_tokens_saved', `Tokens saved for ${src}`);
txt.textContent = t('m365_delta_tokens_saved', 'Tokens saved for {n} source(s)').replace('{n}', d.count);
row.style.display = 'flex';
row.style.alignItems = 'center';
} else {
@ -318,16 +336,16 @@ function _attachScanListeners(source) {
var fill = document.getElementById('progressFill_' + src);
if (fill) fill.style.width = pct + '%';
document.getElementById('progressFile').textContent = d.file || '';
// Only update stats/ETA from M365 (has meaningful totals and ETA)
if (src === 'm365') {
var statsEl = document.getElementById('progressStats');
if (statsEl && d.total) {
statsEl.textContent = (d.index || 0) + ' / ' + d.total;
}
var etaEl = document.getElementById('progressEta');
if (etaEl && d.eta !== undefined) {
etaEl.textContent = d.eta ? ('ETA ' + d.eta) : '';
}
if (src === 'm365') {
// M365 sends index + total + ETA — show exact counter
if (statsEl && d.total) statsEl.textContent = (d.index || 0) + ' / ' + d.total;
if (etaEl && d.eta !== undefined) etaEl.textContent = d.eta ? ('ETA ' + d.eta) : '';
} else if (!S._m365ScanRunning) {
// Google / file: no total known upfront — show running count once M365 is done
if (statsEl && d.scanned !== undefined) statsEl.textContent = d.scanned + ' scanned';
if (etaEl) etaEl.textContent = '';
}
});
source.addEventListener('scan_file', function(e) {
@ -363,17 +381,24 @@ function _attachScanListeners(source) {
source.addEventListener('scan_done', function(e) {
var d = JSON.parse(e.data);
console.log('[SSE] scan_done:', d);
// Only close SSE if the user started this scan via the Scan button.
// For scheduled scans, keep the SSE connection alive so future
// scheduler events are still received.
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
S._srcPct.m365 = 100;
S._m365ScanRunning = false;
_renderProgressSegments();
var _anyRunning = S._googleScanRunning || S._fileScanRunning;
// Clear M365 counter/ETA so Google/file progress can take over the display
if (_anyRunning) {
var _se = document.getElementById('progressStats');
var _ee = document.getElementById('progressEta');
if (_se) _se.textContent = '';
if (_ee) _ee.textContent = '';
}
// Only close SSE once all concurrent scans have finished.
// Closing early would drop google_scan_done / file_scan_done events and
// leave the UI stuck in scanning state.
if (S._userStartedScan && !_anyRunning) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
if (!_anyRunning) setLogLive('');
document.getElementById('scanBtn').disabled = _anyRunning;
document.getElementById('stopBtn').style.display = _anyRunning ? 'inline-block' : 'none';
@ -397,6 +422,7 @@ function _attachScanListeners(source) {
if (d.delta) checkDeltaStatus();
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
source.addEventListener('google_scan_done', function(e) {
var d = JSON.parse(e.data);
@ -405,6 +431,10 @@ function _attachScanListeners(source) {
S._googleScanRunning = false;
_renderProgressSegments();
if (!S._m365ScanRunning && !S._fileScanRunning) {
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
setLogLive('');
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
@ -421,6 +451,7 @@ function _attachScanListeners(source) {
log('Google scan complete \u2014 ' + d.flagged_count + ' flagged of ' + d.total_scanned, 'ok');
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
source.addEventListener('file_scan_done', function(e) {
var d = JSON.parse(e.data);
@ -429,6 +460,10 @@ function _attachScanListeners(source) {
S._fileScanRunning = false;
_renderProgressSegments();
if (!S._m365ScanRunning && !S._googleScanRunning) {
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
setLogLive('');
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
@ -442,14 +477,21 @@ function _attachScanListeners(source) {
applyFilters();
}
}
log('Bestandsscan fuldført \u2014 ' + d.flagged_count + ' flagget af ' + d.total_scanned, 'ok');
log('Bestandsscan fuldf\u00f8rt \u2014 ' + d.flagged_count + ' flagget af ' + d.total_scanned, 'ok');
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
// sse_replay_done marks end of buffer replay — log a note so the user knows
// earlier events above were replayed from an already-running scan
// earlier events above were replayed from an already-running scan.
// Also retry loadHistorySession if it bailed during replay: scan_phase events
// from a completed scan's replay temporarily set running flags to true, causing
// the watchdog's loadHistorySession call to bail before scan_done clears them.
source.addEventListener('sse_replay_done', function() {
log(t('m365_sse_replay_note', 'Live log resumed \u2014 earlier entries replayed from running scan.'));
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning && !S._historyRefScanId) {
window.loadHistorySession?.(null);
}
});
}
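The three `*_scan_done` handlers above now share one rule: the event stream is closed only when the user started the scan and no other scan type is still running. The sketch below models that rule without an `EventSource`. `createScanTracker` is a hypothetical name, and `closeStream` stands in for `S.es.close()`.

```javascript
// Hypothetical state container mirroring the three running flags in the diff.
function createScanTracker(closeStream) {
  const running = { m365: false, google: false, file: false };
  let userStarted = false;
  return {
    start(kinds) {
      userStarted = true;
      for (const k of kinds) running[k] = true;
    },
    // Would be called from scan_done / google_scan_done / file_scan_done.
    finish(kind) {
      running[kind] = false;
      const anyRunning = running.m365 || running.google || running.file;
      if (userStarted && !anyRunning) {
        userStarted = false;
        closeStream(); // safe: no further *_scan_done events are expected
      }
      return anyRunning;
    },
  };
}

let closed = 0;
const tracker = createScanTracker(() => { closed++; });
tracker.start(['m365', 'google']);
const stillRunning = tracker.finish('m365'); // Google still running, stream stays open
const closedAfterFirst = closed;
tracker.finish('google');                    // last scan done, stream closes
console.log(stillRunning, closedAfterFirst, closed);
```

Scheduled scans never set the user-started flag in this model, so the stream stays open for them, which matches the intent described in the removed comment.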
@ -510,6 +552,8 @@ function startScan(resume) {
document.getElementById('statsSection').style.display = 'none';
document.getElementById('statsPill').style.display = 'none';
}
// Exit history mode — live SSE takes over
window.exitHistoryMode?.();
document.getElementById('resumeBanner').style.display = 'none';
document.getElementById('logPanel').innerHTML = '<div class="log-line log-live" id="logLive" style="display:none"></div>';
try { sessionStorage.removeItem(_LOG_SESSION_KEY); } catch(e) {}
@ -540,6 +584,22 @@ function startScan(resume) {
S._userStartedScan = true;
_ensureSSE();
// Revert to idle if every scan type that was supposed to start got rejected.
// Called after each 409 so we don't leave the UI stuck in "running" state
// while the previous scan's thread finishes winding down.
function _onScanConflict(label) {
log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err');
if (label === 'm365') S._m365ScanRunning = false;
if (label === 'file') S._fileScanRunning = false;
if (label === 'google') S._googleScanRunning = false;
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
if (S.es) { S.es.close(); S.es = null; }
S._userStartedScan = false;
}
}
setTimeout(() => {
// Fire M365 scan if any M365 sources are selected
if (sources.length > 0) {
@ -548,7 +608,7 @@ function startScan(resume) {
body: JSON.stringify({sources, user_ids, options, resume: !!resume,
profile_id: S._activeProfileId || null})
}).then(r => {
if (r.status === 409) { log('Scan already running', 'err'); }
if (r.status === 409) { _onScanConflict('m365'); }
}).catch(e => { log('Scan start failed: ' + e, 'err'); });
}
@ -562,7 +622,17 @@ function startScan(resume) {
if (!source) return;
fetch('/api/file_scan/start', {
method: 'POST', headers: {'Content-Type':'application/json'},
body: JSON.stringify(Object.assign({}, source, {scan_photos: options.scan_photos || false}))
body: JSON.stringify(Object.assign({}, source, {
scan_photos: options.scan_photos || false,
skip_gps_images: options.skip_gps_images || false,
min_cpr_count: options.min_cpr_count || 1,
scan_emails: options.scan_emails || false,
scan_phones: options.scan_phones || false,
cpr_only: options.cpr_only || false,
ocr_lang: options.ocr_lang || 'dan+eng',
}))
}).then(r => {
if (r.status === 409) { _onScanConflict('file'); }
}).catch(e => { log('File scan error: ' + e, 'err'); });
});
@ -585,7 +655,7 @@ function startScan(resume) {
options: options
})
}).then(r => {
if (r.status === 409) { log('Google scan already running', 'err'); }
if (r.status === 409) { _onScanConflict('google'); }
}).catch(e => { log('Google scan error: ' + e, 'err'); });
}
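The `_onScanConflict` helper introduced in `startScan` clears the flag for the rejected scan type and returns the UI to idle only when nothing else was accepted. A sketch of that decision under simplified assumptions: `state` stands in for the `S._*ScanRunning` flags and `ui` for the scan/stop buttons and the event stream, and both objects are hypothetical.

```javascript
// Hypothetical pure version of the 409 handling; the real helper also logs a
// translated message and touches the DOM.
function onScanConflict(state, ui, kind) {
  state[kind] = false;
  const anyRunning = state.m365 || state.google || state.file;
  if (!anyRunning) {
    ui.scanDisabled = false;
    ui.stopVisible = false;
    ui.streamOpen = false;
    state.userStarted = false;
  }
  return anyRunning;
}

const state = { m365: true, google: false, file: true, userStarted: true };
const ui = { scanDisabled: true, stopVisible: true, streamOpen: true };

const afterFirst = onScanConflict(state, ui, 'm365'); // file scan still accepted
const uiAfterFirst = { ...ui };
const afterSecond = onScanConflict(state, ui, 'file'); // every request rejected
console.log(afterFirst, uiAfterFirst, afterSecond, ui);
```

Keeping the UI in the running state after the first conflict matters when a file or Google scan was accepted in the same click, since its progress events still need the open stream.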


@ -18,19 +18,19 @@ function schedLoad() {
var descEl = document.getElementById('schedDesc_' + js.id);
if (!descEl) return;
var j2 = _schedJobs.find(function(x){ return x.id === js.id; });
var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? 'Weekly' : j2.frequency === 'monthly' ? 'Monthly' : 'Daily');
var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j2.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily'));
var timeStr = !j2 ? '' : String(j2.hour||0).padStart(2,'0') + ':' + String(j2.minute||0).padStart(2,'0');
var base = freqLabel + ' ' + timeStr;
var runBtn = document.getElementById('schedRunBtn_' + js.id);
if (js.is_running) {
descEl.textContent = base + ' \u00b7 Running...';
descEl.textContent = base + ' \u00b7 ' + t('m365_sched_running','Running...');
if (runBtn) { runBtn.style.borderColor='#22c55e'; runBtn.style.color='#22c55e'; }
} else if (js.next_run) {
var dt = new Date(js.next_run);
descEl.textContent = base + ' \u00b7 Next: ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'});
descEl.textContent = base + ' \u00b7 ' + t('m365_sched_next','Next') + ': ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'});
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
} else {
descEl.textContent = base + (js.enabled ? '' : ' \u00b7 Disabled');
descEl.textContent = base + (js.enabled ? '' : ' \u00b7 ' + t('m365_sched_disabled','Disabled'));
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
}
});
@ -41,20 +41,23 @@ function schedRenderJobs() {
var list = document.getElementById('schedJobList');
if (!list) return;
if (!_schedJobs.length) {
list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">No scheduled scans yet.</div>';
list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">' + t('m365_sched_no_jobs','No scheduled scans yet.') + '</div>';
return;
}
list.innerHTML = _schedJobs.map(function(j) {
var sid = _esc(j.id);
var sname = _esc(j.name || 'Unnamed');
var freqLabel = j.frequency === 'weekly' ? 'Weekly' : j.frequency === 'monthly' ? 'Monthly' : 'Daily';
var freqLabel = j.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily');
var timeStr = String(j.hour||0).padStart(2,'0') + ':' + String(j.minute||0).padStart(2,'0');
var desc = freqLabel + ' ' + timeStr;
var chk = j.enabled ? ' checked' : '';
var roBadge = j.report_only
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#E8F4FD;color:#2980B9;border:1px solid #AED6F1;margin-left:4px">' + t('m365_sched_report_only','Report only') + '</span>'
: '';
return '<div style="display:flex;align-items:center;gap:6px;padding:5px 6px;border:1px solid var(--border);border-radius:6px;background:var(--surface)">'
+ '<label class="toggle" style="flex:unset;margin:0"><input type="checkbox"'+chk+' onchange="schedToggleEnabled(\''+sid+'\',this.checked)"><span class="toggle-slider"></span></label>'
+ '<div style="flex:1;min-width:0">'
+ '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+'</div>'
+ '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+roBadge+'</div>'
+ '<div id="schedDesc_'+sid+'" style="font-size:10px;color:var(--muted)">'+desc+'</div>'
+ '</div>'
+ '<button onclick="schedRunJob(\''+sid+'\')" id="schedRunBtn_'+sid+'" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer" title="Run now">&#9654;</button>'
@ -89,6 +92,8 @@ function schedAddJob() {
document.getElementById('schedMinute').value = 0;
document.getElementById('schedAutoEmail').checked = false;
document.getElementById('schedAutoRetention').checked = false;
document.getElementById('schedReportOnly').checked = false;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_new', 'New scheduled scan');
schedPopulateProfiles('');
@ -111,6 +116,8 @@ function schedEditJob(id) {
document.getElementById('schedMinute').value = j.minute != null ? j.minute : 0;
document.getElementById('schedAutoEmail').checked = !!j.auto_email;
document.getElementById('schedAutoRetention').checked = !!j.auto_retention;
document.getElementById('schedReportOnly').checked = !!j.report_only;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_edit', 'Edit scheduled scan');
schedPopulateProfiles(j.profile_id || '');
@ -123,6 +130,19 @@ function schedCancelEdit() {
document.getElementById('schedJobEditor').style.display = 'none';
}
function schedToggleReportOnly() {
var ro = !!(document.getElementById('schedReportOnly') || {}).checked;
var profileRow = document.getElementById('schedProfileRow');
var hint = document.getElementById('schedReportOnlyHint');
if (profileRow) profileRow.style.opacity = ro ? '0.4' : '';
if (hint) hint.style.display = ro ? 'block' : 'none';
// Enforce auto_email when switching to report-only
if (ro) {
var ae = document.getElementById('schedAutoEmail');
if (ae) ae.checked = true;
}
}
function schedSaveJob() {
var name = document.getElementById('schedName').value.trim();
if (!name) {
@ -144,6 +164,7 @@ function schedSaveJob() {
profile_id: document.getElementById('schedProfile').value,
auto_email: document.getElementById('schedAutoEmail').checked,
auto_retention: document.getElementById('schedAutoRetention').checked,
report_only: document.getElementById('schedReportOnly').checked,
};
var st = document.getElementById('schedSaveStatus');
st.style.color = 'var(--muted)'; st.textContent = 'Saving...';
@ -217,7 +238,7 @@ function schedLoadHistory() {
if (!el) return;
fetch('/api/scheduler/history?limit=10').then(function(r){ return r.json(); }).then(function(d) {
var runs = d.runs || [];
if (!runs.length) { el.innerHTML = '<em>No scheduled runs yet</em>'; return; }
if (!runs.length) { el.innerHTML = '<em>' + t('m365_sched_no_runs','No scheduled runs yet') + '</em>'; return; }
var html = '';
runs.forEach(function(r) {
var ts = r.started_at ? new Date(r.started_at * 1000).toLocaleString() : '-';
@ -293,13 +314,17 @@ function stLoadSmtp() {
const set = function(id, val) { const el=document.getElementById(id); if(el) el.value=val||''; };
set('st-smtpHost', d.host);
set('st-smtpPort', d.port || 587);
set('st-smtpUser', d.user);
set('st-smtpUser', d.username);
set('st-smtpFrom', d.from_addr);
set('st-smtpTo', Array.isArray(d.recipients) ? d.recipients.join(', ') : (d.recipients||''));
const tls = document.getElementById('st-smtpTls');
if (tls) tls.checked = d.starttls !== false;
if (tls) tls.checked = d.use_tls !== false;
const pw = document.getElementById('st-smtpPw');
if (pw) pw.value = d.has_password ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
const ae = document.getElementById('st-smtpAutoEmail');
if (ae) ae.checked = !!d.auto_email_manual;
const ps = document.getElementById('st-smtpPreferSmtp');
if (ps) ps.checked = !!d.prefer_smtp;
}).catch(function(){});
}
@ -310,10 +335,15 @@ async function stSmtpSave() {
const body = {
host: document.getElementById('st-smtpHost').value.trim(),
port: parseInt(document.getElementById('st-smtpPort').value) || 587,
user: document.getElementById('st-smtpUser').value.trim(),
// Backend (routes/email.py) reads these exact keys — `username`/`use_tls`,
// not `user`/`starttls`. Sending the wrong keys leaves username empty so
// server.login() is skipped and the SMTP server rejects the send.
username: document.getElementById('st-smtpUser').value.trim(),
from_addr: document.getElementById('st-smtpFrom').value.trim(),
recipients: document.getElementById('st-smtpTo').value.split(/[,;]/).map(function(s){return s.trim();}).filter(Boolean),
starttls: document.getElementById('st-smtpTls').checked,
use_tls: document.getElementById('st-smtpTls').checked,
auto_email_manual: !!(document.getElementById('st-smtpAutoEmail') || {}).checked,
prefer_smtp: !!(document.getElementById('st-smtpPreferSmtp') || {}).checked,
};
if (pw !== null) body.password = pw;
st.style.color = 'var(--muted)'; st.textContent = t('m365_smtp_saving','Saving...');
@ -334,7 +364,16 @@ async function stSmtpTest() {
body:JSON.stringify({})});
const d = await r.json();
if (d.ok) {
if (st) { st.style.color='var(--accent)'; st.textContent='\u2714 ' + (d.message || t('m365_smtp_test_ok','Connection successful')); }
let msg;
if (d.method === 'graph') {
msg = t('m365_smtp_test_ok_graph','Test email sent via Microsoft Graph to') + ' ' + (d.recipients||[]).join(', ');
} else if (d.method === 'smtp') {
msg = t('m365_smtp_test_ok_smtp','Test email sent via SMTP to') + ' ' + (d.recipients||[]).join(', ');
if (d.graph_also_failed) msg += ' ' + t('m365_smtp_graph_also_failed','(⚠ Graph also failed — Mail.Send not granted)');
} else {
msg = d.message || t('m365_smtp_test_ok','Test email sent');
}
if (st) { st.style.color='var(--accent)'; st.textContent='\u2714 ' + msg; }
} else {
if (st) { st.style.color='var(--danger)'; st.textContent='\u2717 ' + (d.error || t('m365_smtp_test_fail','Connection failed')); }
}
@ -425,6 +464,7 @@ window.schedSaveJob = schedSaveJob;
window.schedDeleteJob = schedDeleteJob;
window.schedRunJob = schedRunJob;
window.schedToggleFreqRows = schedToggleFreqRows;
window.schedToggleReportOnly = schedToggleReportOnly;
window.schedPopulateProfiles = schedPopulateProfiles;
window.schedLoadHistory = schedLoadHistory;
window.schedUpdateSidebarIndicator = schedUpdateSidebarIndicator;
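The SMTP hunks above rename the request keys to `username` and `use_tls` so they match what the backend reads. The sketch below builds the same payload from a plain object instead of DOM fields, which makes the key names and the recipient splitting easy to check. `buildSmtpPayload` and the shape of its `form` argument are hypothetical, and password handling is left out because the diff does not show how `pw` is derived.

```javascript
// Hypothetical pure function; key names follow the diff (username, use_tls,
// from_addr, recipients, auto_email_manual, prefer_smtp).
function buildSmtpPayload(form) {
  return {
    host: form.host.trim(),
    port: parseInt(form.port, 10) || 587, // fall back to 587 on empty input
    username: form.user.trim(),
    from_addr: form.from.trim(),
    // Accept comma- or semicolon-separated addresses, drop empty entries.
    recipients: form.to.split(/[,;]/).map(s => s.trim()).filter(Boolean),
    use_tls: !!form.tls,
    auto_email_manual: !!form.autoEmail,
    prefer_smtp: !!form.preferSmtp,
  };
}

const payload = buildSmtpPayload({
  host: ' smtp.example.org ', port: '', user: 'scanner', from: 'gdpr@example.org',
  to: 'a@example.org; b@example.org, ', tls: true, autoEmail: false, preferSmtp: true,
});
console.log(payload);
```

A check like `'user' in payload` in a unit test would have caught the original mismatch, since the old keys were silently ignored rather than rejected.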


@ -62,13 +62,14 @@ function renderSourcesPanel() {
S._pendingGoogleSources = null;
}
// File sources (local / SMB) — one entry per saved source
// File sources (local / SMB / SFTP) — one entry per saved source
if (S._fileSources.length > 0) {
html += '<div style="margin:6px 0 2px;font-size:10px;color:var(--muted);text-transform:uppercase;letter-spacing:.04em">'
+ '<hr style="border:none;border-top:1px solid var(--border);margin:1px 0 2px">';
S._fileSources.forEach(function(s) {
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
const isSftp = s.source_type === 'sftp';
const isSmb = !isSftp && s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSftp ? '\uD83D\uDD12' : (isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1');
const label = s.label || s.path || s.id;
const isChecked = (s.id in checked) ? checked[s.id] : true;
html += '<label class="source-check">'
@ -236,17 +237,209 @@ function closeSettings() {
}
function switchSettingsTab(tab) {
['general','security','scheduler','email','database'].forEach(function(t) {
['general','security','scheduler','email','database','auditlog','ai'].forEach(function(t) {
var cap = t.charAt(0).toUpperCase() + t.slice(1);
var pane = document.getElementById('stPane' + cap);
var btn = document.getElementById('stTab' + cap);
if (pane) pane.classList.toggle('active', t === tab);
if (btn) btn.classList.toggle('active', t === tab);
});
if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); }
if (tab === 'general') stLoadUpdateSettings();
if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); if (typeof stLoadInterfacePinStatus === 'function') stLoadInterfacePinStatus(); }
if (tab === 'email') stLoadSmtp();
if (tab === 'database') stLoadDbStats();
if (tab === 'scheduler') schedLoad();
if (tab === 'auditlog') stLoadAuditLog();
if (tab === 'ai') stLoadAiSettings();
}
async function stLoadAuditLog() {
const tbody = document.getElementById('stAuditTableBody');
if (!tbody) return;
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_loading')}</td></tr>`;
try {
const rows = await fetch('/api/audit_log?limit=200').then(r => r.json());
if (!Array.isArray(rows) || !rows.length) {
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_empty')}</td></tr>`;
return;
}
tbody.innerHTML = rows.map(function(r) {
const d = new Date(r.ts * 1000);
const ts = d.toLocaleDateString() + ' ' + d.toLocaleTimeString();
return '<tr style="border-bottom:1px solid var(--border)">'
+ '<td style="padding:4px 8px;white-space:nowrap;color:var(--muted);font-size:11px">' + window._escHtml(ts) + '</td>'
+ '<td style="padding:4px 8px"><span style="font-family:monospace;background:var(--bg);border:1px solid var(--border);border-radius:3px;padding:1px 4px;font-size:11px">' + window._escHtml(r.action) + '</span></td>'
+ '<td style="padding:4px 8px;color:var(--text);font-size:12px">' + window._escHtml(r.detail) + '</td>'
+ '<td style="padding:4px 8px;color:var(--muted);font-size:11px">' + window._escHtml(r.ip) + '</td>'
+ '</tr>';
}).join('');
} catch(e) {
tbody.innerHTML = '<tr><td colspan="4" style="padding:8px;color:var(--danger)">' + window._escHtml(String(e)) + '</td></tr>';
}
}
// ── AI / Claude NER settings ─────────────────────────────────────────────────
async function stLoadAiSettings() {
try {
const cfg = await fetch('/api/settings/claude').then(r => r.json());
const cb = document.getElementById('aiEnabled');
if (cb) cb.checked = !!cfg.enabled;
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = cfg.api_key_set
? t('m365_ai_key_set', 'API key saved')
: t('m365_ai_key_not_set', 'No API key saved');
} catch(e) { /* ignore */ }
}
async function stAiSave() {
const enabled = !!(document.getElementById('aiEnabled') || {}).checked;
const keyVal = (document.getElementById('aiApiKey') || {}).value || '';
const status = document.getElementById('aiStatus');
const payload = { enabled };
if (keyVal) payload.api_key = keyVal;
try {
await fetch('/api/settings/claude', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(payload),
});
if (status) { status.textContent = t('m365_ai_saved', 'Saved'); status.style.color = 'var(--success)'; }
if (keyVal) {
const inp = document.getElementById('aiApiKey');
if (inp) inp.value = '';
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = t('m365_ai_key_set', 'API key saved');
}
setTimeout(function() { if (status) status.textContent = ''; }, 2000);
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stAiTest() {
const status = document.getElementById('aiStatus');
if (status) { status.textContent = t('m365_ai_testing', 'Testing…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/settings/claude/test', { method: 'POST' }).then(r => r.json());
if (status) {
status.textContent = res.ok
? t('m365_ai_test_ok', 'API key valid')
: (t('m365_ai_test_fail', 'Test failed') + ': ' + (res.error || ''));
status.style.color = res.ok ? 'var(--success)' : 'var(--danger)';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
// ── Software updates ─────────────────────────────────────────────────────────
async function stLoadUpdateSettings() {
try {
const cfg = await fetch('/api/update/settings').then(r => r.json());
const grp = document.getElementById('stUpdateGroup');
if (grp) grp.style.display = cfg.supported ? '' : 'none';
const cb = document.getElementById('stAutoUpdate');
if (cb) cb.checked = !!cfg.auto_update;
} catch(e) { /* ignore */ }
}
async function stSaveAutoUpdate() {
const cb = document.getElementById('stAutoUpdate');
try {
await fetch('/api/update/settings', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({ auto_update: !!(cb && cb.checked) }),
});
} catch(e) { /* ignore */ }
}
async function stCheckUpdate() {
const status = document.getElementById('stUpdateStatus');
const commits = document.getElementById('stUpdateCommits');
const applyBtn = document.getElementById('stApplyUpdateBtn');
if (status) { status.textContent = t('m365_update_checking', 'Checking…'); status.style.color = 'var(--muted)'; }
if (commits) commits.style.display = 'none';
if (applyBtn) applyBtn.style.display = 'none';
try {
const res = await fetch('/api/update/check').then(r => r.json());
if (!status) return;
if (res.error) {
status.textContent = t('m365_update_failed', 'Update check failed') + ': ' + res.error;
status.style.color = 'var(--danger)';
} else if (res.up_to_date) {
status.textContent = t('m365_update_uptodate', 'You are running the latest version.') + ' (' + res.current + ')';
status.style.color = 'var(--success)';
} else {
status.textContent = t('m365_update_available', 'Update available') + ': ' + res.current + ' → ' + res.latest;
status.style.color = 'var(--accent)';
if (commits && res.commits && res.commits.length) {
commits.innerHTML = res.commits.map(function(c) { return window._escHtml(c); }).join('<br>');
commits.style.display = '';
}
if (applyBtn) applyBtn.style.display = '';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stApplyUpdate() {
const status = document.getElementById('stUpdateStatus');
const applyBtn = document.getElementById('stApplyUpdateBtn');
const checkBtn = document.getElementById('stCheckUpdateBtn');
if (applyBtn) applyBtn.disabled = true;
if (checkBtn) checkBtn.disabled = true;
if (status) { status.textContent = t('m365_update_installing', 'Installing update — the app will restart…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/update/apply', { method: 'POST' }).then(r => r.json());
if (!res.ok) {
const msg = res.code === 'scan_running'
? t('m365_update_scan_running', 'Cannot update while a scan is running.')
: (res.error || 'Update failed');
if (status) { status.textContent = msg; status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
return;
}
if (!res.updated) { // already up to date
if (status) { status.textContent = t('m365_update_uptodate', 'You are running the latest version.'); status.style.color = 'var(--success)'; }
if (applyBtn) { applyBtn.disabled = false; applyBtn.style.display = 'none'; }
if (checkBtn) checkBtn.disabled = false;
return;
}
_stWaitForRestart();
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
}
}
// Poll until the server has gone down and come back, then reload the page.
function _stWaitForRestart() {
let tries = 0, sawDown = false;
const iv = setInterval(async function() {
tries++;
try {
await fetch('/api/about', { cache: 'no-store' }).then(r => { if (!r.ok) throw new Error(); });
if (sawDown || tries >= 5) { clearInterval(iv); location.reload(); }
} catch(e) {
sawDown = true;
}
if (tries > 90) clearInterval(iv); // give up after ~3 minutes
}, 2000);
}
function stAiToggleKey() {
const inp = document.getElementById('aiApiKey');
const btn = document.getElementById('aiShowKeyBtn');
if (!inp) return;
const show = inp.type === 'password';
inp.type = show ? 'text' : 'password';
if (btn) btn.textContent = show ? t('m365_ai_hide_key', 'Hide') : t('m365_ai_show_key', 'Show');
}
// ── Window exports (HTML handlers + cross-module calls) ─────────────────────
@ -265,5 +458,14 @@ window.confirmPinPrompt = confirmPinPrompt;
window.openSettings = openSettings;
window.closeSettings = closeSettings;
window.switchSettingsTab = switchSettingsTab;
window.stLoadAuditLog = stLoadAuditLog;
window.stLoadAiSettings = stLoadAiSettings;
window.stAiSave = stAiSave;
window.stAiTest = stAiTest;
window.stAiToggleKey = stAiToggleKey;
window.stLoadUpdateSettings = stLoadUpdateSettings;
window.stSaveAutoUpdate = stSaveAutoUpdate;
window.stCheckUpdate = stCheckUpdate;
window.stApplyUpdate = stApplyUpdate;
window._M365_SOURCES = _M365_SOURCES;
window._pinCallback = _pinCallback;

@ -28,4 +28,9 @@ export const S = {
_pendingGoogleSources: null,
// Sources
_fileSources: [],
// History browser
_historyRefScanId: null, // null = live/SSE, number = viewing a past session
// Bulk disposition
_selectMode: false,
_selectedIds: new Set(),
};

@ -28,6 +28,11 @@ async function loadUsers() {
u.selected = prevSelected.has(u.id) ? prevSelected.get(u.id) : false;
});
S._allUsers = [...fetched, ...toAdd];
// Apply deferred "select all" from a profile chosen before users loaded
if (window._pendingProfileAllUsers) {
S._allUsers.forEach(u => { u.selected = true; });
window._pendingProfileAllUsers = false;
}
renderAccountList(fetched.length <= 1);
// Merge Google users separately so they're not blocked by M365 auth timing
_mergeGoogleUsers();
@ -171,7 +176,7 @@ async function loadLastScanSummary() {
try {
const r = await fetch('/api/db/stats');
const d = await r.json();
if (!d.scan_id || S.flaggedData.length > 0) return;
if (!d.scan_id || S.flaggedData.length > 0 || S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
const panel = document.getElementById('lastScanSummary');
const empty = document.getElementById('emptyState');
if (!panel || !empty) return;

@ -1,25 +1,160 @@
// ── Viewer token management (#33) ─────────────────────────────────────────────
// Share button → modal to create, copy, and revoke read-only viewer links.
import { S } from './state.js';
let _shareBaseUrl = null; // cached so Copy buttons can build the URL synchronously
async function _getShareBaseUrl() {
// Use the machine's LAN IP so links work for remote users, not just localhost.
if (_shareBaseUrl) return _shareBaseUrl;
// The LAN-IP probe exists only to fix links when the operator browses the
// app at localhost — those would be unusable for remote users. Any other
// origin (LAN IP, or a reverse-proxied HTTPS hostname) is already routable,
// and rewriting it to http://<LAN-IP> would bypass the proxy's TLS.
const host = window.location.hostname;
if (window.location.protocol === 'https:' ||
(host !== 'localhost' && host !== '127.0.0.1' && host !== '[::1]')) {
_shareBaseUrl = window.location.origin;
return _shareBaseUrl;
}
try {
const r = await fetch('/api/local_ip');
if (r.ok) {
const d = await r.json();
if (d.ip && d.ip !== '127.0.0.1') {
return 'http://' + d.ip + ':' + window.location.port;
_shareBaseUrl = 'http://' + d.ip + ':' + window.location.port;
return _shareBaseUrl;
}
}
} catch(e) {}
return window.location.origin;
_shareBaseUrl = window.location.origin;
return _shareBaseUrl;
}
// ── User autocomplete for Share modal ────────────────────────────────────────
// Holds the resolved user when one is picked from the dropdown.
// Cleared on modal reset or when the input is edited manually.
let _selectedScopeUser = null; // { emails: string[], display_name: string }
let _userAcInit = false;
function _initUserAutocomplete() {
if (_userAcInit) return;
_userAcInit = true;
const input = document.getElementById('shareScopeUser');
const drop = document.getElementById('shareScopeUserDropdown');
if (!input || !drop) return;
input.addEventListener('input', () => {
_selectedScopeUser = null; // user edited manually — discard dropdown selection
_renderUserDropdown(input.value);
});
input.addEventListener('focus', () => _renderUserDropdown(input.value));
input.addEventListener('keydown', e => {
if (e.key === 'Escape') { drop.style.display = 'none'; }
if (e.key === 'ArrowDown') { e.preventDefault(); drop.querySelector('[data-uid]')?.focus(); }
});
drop.addEventListener('keydown', e => {
if (e.key === 'Escape') { drop.style.display = 'none'; input.focus(); }
if (e.key === 'ArrowDown') { e.preventDefault(); document.activeElement?.nextElementSibling?.focus(); }
if (e.key === 'ArrowUp') {
e.preventDefault();
const prev = document.activeElement?.previousElementSibling;
prev ? prev.focus() : input.focus();
}
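if (e.key === 'Enter') {
const el = document.activeElement;
// data-uid is only an index into the rendered matches, not a user object,
// so reuse the row's own mousedown handler (which closes over the user)
// instead of passing the parsed index to _selectUser.
if (el?.dataset?.uid) {
e.preventDefault();
el.dispatchEvent(new MouseEvent('mousedown', { bubbles: true, cancelable: true }));
}
}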
});
document.addEventListener('click', e => {
if (!document.getElementById('shareScopeUserWrap')?.contains(e.target))
drop.style.display = 'none';
}, true);
}
function _renderUserDropdown(query) {
const drop = document.getElementById('shareScopeUserDropdown');
if (!drop) return;
const users = S._allUsers;
if (!users.length) { drop.style.display = 'none'; return; }
const q = (query || '').trim().toLowerCase();
const matches = (q
? users.filter(u =>
(u.displayName || '').toLowerCase().includes(q) ||
(u.email || '').toLowerCase().includes(q) ||
(u.googleEmail || '').toLowerCase().includes(q))
: users
).slice(0, 8);
if (!matches.length) { drop.style.display = 'none'; return; }
drop.innerHTML = '';
matches.forEach((u, i) => {
const emails = [u.email, u.googleEmail].filter(Boolean);
const emailLbl = emails.join(', ');
const roleLbl = u.userRole === 'staff' ? t('share_scope_staff', 'Staff')
: u.userRole === 'student' ? t('share_scope_student', 'Students')
: '';
const row = document.createElement('div');
row.tabIndex = 0;
row.dataset.uid = i; // index into matches; resolved in _selectUser
row.style.cssText = 'display:flex;align-items:center;gap:8px;padding:6px 10px;cursor:pointer;font-size:12px'
+ (i < matches.length - 1 ? ';border-bottom:1px solid var(--border)' : '');
row.innerHTML =
'<div style="flex:1;min-width:0">' +
'<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' +
(u.displayName || emails[0] || '') +
(roleLbl ? ' <span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;font-weight:600">' + roleLbl + '</span>' : '') +
'</div>' +
'<div style="font-size:10px;color:var(--muted);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' + emailLbl + '</div>' +
'</div>';
row.addEventListener('mouseenter', () => row.style.background = 'var(--surface)');
row.addEventListener('mouseleave', () => row.style.background = '');
row.addEventListener('focus', () => row.style.background = 'var(--surface)');
row.addEventListener('blur', () => row.style.background = '');
row.addEventListener('mousedown', e => {
e.preventDefault();
_selectUser(u);
});
drop.appendChild(row);
});
drop.style.display = '';
}
function _selectUser(u) {
const input = document.getElementById('shareScopeUser');
const drop = document.getElementById('shareScopeUserDropdown');
const emails = [u.email, u.googleEmail].filter(Boolean);
_selectedScopeUser = {
emails: emails,
display_name: u.displayName || emails[0] || '',
};
if (input) input.value = u.displayName || emails[0] || '';
if (drop) drop.style.display = 'none';
}
function _shareScopeTypeChanged() {
const type = document.getElementById('shareScopeType')?.value || '';
document.getElementById('shareScopeRoleWrap').style.display = type === 'role' ? '' : 'none';
document.getElementById('shareScopeUserWrap').style.display = type === 'user' ? '' : 'none';
if (type === 'user') _initUserAutocomplete();
}
function _resetShareForm() {
document.getElementById('shareLabel').value = '';
document.getElementById('shareExpiry').value = '30';
const scopeType = document.getElementById('shareScopeType');
if (scopeType) { scopeType.value = ''; _shareScopeTypeChanged(); }
_selectedScopeUser = null;
const scopeUser = document.getElementById('shareScopeUser');
if (scopeUser) scopeUser.value = '';
const scopeDrop = document.getElementById('shareScopeUserDropdown');
if (scopeDrop) scopeDrop.style.display = 'none';
const vf = document.getElementById('shareValidFrom'); if (vf) vf.value = '';
const vt = document.getElementById('shareValidTo'); if (vt) vt.value = '';
}
function openShareModal() {
document.getElementById('shareBackdrop').classList.add('open');
document.getElementById('shareNewLinkRow').style.display = 'none';
document.getElementById('shareLabel').value = '';
document.getElementById('shareExpiry').value = '30';
_resetShareForm();
_renderTokenList();
fetch('/api/viewer/pin').then(function(r){ return r.json(); }).then(function(d) {
const el = document.getElementById('sharePinStatus');
@ -31,7 +166,7 @@ function closeShareModal() {
document.getElementById('shareBackdrop').classList.remove('open');
}
async function _renderTokenList() {
async function _renderTokenList(highlightToken) {
const list = document.getElementById('shareTokenList');
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('lbl_loading', 'Loading…') + '</div>';
try {
@ -51,10 +186,31 @@ async function _renderTokenList() {
: '—';
const row = document.createElement('div');
row.style.cssText = 'display:flex;align-items:center;gap:8px;padding:6px 10px;background:var(--bg);border:1px solid var(--border);border-radius:6px;font-size:12px';
const roleVal = tok.scope?.role || '';
const roleLbl = roleVal === 'student' ? t('share_scope_student', 'Students')
: roleVal === 'staff' ? t('share_scope_staff', 'Staff')
: '';
const roleBadge = roleLbl
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle">' + roleLbl + '</span>'
: '';
const userScope = tok.scope?.user;
const userLbl = tok.scope?.display_name
|| (Array.isArray(userScope) ? userScope.join(', ') : (userScope || ''));
const userBadge = userLbl
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle;max-width:140px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;display:inline-block">' + userLbl + '</span>'
: '';
const dateFrom = tok.scope?.valid_from || '';
const dateTo = tok.scope?.valid_to || '';
const dateBadge = (dateFrom || dateTo)
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:rgba(80,160,80,.25);color:var(--text);margin-left:5px;font-weight:600;vertical-align:middle">' +
(dateFrom || '…') + ' ' + (dateTo || '…') +
'</span>'
: '';
row.innerHTML =
'<div style="flex:1;min-width:0">' +
'<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' +
(tok.label || '<span style="color:var(--muted);font-style:italic">' + t('share_unlabelled', 'Unlabelled') + '</span>') +
roleBadge + userBadge + dateBadge +
'</div>' +
'<div style="font-size:10px;color:var(--muted);margin-top:1px">' +
t('share_expires_prefix', 'Expires:') + ' ' + expires + ' &nbsp;·&nbsp; ' + t('share_last_used', 'Last used:') + ' ' + lastUsed +
@ -65,6 +221,17 @@ async function _renderTokenList() {
'<button title="' + t('share_revoke', 'Revoke') + '" onclick="revokeToken(\'' + tok.token + '\',this.closest(\'div[style]\'))" ' +
'style="height:24px;padding:0 8px;background:none;border:1px solid var(--danger);color:var(--danger);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0">' + t('share_revoke', 'Revoke') + '</button>';
list.appendChild(row);
// Briefly highlight a freshly created link so it is easy to find and copy.
if (highlightToken && tok.token === highlightToken) {
row.style.transition = 'border-color .3s, background .3s';
row.style.borderColor = 'var(--accent)';
row.style.background = 'rgba(80,160,80,.18)';
setTimeout(function() { row.scrollIntoView({block: 'nearest'}); }, 0);
setTimeout(function() {
row.style.borderColor = 'var(--border)';
row.style.background = 'var(--bg)';
}, 2500);
}
});
} catch(e) {
list.innerHTML = '<div style="font-size:12px;color:var(--danger);padding:4px 0">' + t('share_load_error', 'Failed to load links.') + '</div>';
@ -74,8 +241,32 @@ async function _renderTokenList() {
async function createShareLink() {
const label = document.getElementById('shareLabel').value.trim();
const expiry = document.getElementById('shareExpiry').value;
const scopeType = document.getElementById('shareScopeType')?.value || '';
const validFrom = document.getElementById('shareValidFrom')?.value || '';
const validTo = document.getElementById('shareValidTo')?.value || '';
const body = {label};
if (expiry) body.expires_days = parseInt(expiry);
if (scopeType === 'role') {
const role = document.getElementById('shareScope')?.value || '';
if (role) body.scope = {role};
} else if (scopeType === 'user') {
if (_selectedScopeUser) {
body.scope = { user: _selectedScopeUser.emails, display_name: _selectedScopeUser.display_name };
} else {
// Manual entry fallback — treat raw input as a single email
const email = (document.getElementById('shareScopeUser')?.value || '').trim().toLowerCase();
if (!email || !email.includes('@')) {
alert(t('share_scope_user_invalid', 'Please enter a valid email address for the user scope.'));
return;
}
body.scope = { user: [email], display_name: email };
}
}
if (validFrom || validTo) {
if (!body.scope) body.scope = {};
if (validFrom) body.scope.valid_from = validFrom;
if (validTo) body.scope.valid_to = validTo;
}
try {
const r = await fetch('/api/viewer/tokens', {
method: 'POST', headers: {'Content-Type':'application/json'},
@ -83,48 +274,51 @@ async function createShareLink() {
});
if (!r.ok) throw new Error('Server error ' + r.status);
const entry = await r.json();
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(entry.token);
const urlInput = document.getElementById('shareNewLinkUrl');
urlInput.value = url;
document.getElementById('shareNewLinkRow').style.display = 'block';
document.getElementById('shareCopyBtn').textContent = t('log_copy', 'Copy');
document.getElementById('shareLabel').value = '';
_renderTokenList();
// The new link appears in the active-links list below (each row has its
// own Copy button) — reset the form and highlight the just-created row
// rather than leaving a stale link preview in the create box.
_resetShareForm();
_renderTokenList(entry.token);
} catch(e) {
alert(t('share_create_error', 'Failed to create link:') + ' ' + e.message);
}
}
function copyShareLink() {
const url = document.getElementById('shareNewLinkUrl').value;
_copyText(url, document.getElementById('shareCopyBtn'));
}
async function copyTokenLink(token, btn) {
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(token);
_copyText(url, btn);
}
function _copyText(text, btn) {
navigator.clipboard.writeText(text).then(() => {
const done = () => {
const orig = btn.textContent;
btn.textContent = t('share_copied', 'Copied!');
setTimeout(() => { btn.textContent = orig; }, 1800);
}).catch(() => {
// Fallback for HTTP contexts
};
// Fallback for HTTP contexts, where navigator.clipboard is undefined
// (the Clipboard API only exists in secure contexts — HTTPS or localhost).
const fallback = () => {
let ok = false;
try {
const ta = document.createElement('textarea');
ta.value = text;
ta.style.position = 'fixed'; ta.style.opacity = '0';
ta.setAttribute('readonly', '');
document.body.appendChild(ta);
ta.focus();
ta.select();
document.execCommand('copy');
ok = document.execCommand('copy');
document.body.removeChild(ta);
const orig = btn.textContent;
btn.textContent = t('share_copied', 'Copied!');
setTimeout(() => { btn.textContent = orig; }, 1800);
} catch(_) {}
});
} catch(_) { ok = false; }
if (ok) done();
// Last resort: show the link in a prompt so it can be copied manually.
else prompt(t('share_copy_link_prompt', 'Copy link:'), text);
};
if (navigator.clipboard && navigator.clipboard.writeText) {
navigator.clipboard.writeText(text).then(done).catch(fallback);
} else {
fallback();
}
}
async function revokeToken(token, rowEl) {
@ -137,12 +331,6 @@ async function revokeToken(token, rowEl) {
if (!list.children.length) {
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('share_no_links', 'No active links.') + '</div>';
}
// Hide the copy row if the just-revoked token was the last created
const newRow = document.getElementById('shareNewLinkRow');
if (newRow) {
const shownUrl = document.getElementById('shareNewLinkUrl')?.value || '';
if (shownUrl.includes(token)) newRow.style.display = 'none';
}
} catch(e) {
alert(t('share_revoke_error', 'Failed to revoke:') + ' ' + e.message);
}
@ -227,13 +415,96 @@ async function stClearViewerPin() {
}
}
// ── Interface PIN — Settings UI ───────────────────────────────────────────────
async function stLoadInterfacePinStatus() {
try {
const r = await fetch('/api/interface/pin');
const d = await r.json();
const statusEl = document.getElementById('stInterfacePinStatus');
const currentRow = document.getElementById('stInterfaceCurrentPinRow');
const clearBtn = document.getElementById('stInterfacePinClearBtn');
if (d.pin_set) {
if (statusEl) statusEl.textContent = '\u2714 ' + t('interface_pin_is_set', 'Interface PIN is set');
if (currentRow) currentRow.style.display = '';
if (clearBtn) clearBtn.style.display = '';
} else {
if (statusEl) statusEl.textContent = t('interface_pin_not_set_msg', 'No PIN set \u2014 interface is open to anyone on the network');
if (currentRow) currentRow.style.display = 'none';
if (clearBtn) clearBtn.style.display = 'none';
}
} catch(e) {}
}
async function stSaveInterfacePin() {
const newPin = (document.getElementById('stInterfaceNewPin')?.value || '').trim();
const currentPin = (document.getElementById('stInterfaceCurrentPin')?.value || '').trim();
const st = document.getElementById('stInterfacePinSaveStatus');
if (!newPin) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('m365_settings_pin_required', 'PIN is required.'); }
return;
}
if (!/^\d{4,8}$/.test(newPin)) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('viewer_pin_format', 'PIN must be 4\u20138 digits.'); }
return;
}
if (st) { st.style.color = 'var(--muted)'; st.textContent = t('viewer_pin_saving', 'Saving\u2026'); }
try {
const r = await fetch('/api/interface/pin', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({pin: newPin, current_pin: currentPin})
});
const d = await r.json();
if (!r.ok) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = d.error || 'Error.'; }
return;
}
if (st) { st.style.color = 'var(--accent)'; st.textContent = '\u2714 ' + t('interface_pin_saved', 'PIN saved'); }
if (document.getElementById('stInterfaceNewPin')) document.getElementById('stInterfaceNewPin').value = '';
if (document.getElementById('stInterfaceCurrentPin')) document.getElementById('stInterfaceCurrentPin').value = '';
stLoadInterfacePinStatus();
} catch(e) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = e.message; }
}
}
async function stClearInterfacePin() {
const currentPin = (document.getElementById('stInterfaceCurrentPin')?.value || '').trim();
const st = document.getElementById('stInterfacePinSaveStatus');
if (!currentPin) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('m365_settings_pin_required', 'PIN is required.'); }
document.getElementById('stInterfaceCurrentPin')?.focus();
return;
}
if (!confirm(t('interface_pin_clear_confirm', 'Remove the interface PIN? The scanner will be accessible to anyone on the network.'))) return;
try {
const r = await fetch('/api/interface/pin', {
method: 'DELETE', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({current_pin: currentPin})
});
const d = await r.json();
if (!r.ok) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = d.error || 'Error.'; }
return;
}
if (st) { st.style.color = 'var(--muted)'; st.textContent = t('interface_pin_cleared', 'PIN cleared'); }
stLoadInterfacePinStatus();
} catch(e) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = e.message; }
}
}
// ── Window exports ────────────────────────────────────────────────────────────
window._shareScopeTypeChanged = _shareScopeTypeChanged;
window.openShareModal = openShareModal;
window.closeShareModal = closeShareModal;
window.createShareLink = createShareLink;
window.copyShareLink = copyShareLink;
window._copyText = _copyText;
window.copyTokenLink = copyTokenLink;
window.revokeToken = revokeToken;
window.stLoadViewerPinStatus = stLoadViewerPinStatus;
window.stSaveViewerPin = stSaveViewerPin;
window.stClearViewerPin = stClearViewerPin;
window.stLoadInterfacePinStatus = stLoadInterfacePinStatus;
window.stSaveInterfacePin = stSaveInterfacePin;
window.stClearInterfacePin = stClearInterfacePin;

@ -197,7 +197,7 @@
.filter-clear:hover { border-color: var(--danger); color: var(--danger); }
/* Grid */
.grid-area { flex: 1; overflow-y: auto; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
.grid-area { flex: 1; overflow-y: auto; overflow-anchor: none; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
.grid-area::-webkit-scrollbar { width: 4px; }
.grid-area::-webkit-scrollbar-track { background: transparent; }
.grid-area::-webkit-scrollbar-thumb { background: var(--border); border-radius: 2px; }
@ -234,7 +234,7 @@
.preview-meta { padding: 10px 14px; border-top: 1px solid var(--border); font-size: 11px; color: var(--muted); display: flex; gap: 10px; flex-wrap: wrap; flex-shrink: 0; }
.preview-open-btn { margin-left: auto; background: var(--accent); color: #fff; border: none; border-radius: 5px; padding: 4px 10px; font-size: 11px; cursor: pointer; white-space: nowrap; }
.card.selected { outline: 2px solid var(--accent); outline-offset: 2px; }
.card { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; }
.card { position: relative; background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; }
.card:hover { border-color: var(--accent); box-shadow: 0 0 0 1px var(--accent); }
.card.list-view { display: flex; align-items: center; gap: 12px; padding: 10px 14px; border-radius: 8px; }
.thumb-wrap { aspect-ratio: 7/9; overflow: hidden; background: var(--bg); }
@ -253,6 +253,31 @@
.card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-delete-btn { opacity:1; }
.card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; }
.card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-redact-btn { opacity:1; }
.card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; }
/* Per-card checkbox (select mode) */
.card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2;
display:none; accent-color:var(--accent); }
body.select-mode .card-cb { display:block; }
.card.card-selected-bulk { outline:2px solid var(--accent); outline-offset:2px; background:color-mix(in srgb, var(--accent) 8%, var(--surface)); }
body.select-mode .card { cursor:default; }
body.select-mode .card:hover { border-color:var(--accent); }
/* Disposition stats bar */
.disp-stats-bar { display:flex; align-items:center; gap:8px; padding:4px 16px;
background:var(--bg); border-bottom:1px solid var(--border);
font-size:11px; color:var(--muted); flex-shrink:0; flex-wrap:wrap; }
.disp-stat-sep { width:1px; height:10px; background:var(--border); flex-shrink:0; }
.disp-stat-warn { color:var(--danger); font-weight:600; }
.disp-stat-ok { color:var(--success); }
/* Bulk tag bar */
.bulk-tag-bar { display:flex; align-items:center; gap:8px; padding:6px 16px;
background:var(--surface); border-top:1px solid var(--border);
font-size:12px; color:var(--text); flex-shrink:0; flex-wrap:wrap; }
.bulk-tag-bar button { height:26px; padding:0 10px; border-radius:5px; font-size:12px; cursor:pointer; box-sizing:border-box; }
.bulk-delete-modal { max-width:460px; }
.bulk-criteria-row { display:flex; align-items:center; gap:8px; margin-bottom:8px; font-size:12px; }
.bulk-criteria-row label { flex:0 0 130px; color:var(--muted); }
@ -336,17 +361,17 @@
.settings-backdrop.open { display:flex; }
.settings-modal {
background:var(--surface); border:1px solid var(--border);
border-radius:10px; width:min(540px,96vw);
border-radius:10px; width:min(720px,96vw);
display:flex; flex-direction:column; overflow:hidden;
font-size:12px; color:var(--text);
}
.settings-header { padding:16px 20px 0; display:flex; align-items:center; justify-content:space-between; }
.settings-header h2 { font-size:14px; font-weight:700; margin:0; }
.settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; }
.settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; flex-wrap:wrap; }
.settings-tab {
height:36px; padding:0 14px; font-size:12px; cursor:pointer; border:none;
background:none; color:var(--muted); border-bottom:2px solid transparent;
margin-bottom:-1px; font-weight:500;
margin-bottom:-1px; font-weight:500; white-space:nowrap;
}
.settings-tab.active { color:var(--accent); border-bottom-color:var(--accent); font-weight:600; }
.settings-body { padding:16px 20px; overflow-y:auto; max-height:65vh; display:flex; flex-direction:column; gap:14px; }
@ -469,6 +494,18 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; }
.card-resolved { opacity: 0.6; }
.resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px;
color: var(--muted); border-top: 1px dashed var(--border); text-align: center; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }
.phone-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a4030; color: #7eeac0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .phone-badge { background: #d0f5ea; color: #005a3a; }
.badge-email { background: rgba(139,68,173,.2); color: #b87fd8; }
.badge-onedrive { background: rgba(0,120,212,.2); color: #5ba4e8; }
.badge-sharepoint { background: rgba(0,160,100,.2); color: #2ecc71; }
@ -607,6 +644,8 @@
body.viewer-mode .config-group { display: none !important; }
body.viewer-mode #resumeBanner { display: none !important; }
body.viewer-mode #bulkDeleteBtn { display: none !important; }
body.viewer-mode #selectModeBtn { display: none !important; }
body.viewer-mode #bulkTagBar { display: none !important; }
body.viewer-mode .card-delete-btn { display: none !important; }
body.viewer-mode #dsubDeleteBtn { display: none !important; }
body.viewer-mode #shareBtn { display: none !important; }

View File

@ -13,6 +13,7 @@
var LANG = {{ lang_json | safe }};
// ── Viewer mode ───────────────────────────────────────────────────────────────
window.VIEWER_MODE = {{ 'true' if viewer_mode else 'false' }};
window.VIEWER_SCOPE = {{ viewer_scope | safe if viewer_scope is defined else '{}' }};
function t(key, fallback) {
return LANG[key] !== undefined ? LANG[key] : (fallback !== undefined ? fallback : key);
}
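The `t()` helper in the hunk above resolves a string in three steps: the translation if the key is defined, otherwise the supplied fallback, otherwise the key itself. A minimal sketch of that behaviour follows; the `LANG` table here is a hypothetical stand-in for the object the template injects from `lang_json`.

```javascript
// Hypothetical language table; in the template LANG comes from lang_json.
const LANG = { m365_btn_scan: "Scan", m365_filter_clear: "" };

// Same lookup as the template's t(): translation, else fallback, else the key.
function t(key, fallback) {
  return LANG[key] !== undefined ? LANG[key] : (fallback !== undefined ? fallback : key);
}

console.log(t("m365_btn_scan", "Start"));      // key is defined, fallback ignored
console.log(t("missing_key", "Fallback"));     // key is missing, fallback used
console.log(t("missing_key"));                 // no fallback, key echoed back
console.log(JSON.stringify(t("m365_filter_clear", "Clear"))); // empty string counts as defined
```

Because the check is `!== undefined` rather than a truthiness test, an empty translation is returned as-is instead of triggering the fallback.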
@ -109,6 +110,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div id="deltaStatusRow" style="display:none;font-size:10px;padding:3px 0 2px;color:var(--muted)">
<span id="deltaStatusText"></span>
<button onclick="clearDeltaTokens()" style="background:none;border:none;color:var(--danger);font-size:10px;cursor:pointer;padding:0 0 0 6px" data-i18n="m365_delta_clear">Clear tokens</button>
<span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_delta_tokens_hint">Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.</span></span>
</div>
<!-- Photo / biometric scan (#9) -->
@ -119,6 +121,62 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<label class="toggle"><input type="checkbox" id="optScanPhotos"><span class="toggle-slider"></span></label>
</div>
<!-- Skip GPS in images -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_skip_gps">Ignore GPS in images</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_skip_gps_hint">Images with GPS coordinates are not flagged. Useful for student scans, where smartphones embed location data in every photo.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optSkipGps"><span class="toggle-slider"></span></label>
</div>
<!-- Minimum CPR count per file -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_min_cpr">Min. CPR count per file</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_min_cpr_hint">Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR numbers in files.</span></span>
</span>
<input type="number" id="optMinCpr" value="1" min="1" max="50"
style="width:46px;padding:3px 6px;font-size:11px;text-align:right">
</div>
<!-- OCR language -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_ocr_lang">OCR language</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_ocr_lang_hint">Tesseract language pack(s) used when scanning scanned PDFs and images. Must match installed language packs.</span></span>
</span>
<select id="optOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px">
<option value="dan+eng">dan+eng</option>
<option value="dan">dan</option>
<option value="eng">eng</option>
<option value="dan+eng+deu">dan+eng+deu</option>
<option value="dan+eng+swe">dan+eng+swe</option>
<option value="dan+eng+fra">dan+eng+fra</option>
</select>
</div>
<!-- CPR-only mode -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_cpr_only">CPR-only mode</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_cpr_only_hint">Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, faces, or EXIF metadata are ignored.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optCprOnly"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for email addresses -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_emails">Scan for email addresses</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_emails_hint">Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanEmails"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for phone numbers -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_phones">Scan for phone numbers</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_phones_hint">Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanPhones"><span class="toggle-slider"></span></label>
</div>
<!-- Retention policy (suggestion #1) -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
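The minimum-CPR option in the hunk above reports a file only when it holds at least a given number of distinct CPR numbers. The sketch below illustrates that rule under loud assumptions: `distinctCprCount` and `shouldReport` are hypothetical names, and the pattern only matches the `DDMMYY-XXXX` shape (hyphen optional) without the date or checksum validation a real scanner would likely apply.

```javascript
// Simplified, assumed pattern: six digits, optional hyphen, four digits.
const CPR_PATTERN = /\b(\d{6})-?(\d{4})\b/g;

// Count distinct CPR-shaped numbers; hyphenated and plain forms are merged.
function distinctCprCount(text) {
  const seen = new Set();
  for (const match of text.matchAll(CPR_PATTERN)) {
    seen.add(match[1] + match[2]);
  }
  return seen.size;
}

// Hypothetical threshold check mirroring the optMinCpr input (default 1).
function shouldReport(text, minCpr = 1) {
  return distinctCprCount(text) >= minCpr;
}

const ownNumberOnly = "Student form: 010190-1234 (repeated: 0101901234)";
const classList = "010190-1234, 020285-4321";
console.log(shouldReport(ownNumberOnly, 2)); // one distinct number, below threshold
console.log(shouldReport(classList, 2));     // two distinct numbers, reported
```

With the threshold at 2, a file that repeats a single student's own number is skipped, while a list containing several different numbers is still flagged.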
@ -268,7 +326,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<!-- Topbar -->
<div class="topbar">
<span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
<button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
<button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
<button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
<!-- Profile selector (15c) -->
@ -309,6 +367,17 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button onclick="clearCheckpointAndScan()" style="padding:3px 10px;border-radius:5px;background:none;border:1px solid var(--border);color:var(--muted);cursor:pointer;font-size:12px" data-i18n="m365_btn_start_fresh">Start fresh</button>
</div>
<!-- History mode banner -->
<div id="historyBanner" style="display:none;align-items:center;gap:10px;padding:6px 14px;background:var(--surface);border-bottom:1px solid var(--border);font-size:12px">
<span style="font-size:11px;font-weight:600;color:var(--muted);flex-shrink:0" data-i18n="history_lbl">History</span>
<span id="historyBannerText" style="flex:1;font-size:11px;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap"></span>
<div data-history-wrap style="position:relative;flex-shrink:0">
<button id="historyPickerBtn" type="button" onclick="openHistoryPicker()" style="height:24px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:4px;font-size:11px;cursor:pointer" data-i18n="history_btn_sessions">Sessions</button>
<div id="historyDropdown" style="display:none;position:absolute;right:0;top:calc(100% + 4px);background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;width:300px;max-height:260px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.25)"></div>
</div>
<button id="historyLatestBtn" type="button" onclick="loadHistorySession(null)" style="display:none;height:24px;padding:0 10px;background:none;border:1px solid var(--accent);color:var(--accent);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="history_btn_latest">Open items</button>
</div>
<!-- Filter bar — full width, above grid + preview -->
<div class="filter-bar" id="filterBar">
<input type="text" id="filterSearch" data-i18n-placeholder="m365_filter_search" placeholder="Search…" oninput="applyFilters()">
@ -344,14 +413,24 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<option value="1" data-i18n="m365_filter_special_only">⚠ Art. 9 only</option>
<option value="photo" data-i18n="m365_filter_photo_only">📷 Photos / biometric</option>
</select>
<span id="viewerIdentityBadge" style="display:none;font-size:11px;padding:2px 8px;border-radius:10px;background:var(--muted);color:#fff;font-weight:600;white-space:nowrap;max-width:180px;overflow:hidden;text-overflow:ellipsis"></span>
<select id="filterRole" onchange="applyFilters()" style="width:120px">
<option value="" data-i18n="m365_filter_all_roles">All roles</option>
<option value="staff" data-i18n="m365_filter_staff">Staff</option>
<option value="student" data-i18n="m365_filter_student">Students</option>
</select>
<button class="filter-clear" onclick="clearFilters()" data-i18n="m365_filter_clear">Clear</button>
<div class="spacer"></div>
<button id="exportBtn" onclick="exportExcel()" style="background:none;border:1px solid var(--border);color:var(--muted)" data-i18n="m365_btn_export_excel" title="Export results as Excel">Excel</button>
<button id="exportA30Btn" onclick="exportArticle30()" style="background:none;border:1px solid var(--accent);color:var(--accent)" data-i18n="m365_btn_export_article30" title="Export GDPR Article 30 report as Word document">Art.30</button>
<button id="bulkDeleteBtn" onclick="openBulkDelete()" style="background:none;border:1px solid var(--danger);color:var(--danger)" data-i18n="m365_btn_bulk_delete" title="Bulk delete">Delete</button>
<button id="selectModeBtn" style="background:none;border:1px solid var(--border);color:var(--muted)" onclick="toggleSelectMode()" data-i18n="bulk_select_mode">Select</button>
<button id="listViewBtn" style="background:none;border:1px solid var(--border);color:var(--muted)" onclick="toggleView()" data-i18n="m365_btn_list_view">List</button>
</div>
<!-- Disposition stats bar -->
<div id="dispStats" class="disp-stats-bar" style="display:none"></div>
<!-- Content area: grid + preview panel -->
<div class="content-area">
<div style="flex:1; display:flex; flex-direction:column; overflow:hidden; min-width:220px">
@ -364,6 +443,24 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
<div id="lastScanSummary" style="display:none" class="empty-state last-scan-summary"></div>
<div class="grid" id="grid" style="display:none"></div>
<!-- Bulk disposition tag bar (visible only in select mode with items selected) -->
<div id="bulkTagBar" class="bulk-tag-bar" style="display:none">
<span id="bulkTagCount" style="font-weight:600;white-space:nowrap"></span>
<button id="bulkSelectAll" type="button" onclick="selectAllVisible()" data-i18n="bulk_select_all">Select all visible</button>
<select id="bulkDispSelect" style="height:26px;font-size:12px;padding:0 8px;flex:0 0 auto">
<option value="retain-legal" data-i18n="m365_disp_retain_legal">Retain: legal</option>
<option value="retain-legitimate" data-i18n="m365_disp_retain_legit">Retain: legitimate</option>
<option value="retain-contract" data-i18n="m365_disp_retain_contract">Retain: contract</option>
<option value="delete-scheduled" data-i18n="m365_disp_delete_sched">Delete: scheduled</option>
<option value="deleted" data-i18n="m365_disp_deleted">Deleted</option>
<option value="personal-use" data-i18n="m365_disp_personal_use">Personal use</option>
<option value="unreviewed" data-i18n="m365_disp_unreviewed">Not reviewed</option>
</select>
<button id="bulkTagApplyBtn" type="button" onclick="applyBulkDisposition()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:5px;font-size:12px;cursor:pointer;font-weight:600" data-i18n="bulk_apply">Apply</button>
<span id="bulkTagStatus" style="font-size:11px;color:var(--accent)"></span>
<button type="button" onclick="toggleSelectMode()" style="margin-left:auto;background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 10px;border-radius:5px;font-size:12px;cursor:pointer" data-i18n="bulk_done">Done</button>
</div>
</div>
<!-- Progress bar -->
@ -405,6 +502,8 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<iframe id="previewFrame" sandbox="allow-scripts allow-same-origin allow-forms allow-popups" style="display:none"></iframe>
</div>
<div class="preview-meta" id="previewMeta"></div>
<!-- Related documents -->
<div id="previewRelated" style="display:none;padding:8px 14px 4px;border-top:1px solid var(--border)"></div>
<!-- Disposition widget (#6) -->
<div class="disposition-row" id="dispositionRow" style="display:none">
<span class="disposition-label" data-i18n="m365_disposition_label">Disposition</span>
@ -517,6 +616,8 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button class="settings-tab" id="stTabScheduler" onclick="switchSettingsTab('scheduler')" data-i18n="m365_settings_tab_scheduler">Scheduler</button>
<button class="settings-tab" id="stTabEmail" onclick="switchSettingsTab('email')" data-i18n="m365_settings_tab_email">Email report</button>
<button class="settings-tab" id="stTabDatabase" onclick="switchSettingsTab('database')" data-i18n="m365_settings_tab_database">Database</button>
<button class="settings-tab" id="stTabAuditlog" onclick="switchSettingsTab('auditlog')" data-i18n="m365_settings_tab_auditlog">Audit Log</button>
<button class="settings-tab" id="stTabAi" onclick="switchSettingsTab('ai')" data-i18n="m365_settings_tab_ai">AI / NER</button>
</div>
<div class="settings-body">
@ -541,6 +642,19 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="settings-about-row"><span>Requests</span><span id="st-about-requests" style="color:var(--muted)"></span></div>
<div class="settings-about-row"><span>openpyxl</span><span id="st-about-openpyxl" style="color:var(--muted)"></span></div>
</div>
<div class="settings-group" id="stUpdateGroup" style="display:none">
<div class="settings-group-title" data-i18n="m365_settings_updates">Software update</div>
<div id="stUpdateStatus" style="font-size:11px;color:var(--muted);margin-bottom:8px" data-i18n="m365_update_idle">Check whether a newer version is available.</div>
<div id="stUpdateCommits" style="display:none;font-size:11px;color:var(--muted);font-family:monospace;line-height:1.6;background:var(--bg);border:1px solid var(--border);border-radius:6px;padding:6px 10px;margin-bottom:8px;max-height:120px;overflow-y:auto"></div>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:10px">
<label class="toggle" style="flex:unset"><input type="checkbox" id="stAutoUpdate" onchange="stSaveAutoUpdate()"><span class="toggle-slider"></span></label>
<span style="font-size:12px" data-i18n="m365_update_auto">Install updates automatically (checked daily — the app restarts itself)</span>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px">
<button type="button" onclick="stCheckUpdate()" id="stCheckUpdateBtn" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="m365_update_check">Check for updates</button>
<button type="button" onclick="stApplyUpdate()" id="stApplyUpdateBtn" style="display:none;height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_update_install">Install update</button>
</div>
</div>
</div>
<!-- ── Security pane ─────────────────────────────────────────────────── -->
@ -584,6 +698,24 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button type="button" onclick="stSaveViewerPin()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_settings_save_pin">Save PIN</button>
</div>
</div>
<div class="settings-group">
<div class="settings-group-title" data-i18n="interface_pin_group_title">Interface PIN</div>
<div style="font-size:10px;color:var(--muted);line-height:1.5;margin-bottom:4px" data-i18n="interface_pin_desc">A numeric PIN (4–8 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style="font-size:10px">/view</code> are not affected.</div>
<div id="stInterfacePinStatus" style="font-size:10px;color:var(--muted);margin-bottom:6px"></div>
<div class="settings-row" id="stInterfaceCurrentPinRow" style="display:none">
<label data-i18n="m365_settings_current_pin">Current PIN</label>
<input id="stInterfaceCurrentPin" type="password" autocomplete="off" placeholder="••••">
</div>
<div class="settings-row">
<label data-i18n="m365_settings_new_pin">New PIN</label>
<input id="stInterfaceNewPin" type="password" inputmode="numeric" maxlength="8" autocomplete="off" placeholder="4–8 digits">
</div>
<div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px">
<div id="stInterfacePinSaveStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div>
<button type="button" onclick="stClearInterfacePin()" id="stInterfacePinClearBtn" style="display:none;background:none;border:1px solid var(--danger);color:var(--danger);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="interface_pin_clear">Clear PIN</button>
<button type="button" onclick="stSaveInterfacePin()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_settings_save_pin">Save PIN</button>
</div>
</div>
</div>
<!-- ── Scheduler pane (#19) ──────────────────────────────────────────── -->
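The Interface PIN group in the hunk above describes a numeric PIN with a bounded length, and the input caps entry at 8 characters. A client-side check along those lines might look like the sketch below. `isValidInterfacePin` is a hypothetical name, and the 4 to 8 digit range is assumed from the field description and `maxlength="8"`; the diff does not show the actual validation in `stSaveInterfacePin()`.

```javascript
// Hypothetical helper: accepts only strings of 4 to 8 ASCII digits (assumed rule).
function isValidInterfacePin(pin) {
  return typeof pin === "string" && /^\d{4,8}$/.test(pin);
}

console.log(isValidInterfacePin("1234"));      // shortest accepted length
console.log(isValidInterfacePin("12345678"));  // longest accepted length
console.log(isValidInterfacePin("123"));       // too short
console.log(isValidInterfacePin("12a4"));      // contains a non-digit
```

A check like this would only improve feedback in the form; the server would still need to enforce the same rule, since `maxlength` and `inputmode` are hints that a client can bypass.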
@ -640,12 +772,19 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<input id="schedMinute" type="number" min="0" max="59" value="0" style="width:50px">
</div>
</div>
<div class="settings-row">
<div class="settings-row" id="schedProfileRow">
<label data-i18n="m365_sched_profile">Profile</label>
<select id="schedProfile" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:5px;background:var(--surface);color:var(--text);font-size:12px;box-sizing:border-box">
<option value="" data-i18n="m365_sched_profile_last">Last saved settings</option>
</select>
</div>
<div class="settings-row">
<label data-i18n="m365_sched_report_only">Report only</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedReportOnly" onchange="schedToggleReportOnly()"><span class="toggle-slider"></span></label>
</div>
<div class="settings-row" id="schedReportOnlyHint" style="display:none">
<span style="font-size:10px;color:var(--muted);line-height:1.4" data-i18n="m365_sched_report_only_hint">Email the latest scan results without running a new scan. Requires scan results in the database.</span>
</div>
<div class="settings-row">
<label data-i18n="m365_sched_auto_email">Email report automatically</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedAutoEmail"><span class="toggle-slider"></span></label>
@ -702,6 +841,14 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<label data-i18n="m365_smtp_recipients">Recipients</label>
<input id="st-smtpTo" type="text" placeholder="a@school.dk, b@school.dk">
</div>
<div class="settings-row">
<label data-i18n="m365_smtp_auto_email_manual">Email report after manual scan</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpAutoEmail"><span class="toggle-slider"></span></label>
</div>
<div class="settings-row">
<label data-i18n="m365_smtp_prefer_smtp">Always send via SMTP (skip Microsoft Graph)</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpPreferSmtp"><span class="toggle-slider"></span></label>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px">
<div id="st-smtpStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div>
<button onclick="stSmtpSave()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_save">Save</button>
@ -729,6 +876,56 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
</div>
<!-- ── Audit Log pane ─────────────────────────────────────────────────── -->
<div class="settings-pane" id="stPaneAuditlog">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_audit_title">Compliance Audit Log</div>
<div style="overflow-x:auto">
<table id="stAuditTable" style="width:100%;border-collapse:collapse;font-size:12px">
<thead>
<tr style="text-align:left">
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_time">Time</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_action">Action</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_detail">Detail</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_ip">IP</th>
</tr>
</thead>
<tbody id="stAuditTableBody">
<tr><td colspan="4" style="padding:8px;color:var(--muted)" data-i18n="m365_audit_loading">Loading…</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="settings-pane" id="stPaneAi">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_ai_title">AI-Enhanced NER</div>
<p style="margin:0 0 12px;font-size:12px;color:var(--muted)" data-i18n="m365_ai_desc">Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.</p>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:14px">
<label class="toggle" style="flex-shrink:0">
<input type="checkbox" id="aiEnabled">
<span class="toggle-track"></span>
</label>
<span style="font-size:13px" data-i18n="m365_ai_enable">Enable Claude NER</span>
</div>
<div style="margin-bottom:12px">
<label style="font-size:12px;color:var(--muted);display:block;margin-bottom:4px" data-i18n="m365_ai_api_key_label">Anthropic API key</label>
<div style="display:flex;gap:6px">
<input type="password" id="aiApiKey" placeholder="sk-ant-…" autocomplete="off" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:6px;background:var(--bg);color:var(--text);font-size:12px;box-sizing:border-box">
<button type="button" onclick="stAiToggleKey()" id="aiShowKeyBtn" style="height:26px;padding:0 10px;border:1px solid var(--border);background:none;color:var(--muted);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_show_key">Show</button>
</div>
<span id="aiKeyStatus" style="font-size:11px;color:var(--muted);margin-top:4px;display:block"></span>
</div>
<div style="display:flex;gap:8px;align-items:center;flex-wrap:wrap">
<button type="button" onclick="stAiSave()" style="height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_save">Save</button>
<button type="button" onclick="stAiTest()" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_test">Test key</button>
<span id="aiStatus" style="font-size:12px"></span>
</div>
<p style="margin:14px 0 0;font-size:11px;color:var(--muted)" data-i18n="m365_ai_model_note">Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.</p>
</div>
</div>
</div><!-- /.settings-body -->
<div class="settings-footer">
<button onclick="closeSettings()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_close">Close</button>
@ -859,6 +1056,36 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_label_lbl">Label (optional)</div>
<input id="shareLabel" type="text" data-i18n-placeholder="share_label_placeholder" placeholder="e.g. DPO review 2026" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
<div style="width:120px">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_lbl">Scope</div>
<select id="shareScopeType" onchange="_shareScopeTypeChanged()" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<option value="" data-i18n="share_scope_all">All</option>
<option value="role" data-i18n="share_scope_type_role">Role</option>
<option value="user" data-i18n="share_scope_type_user">User</option>
</select>
</div>
<div id="shareScopeRoleWrap" style="width:110px;display:none">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_role_lbl">Role</div>
<select id="shareScope" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<option value="staff" data-i18n="share_scope_staff">Staff</option>
<option value="student" data-i18n="share_scope_student">Students</option>
</select>
</div>
<div id="shareScopeUserWrap" style="flex:1.5;min-width:140px;display:none;position:relative">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_user_lbl">User email</div>
<input id="shareScopeUser" type="text" autocomplete="off" data-i18n-placeholder="share_scope_user_placeholder" placeholder="alice@school.dk" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<div id="shareScopeUserDropdown" style="display:none;position:absolute;top:100%;left:0;right:0;margin-top:2px;background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;max-height:220px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.3)"></div>
</div>
<div style="display:flex;gap:6px;flex:1.5;min-width:200px">
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_from">Items from</div>
<input id="shareValidFrom" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_to">Items until</div>
<input id="shareValidTo" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
</div>
<div style="width:100px">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_expires_in">Expires in</div>
<select id="shareExpiry" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
@ -871,13 +1098,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
<button onclick="createShareLink()" style="height:30px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:5px;font-size:12px;cursor:pointer;flex-shrink:0" data-i18n="share_create">Create</button>
</div>
<div id="shareNewLinkRow" style="display:none;margin-top:10px">
<div style="font-size:11px;color:var(--muted);margin-bottom:4px" data-i18n="share_copy_link_prompt">Copy link:</div>
<div style="display:flex;gap:6px;align-items:center">
<input id="shareNewLinkUrl" type="text" readonly style="flex:1;font-size:11px;padding:5px 8px;background:var(--bg2,var(--bg));border:1px solid var(--border);border-radius:5px;color:var(--text);min-width:0">
<button onclick="copyShareLink()" id="shareCopyBtn" style="height:26px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:5px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="log_copy">Copy</button>
</div>
</div>
</div>
<!-- Existing tokens -->
@ -1120,30 +1340,93 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="srcmgmt-group">
<div class="srcmgmt-group-title" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form" style="border-color:var(--border)">
<!-- Source type selector -->
<div class="fsrc-form-row">
<label>Name <span style="color:var(--accent)">*</span></label>
<input id="srcFileLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
<label>Type</label>
<div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcTypeLocal" onclick="srcFileTypeSelect('local')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_type_local">Local folder</button>
<button type="button" id="srcTypeSmb" onclick="srcFileTypeSelect('smb')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_smb">Network (SMB)</button>
<button type="button" id="srcTypeSftp" onclick="srcFileTypeSelect('sftp')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_sftp">SFTP</button>
</div>
</div>
<input type="hidden" id="srcFileSourceType" value="local">
<div class="fsrc-form-row">
<label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="srcFileLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<!-- Local / SMB path field -->
<div id="srcFilePathRow" class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label>
<input id="srcFilePath" type="text" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()">
<input id="srcFilePath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()">
</div>
<div id="srcFileSmbFields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent)" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="srcFileSmbHost" type="text" placeholder="nas.school.dk">
<input id="srcFileSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="srcFileSmbUser" type="text" placeholder="DOMAIN\\username">
<input id="srcFileSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_placeholder" placeholder="DOMAIN\\username">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="srcFileSmbPw" type="password" placeholder="Stored in OS keychain">
<input id="srcFileSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Saved to OS keychain — never stored in a file.</div>
</div>
<!-- SFTP fields -->
<div id="srcFileSftpFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_host">SFTP host</label>
<input id="srcFileSftpHost" type="text" data-i18n-placeholder="m365_fsrc_sftp_host_placeholder" placeholder="sftp.school.dk" oninput="srcFileAutoNameSftp()">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_port">Port</label>
<input id="srcFileSftpPort" type="number" value="22" min="1" max="65535" style="width:70px">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_user">Username</label>
<input id="srcFileSftpUser" type="text" data-i18n-placeholder="m365_fsrc_sftp_user_placeholder" placeholder="backup_user">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_remote_path">Remote path</label>
<input id="srcFileSftpPath" type="text" data-i18n-placeholder="m365_fsrc_sftp_path_placeholder" placeholder="/var/data" value="/">
</div>
<!-- Auth type toggle -->
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_auth">Auth</label>
<div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcSftpAuthPw" onclick="srcFileSftpAuthSelect('password')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_sftp_auth_password">Password</button>
<button type="button" id="srcSftpAuthKey" onclick="srcFileSftpAuthSelect('key')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_sftp_auth_key">SSH key</button>
</div>
</div>
<input type="hidden" id="srcFileSftpAuth" value="password">
<!-- Password auth -->
<div id="srcSftpPwFields">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_pw">Password</label>
<input id="srcFileSftpPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div>
<!-- Key auth -->
<div id="srcSftpKeyFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_key_upload">Private key</label>
<div style="display:flex;gap:6px;align-items:center">
<input id="srcFileSftpKeyFile" type="file" accept=".pem,.key,.pub,*" style="flex:1;font-size:11px">
<span id="srcFileSftpKeyStatus" style="font-size:10px;color:var(--muted)"></span>
</div>
</div>
<input type="hidden" id="srcFileSftpKeyPath" value="">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_passphrase">Passphrase</label>
<input id="srcFileSftpPassphrase" type="password" data-i18n-placeholder="m365_fsrc_sftp_passphrase_placeholder" placeholder="Leave blank if key has no passphrase">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_passphrase_hint">Passphrase is saved to the OS keychain — never stored in a file.</div>
</div>
</div>
<div style="display:flex;align-items:center;gap:8px">
<input type="hidden" id="srcFileEditId" value="">
<div id="srcFileStatus" style="flex:1;font-size:11px;color:var(--muted)"></div>
@ -1174,26 +1457,26 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="fsrc-form" id="fsrcForm">
<div style="font-size:11px;font-weight:600;color:var(--text)" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_label">Name <span style="color:var(--accent)">*</span></label>
<input id="fsrcLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
<label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="fsrcLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label>
<input id="fsrcPath" type="text" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()">
<input id="fsrcPath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()">
</div>
<div id="fsrcSmbFields" class="fsrc-smb-fields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent);margin:-2px 0 2px" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="fsrcSmbHost" type="text" placeholder="nas.school.dk">
<input id="fsrcSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="fsrcSmbUser" type="text" placeholder="DOMAIN\\username or username">
<input id="fsrcSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_edit_placeholder" placeholder="DOMAIN\\username or username">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="fsrcSmbPw" type="password" placeholder="Stored in OS keychain">
<input id="fsrcSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div>
@ -1252,7 +1535,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<option value="replace" data-i18n="m365_db_import_replace">Replace (full restore)</option>
</select>
</div>
<div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.</div>
<div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.</div>
<div id="importDbStatus" style="min-height:16px;font-size:11px;color:var(--muted)"></div>
<div style="display:flex;justify-content:flex-end;gap:8px;padding-top:4px;border-top:1px solid var(--border)">
<button onclick="closeImportDBModal()" style="background:none;border:1px solid var(--border);color:var(--muted);padding:5px 14px;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_close">Close</button>
@ -1272,5 +1555,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<script type="module" src="/static/js/scheduler.js"></script>
<script type="module" src="/static/js/connector.js"></script>
<script type="module" src="/static/js/viewer.js"></script>
<script type="module" src="/static/js/history.js"></script>
</body>
</html>

@ -0,0 +1,86 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>GDPRScanner — {{ LANG.get('interface_pin_login_btn', 'Sign in') }}</title>
<link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
<style>
body { display: flex; align-items: center; justify-content: center; min-height: 100vh; margin: 0; }
.pin-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 32px 40px;
width: min(340px, 92vw);
box-sizing: border-box;
}
.pin-card h1 { font-size: 15px; font-weight: 600; margin: 0 0 6px; color: var(--text); }
.pin-card p { font-size: 12px; color: var(--muted); margin: 0 0 18px; }
.pin-input {
width: 100%; box-sizing: border-box;
font-size: 22px; letter-spacing: .3em; text-align: center;
padding: 10px 12px; border-radius: 6px;
border: 1px solid var(--border); background: var(--bg);
color: var(--text); outline: none; margin-bottom: 12px;
}
.pin-input:focus { border-color: var(--accent); }
.pin-btn {
width: 100%; padding: 10px; border: none; border-radius: 6px;
background: var(--accent); color: #fff; font-size: 13px;
font-weight: 600; cursor: pointer; font-family: var(--sans);
}
.pin-btn:disabled { opacity: .5; cursor: default; }
.pin-error { font-size: 12px; color: var(--danger); margin-top: 8px; min-height: 16px; text-align: center; }
</style>
</head>
<body data-theme="dark">
<div class="pin-card">
<h1>GDPRScanner</h1>
<p>{{ LANG.get('interface_pin_login_desc', 'Enter the interface PIN to continue.') }}</p>
<input id="pinInput" class="pin-input" type="password" inputmode="numeric"
maxlength="8" placeholder="••••" autocomplete="off"
onkeydown="if(event.key==='Enter')submitPin()">
<button class="pin-btn" id="pinBtn" onclick="submitPin()">{{ LANG.get('interface_pin_login_btn', 'Continue') }}</button>
<div class="pin-error" id="pinError"></div>
</div>
<script>
const _L = {
incorrect: {{ LANG.get('interface_pin_err_incorrect', 'Incorrect PIN.') | tojson }},
tooMany: {{ LANG.get('interface_pin_err_too_many', 'Too many attempts. Try again later.') | tojson }},
network: {{ LANG.get('interface_pin_err_network', 'Network error. Please try again.') | tojson }}
};
async function submitPin() {
const pin = document.getElementById('pinInput').value.trim();
if (!pin) return;
const btn = document.getElementById('pinBtn');
const err = document.getElementById('pinError');
btn.disabled = true;
err.textContent = '';
try {
const r = await fetch('/api/interface/pin/verify', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({pin})
});
if (r.ok) {
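const raw = new URLSearchParams(window.location.search).get('next') || '/';
// Only follow same-origin paths; reject absolute and protocol-relative URLs.
const next = /^\/(?![\/\\])/.test(raw) ? raw : '/';
window.location.href = next;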
} else {
const d = await r.json().catch(() => ({}));
err.textContent = r.status === 429 ? (d.error || _L.tooMany) : (d.error || _L.incorrect);
if (r.status !== 429) {
document.getElementById('pinInput').value = '';
document.getElementById('pinInput').focus();
}
btn.disabled = false;
}
} catch(e) {
err.textContent = _L.network;
btn.disabled = false;
}
}
document.getElementById('pinInput').focus();
</script>
</body>
</html>

@ -0,0 +1,19 @@
Personoplysninger — Elevakt
===========================
Elevens navn: Lars Bjerregaard Nielsen
Klasse: 8B
Skole: Gudenaaskolen
CPR-nummer: 010172-1019
Fødselsdato: 1. januar 1972
Adresse: Skolevej 14, 8680 Ry
Telefon: +45 86 89 12 34
E-mail: lars.nielsen@privat.dk
Notater:
Eleven har haft fravær i uge 12 og 14. Forældrene er kontaktet.
Der er afholdt møde den 3. marts 2024 med klasselærer og skoleleder.
Underskrift: _______________________
Dato: ___________________

@ -0,0 +1,15 @@
Besøgslog — Sundhedscenter Skanderborg
=======================================
Dato: 28. april 2024
Sagsbehandler: M. Andersen
Borger: Hanne Kirstine Pedersen
Registreringsnummer: 280490-0120
Henvendelse vedrørende: Sygedagpenge, paragraf 7 opfølgning
Samtalen fandt sted kl. 10:15 og varede 45 minutter.
Borger mødte op til tiden og var forberedt.
Aftale om næste møde: 26. maj 2024 kl. 10:00
Sted: Mødelokale 3, Adelgade 44, 8660 Skanderborg

@ -0,0 +1,24 @@
Tilmelding til SFO — Gudenaaskolen
===================================
Barnets navn: Emma Sofie Christensen
Personnummer: 150315-4321
Klasse: 1A (skolestart august 2022)
Forældrenes oplysninger
-----------------------
Forældrenes navn: Søren og Pia Christensen
Adresse: Birkevej 7, 8680 Ry
Telefon: +45 23 45 67 89
E-mail: soeren.christensen@familie.dk
Fremmødetider valgt:
Morgen-SFO: 07:00–08:00
Eftermiddag: 13:00–17:00
Særlige oplysninger til pædagoger:
Emma har en lettere nøddeallergi (jordnødder og cashewnødder).
Kontaktperson ved allergi: Pia Christensen, tlf. 23 45 67 89
Dato for tilmelding: 15. marts 2022
Underskrift: _______________________

@ -0,0 +1,31 @@
Personalemappe — Fortroligt
============================
Afdeling: Administrationen, Skanderborg Kommune
Medarbejder 1
-------------
Navn: Christian Bøgh Hansen
CPR: 150365-1102
Stilling: Skoleleder
Ansættelsesdato: 1. august 2005
Løngruppe: L4
Medarbejder 2
-------------
Navn: Lise Ravn Johansen
CPR: 020898-0203
Stilling: Pædagog, fuldtid
Ansættelsesdato: 15. september 2021
Løngruppe: L2
Medarbejder 3
-------------
Navn: Anders Munk Mortensen
CPR: 010172-1019
Stilling: Administrativ medarbejder
Ansættelsesdato: 1. marts 2010
Løngruppe: L3
Dokument oprettet: 20. april 2026
Sidst opdateret: 20. april 2026
Udarbejdet af: HR-afdelingen

@ -0,0 +1,9 @@
Klasse,Navn,CPR-nummer,Adresse,Forælder tlf,Bemærkninger
7A,Magnus Lund Eriksen,010172-1019,Egevej 3 8680 Ry,+45 40 12 34 56,
7A,Nora Bjerrum Nielsen,280490-0120,Møllevej 11 8680 Ry,+45 50 23 45 67,Brillebærer
7A,Oliver Skov Madsen,250372-0100,Kirkegade 2 8660 Skanderborg,+45 60 34 56 78,
7A,Ida Holst Andersen,020898-0203,Skovbrynet 19 8680 Ry,+45 70 45 67 89,Kontaktperson: Far
7B,Rasmus Dal Kristensen,150365-1102,Rosenvej 5 8680 Ry,+45 21 56 78 90,
7B,Sofie Holm Thomsen,111111-1010,Birkevej 22 8660 Skanderborg,+45 31 67 89 01,Allergi: nødder
7B,Emil Sand Jensen,010107-4102,Hybenvej 7 8680 Ry,+45 41 78 90 12,
7B,Laura Bak Møller,410172-1200,Pilevej 4 8660 Skanderborg,+45 51 89 01 23,Beskyttet adresse

@ -0,0 +1,6 @@
Medarbejder-ID,Navn,Personnummer,Afdeling,Stilling,E-mail,Telefon,Ansættelses-dato
EMP-001,Christian Bøgh Hansen,150365-1102,Ledelse,Skoleleder,c.hansen@gudenaaskolen.dk,+45 86 89 10 01,2005-08-01
EMP-002,Mette Dahl Andersen,280490-0120,Administration,Sekretær,m.andersen@gudenaaskolen.dk,+45 86 89 10 02,2012-01-15
EMP-003,Søren Lykke Jakobsen,010172-1019,Pædagogik,Lærer,s.jakobsen@gudenaaskolen.dk,+45 86 89 10 03,2009-08-01
EMP-004,Hanne Frost Pedersen,250372-0100,Pædagogik,Lærer,h.pedersen@gudenaaskolen.dk,+45 86 89 10 04,2015-08-01
EMP-005,Lise Ravn Johansen,020898-0203,SFO,Pædagog,l.johansen@gudenaaskolen.dk,+45 86 89 10 05,2021-09-15

@ -0,0 +1,16 @@
Fortrolig personoplysning — Navne- og adressebeskyttelse
==========================================================
VIGTIGT: Denne person har navne- og adressebeskyttelse i CPR-registeret.
Oplysningerne må ikke videregives uden samtykke.
Navn: Laura Bak Møller
CPR-nummer: 410172-1200
(Dag + 40 angiver beskyttet adresse)
Kontaktoplysninger administreres af kommunen.
Henvendelse via: Borgerservice, Skanderborg Kommune
Telefon: 86 52 10 00
Dokumentet er klassificeret FORTROLIGT.
Opbevares i aflåst arkiv — ikke i fællesnetværk.

@ -0,0 +1,21 @@
Lægeerklæring — Helbredsattest
================================
Udstedt af: Skanderborg Lægepraksis, Adelgade 10, 8660 Skanderborg
Praktiserende læge: Dr. P. Holm
Patient: Søren Lykke Jakobsen
Fødselsdato / CPR: 010172-1019
Adresse: Skolevej 22, 8680 Ry
Telefon: +45 22 33 44 55
E-mail: soeren.jakobsen@privat.dk
Diagnose (ICD-10): F41.1 — Generaliseret angst
Behandling: Psykoterapi + medicinsk behandling (SSRI)
Særlig kategori: Psykisk lidelse — GDPR Art. 9
Erklæringens formål: Sygedagpenge, §7-opfølgning
Periode: 1. april 2026 – 30. juni 2026
Lægens underskrift: _______________________
Dato: 20. april 2026
Stempel: [Skanderborg Lægepraksis]

Binary file not shown.

@ -0,0 +1,25 @@
Mødereferat — Pædagogisk råd
==============================
Dato: 20. april 2026
Sted: Personalerummet, Gudenaaskolen
Ordstyrer: Skolelederen
Referent: Administrationen
Dagsorden:
1. Godkendelse af referat fra seneste møde
2. Orientering om skoleårets planlægning 2026/2027
3. Status på inklusion og trivselsundersøgelse
4. Eventuelt
Ad 1: Referatet fra mødet den 15. marts 2026 blev godkendt uden bemærkninger.
Ad 2: Skolelederen orienterede om planlægningen for det kommende skoleår.
Skemaerne for 0.-9. klasse offentliggøres i Aula senest 1. juni 2026.
Der er planlagt en fælles pædagogisk dag den 10. august 2026.
Ad 3: Trivselsundersøgelsen viste generelt gode resultater.
Inklusionsvejlederen præsenterer en handlingsplan på næste møde.
Ad 4: Intet til eventuelt.
Næste møde: Tirsdag den 19. maj 2026 kl. 14:00 i personalerummet.

@ -0,0 +1,31 @@
FAKTURA
=======
Leverandør: Kontor & Papir A/S
Industriparken 22, 8600 Silkeborg
CVR: 12345678
Kunde: Gudenaaskolen
Skolevej 1, 8680 Ry
EAN: 5790001234567
Fakturanr: 250372-0100
Fakturadato: 20. april 2026
Forfaldsdato: 20. maj 2026
Ordrenr: 020898-0203
Varenr: 150365-1102
Linjer:
---------------------------------------------------------------------------
Beskrivelse Antal Enhedspris Moms Total
---------------------------------------------------------------------------
Kopipapir A4 80g, pk/500 20 89,00 kr 20% 2.136,00 kr
Blækpatroner HP 305, sort 5 149,00 kr 20% 894,00 kr
Whiteboardmarker, ass. farver 3 49,95 kr 20% 179,82 kr
---------------------------------------------------------------------------
Subtotal ekskl. moms: 2.561,95 kr
Moms 25%: 640,49 kr
I alt inkl. moms: 3.202,44 kr
Betalingsbetingelser: Netto 30 dage
Bank: Jyske Bank, Reg. 7600, Konto 1234567

@ -0,0 +1,20 @@
Inventarliste — Klasselokale 7A
================================
Opdateret: 20. april 2026
Af: Teknisk servicepersonale
Rum-ID: 7A-GS-2026
Lokale: Bygning C, 1. sal
Inventar:
---------
Elevborde 32 stk (serienr. påtegnet under bordet)
Elevstole 32 stk (standard, justerbar højde)
Lærerbord 1 stk (inkl. skuff, lås medfølger)
Whiteboard 2 stk (160×120 cm)
Projektor 1 stk (Epson EB-W51, serienr. 150315-4321)
Projektordug 1 stk (180 cm, motor-betjent)
Gardinmotor 2 stk (fjernstyret)
Næste serviceeftersyn: Oktober 2026
Ansvarlig: Teknisk afdeling, Skanderborg Kommune

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.
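
The generator script for these binary fixtures (tests/fixtures/local_files/generate_fixtures.py) packs an MPEG frame header and a FLAC STREAMINFO block by hand. A minimal sketch for sanity-checking those byte layouts is given here. It is not part of the commit: `mp3_frame_info` and `flac_streaminfo` are hypothetical helper names, and the lookup tables cover MPEG-1 Layer III only, which is the one case the fixtures use.

```python
# Hypothetical helpers (not in the repository) that decode the hand-packed
# bytes used by the fixture generator, so the layout comments can be checked.

# MPEG-1 Layer III bitrates in kbps, indexed by the 4-bit bitrate field.
_BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                  128, 160, 192, 224, 256, 320, None]
# MPEG-1 sample rates in Hz, indexed by the 2-bit sample-rate field.
_SAMPLE_RATES = [44100, 48000, 32000, None]


def mp3_frame_info(header: bytes) -> dict:
    """Decode a 4-byte MPEG-1 Layer III frame header."""
    b0, b1, b2, b3 = header[:4]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        raise ValueError("no frame sync")
    if (b1 >> 3) & 0x3 != 0x3 or (b1 >> 1) & 0x3 != 0x1:
        raise ValueError("sketch only handles MPEG-1 Layer III")
    bitrate = _BITRATES_KBPS[b2 >> 4]
    sample_rate = _SAMPLE_RATES[(b2 >> 2) & 0x3]
    padding = (b2 >> 1) & 0x1
    return {
        "bitrate_kbps": bitrate,
        "sample_rate": sample_rate,
        # Channel mode: 0 stereo, 1 joint stereo, 2 dual channel, 3 mono.
        "channel_mode": b3 >> 6,
        # Layer III frame length in bytes, header included.
        "frame_len": 144 * bitrate * 1000 // sample_rate + padding,
    }


def flac_streaminfo(si: bytes) -> tuple:
    """Return (sample_rate, channels, bits_per_sample, total_samples)."""
    # Bytes 10..17 hold 20 + 3 + 5 + 36 = 64 bits, big-endian.
    packed = int.from_bytes(si[10:18], "big")
    sample_rate = packed >> 44
    channels = ((packed >> 41) & 0x7) + 1
    bits_per_sample = ((packed >> 36) & 0x1F) + 1
    total_samples = packed & ((1 << 36) - 1)
    return sample_rate, channels, bits_per_sample, total_samples


if __name__ == "__main__":
    print(mp3_frame_info(b"\xff\xfb\x90\x00"))
    si = bytearray(34)
    si[10:14] = bytes([0x0A, 0xC4, 0x40, 0xF0])
    print(flac_streaminfo(bytes(si)))
```

Decoding the header bytes `ff fb 90 00` this way gives a 417-byte frame at 128 kbps and 44100 Hz, which matches the 4 + 413 byte split in the script; the channel-mode field decodes to 0, which is stereo.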

@ -0,0 +1,347 @@
"""
Generate binary fixture files for the local-file GDPR scan test suite.

Run from repo root:
    source venv/bin/activate
    python tests/fixtures/local_files/generate_fixtures.py

Fixtures produced

Document fixtures (require python-docx + openpyxl):
    09_cpr_in_docx.docx        Word document with 2 CPR numbers              Flag
    13_cpr_in_xlsx.xlsx        Excel workbook with CPR numbers               Flag

Audio fixtures (require mutagen):
    14_audio_artist_pii.mp3    MP3 with artist/title tags (personal name)    Flag
    15_audio_artist_pii.flac   FLAC with artist/title Vorbis comments        Flag
    16_audio_no_pii.mp3        MP3 with no metadata tags                     No flag
    17_audio_no_pii.flac       FLAC with no metadata                         No flag

Video fixtures (require mutagen):
    18_video_gps.mp4           MP4 with GPS coordinates + artist tag         Flag
    19_video_no_pii.mp4        MP4 with no metadata tags                     No flag
"""
import struct
import tempfile
import os
from pathlib import Path
import sys

HERE = Path(__file__).parent


def _require(pkg, pip_name=None):
    """Import pkg, or exit with an install hint (pip_name if it differs from pkg)."""
    try:
        return __import__(pkg)
    except ImportError:
        print(f"Missing: {pkg} → pip install {pip_name or pkg}", file=sys.stderr)
        sys.exit(1)


openpyxl = _require("openpyxl")
docx = _require("docx", "python-docx")  # module is docx; PyPI package is python-docx
_require("mutagen")

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH


# ── 09_cpr_in_docx.docx ───────────────────────────────────────────────────────
def make_docx():
    doc = Document()
    doc.add_heading("Elevjournal — Gudenaaskolen", level=1)
    p = doc.add_paragraph()
    p.add_run("Dette dokument indeholder personoplysninger og er fortroligt.")
    p.runs[0].italic = True

    doc.add_heading("Elevoplysninger", level=2)
    # Use labelled paragraphs so CPR values are always preceded by ": ".
    # This avoids the _CPR_PREFIX_NOISE guard that fires when table-cell runs
    # are concatenated without a separator.
    fields = [
        ("Navn", "Magnus Lund Eriksen"),
        ("CPR-nummer", "010172-1019"),
        ("Klasse", "8B"),
        ("Adresse", "Egevej 3, 8680 Ry"),
        ("Telefon", "+45 40 12 34 56"),
        ("E-mail", "magnus.eriksen@elev.gudenaaskolen.dk"),
    ]
    for label, value in fields:
        p = doc.add_paragraph()
        run_label = p.add_run(f"{label}: ")
        run_label.bold = True
        p.add_run(value + " ")

    doc.add_heading("Forældrekontakt", level=2)
    doc.add_paragraph(
        "Forældrene er orienteret om elevens situation den 15. marts 2026. "
        "Begge forældre deltog i mødet. Næste opfølgning er planlagt til "
        "maj 2026."
    )

    doc.add_heading("Anden elev — tabel", level=2)
    doc.add_paragraph(
        "Nedenstående tabel viser en anden elev, der deler klasse med Magnus."
    )
    for label, value in [
        ("Navn", "Nora Bjerrum Nielsen"),
        ("Personnummer", "280490-0120"),
        ("Klasse", "8B"),
    ]:
        p = doc.add_paragraph()
        p.add_run(f"{label}: ").bold = True
        p.add_run(value + " ")

    doc.add_heading("Sagsbehandlernote", level=2)
    doc.add_paragraph(
        "Sagsbehandler: M. Andersen\n"
        "Dato: 20. april 2026\n"
        "Der er ikke fundet grundlag for yderligere foranstaltninger."
    )

    out = HERE / "09_cpr_in_docx.docx"
    doc.save(str(out))
    print(f"Written: {out.name}")


# ── 13_cpr_in_xlsx.xlsx ───────────────────────────────────────────────────────
def make_xlsx():
    wb = Workbook()

    # Sheet 1: Elevliste
    ws1 = wb.active
    ws1.title = "Elevliste"
    header_font = Font(bold=True, color="FFFFFF")
    header_fill = PatternFill("solid", fgColor="2B5F9E")
    headers = ["Klasse", "Navn", "CPR-nummer", "Adresse", "Forælder tlf", "Bemærkninger"]
    for col, h in enumerate(headers, 1):
        cell = ws1.cell(row=1, column=col, value=h)
        cell.font = header_font
        cell.fill = header_fill
        cell.alignment = Alignment(horizontal="center")
    students = [
        ("7A", "Magnus Lund Eriksen", "010172-1019", "Egevej 3, 8680 Ry", "+45 40 12 34 56", ""),
        ("7A", "Nora Bjerrum Nielsen", "280490-0120", "Møllevej 11, 8680 Ry", "+45 50 23 45 67", "Brillebærer"),
        ("7A", "Oliver Skov Madsen", "250372-0100", "Kirkegade 2, 8660 Skanderborg", "+45 60 34 56 78", ""),
        ("7B", "Rasmus Dal Kristensen", "150365-1102", "Rosenvej 5, 8680 Ry", "+45 21 56 78 90", ""),
        ("7B", "Sofie Holm Thomsen", "111111-1010", "Birkevej 22, 8660 Skanderborg", "+45 31 67 89 01", "Allergi: nødder"),
        ("7B", "Emil Sand Jensen", "010107-4102", "Hybenvej 7, 8680 Ry", "+45 41 78 90 12", ""),
    ]
    for row_i, row_data in enumerate(students, 2):
        for col_i, val in enumerate(row_data, 1):
            ws1.cell(row=row_i, column=col_i, value=val)
    # Size each column to its longest value plus padding.
    for col in ws1.columns:
        max_len = max(len(str(c.value or "")) for c in col)
        ws1.column_dimensions[col[0].column_letter].width = max_len + 4

    # Sheet 2: Medarbejdere
    ws2 = wb.create_sheet("Medarbejdere")
    emp_headers = ["ID", "Navn", "Personnummer", "Afdeling", "E-mail"]
    for col, h in enumerate(emp_headers, 1):
        cell = ws2.cell(row=1, column=col, value=h)
        cell.font = header_font
        cell.fill = header_fill
        cell.alignment = Alignment(horizontal="center")
    employees = [
        ("EMP-001", "Christian Bøgh Hansen", "150365-1102", "Ledelse", "c.hansen@gudenaaskolen.dk"),
        ("EMP-002", "Mette Dahl Andersen", "280490-0120", "Administration", "m.andersen@gudenaaskolen.dk"),
        ("EMP-003", "Søren Lykke Jakobsen", "010172-1019", "Pædagogik", "s.jakobsen@gudenaaskolen.dk"),
    ]
    for row_i, row_data in enumerate(employees, 2):
        for col_i, val in enumerate(row_data, 1):
            ws2.cell(row=row_i, column=col_i, value=val)
    for col in ws2.columns:
        max_len = max(len(str(c.value or "")) for c in col)
        ws2.column_dimensions[col[0].column_letter].width = max_len + 4

    out = HERE / "13_cpr_in_xlsx.xlsx"
    wb.save(str(out))
    print(f"Written: {out.name}")


# ── Audio / video helpers ─────────────────────────────────────────────────────
# Two silent MPEG1 Layer3 frames (128 kbps / 44100 Hz; the channel-mode bits
# in the last header byte are 00, which denotes stereo).
# mutagen needs at least 2 consecutive frame headers to confirm sync.
# 4-byte header + 413 bytes frame body = 417 bytes × 2 = 834 bytes total.
_MPEG_FRAMES = (b'\xff\xfb\x90\x00' + b'\x00' * 413) * 2


def _flac_block_header(block_type: int, data_len: int, last: bool = False) -> bytes:
    # 1 byte: last-block flag (top bit) + block type; 3 bytes: big-endian length.
    first = (0x80 if last else 0x00) | block_type
    return bytes([first, (data_len >> 16) & 0xFF, (data_len >> 8) & 0xFF, data_len & 0xFF])


def _vorbis_comment_block(comments: dict) -> bytes:
    # Vorbis comment lengths are little-endian 32-bit integers.
    vendor = b'GDPRScanner fixture'
    data = struct.pack('<I', len(vendor)) + vendor
    data += struct.pack('<I', len(comments))
    for key, value in comments.items():
        entry = f'{key}={value}'.encode('utf-8')
        data += struct.pack('<I', len(entry)) + entry
    return data


def _minimal_flac(comments: dict) -> bytes:
    """Return bytes for a valid minimal FLAC file with Vorbis comments."""
    # STREAMINFO (34 bytes): 44100 Hz, mono, 16-bit, 0 samples, zero MD5.
    si = bytearray(34)
    si[0:2] = struct.pack('>H', 4096)  # min block size
    si[2:4] = struct.pack('>H', 4096)  # max block size
    # bytes 4-9: min/max frame sizes = 0 (unknown)
    # Bits 80-99: sample_rate=44100 (0xAC44 in 20-bit field)
    # Bits 100-102: channels-1 = 0 (mono)
    # Bits 103-107: bits_per_sample-1 = 15 (16-bit)
    # Bits 108-143: total_samples = 0; bytes 14-17 remain zero
    si[10] = 0x0A  # 0000_1010: top 8 of 44100 in 20-bit field
    si[11] = 0xC4  # 1100_0100
    si[12] = 0x40  # bottom 4 of sample_rate | channels(000) | bps_msb(0)
    si[13] = 0xF0  # bps remaining 4 bits (1111) | top 4 of total_samples (0)
    vc = _vorbis_comment_block(comments)
    return (
        b'fLaC'
        + _flac_block_header(0, 34, last=not comments)  # STREAMINFO
        + bytes(si)
        + (_flac_block_header(4, len(vc), last=True) + vc if comments else b'')
    )


def _mp4_atom(name: bytes, data: bytes) -> bytes:
    # Atom = 4-byte big-endian size (header included) + 4-byte type + payload.
    return struct.pack('>I', 8 + len(data)) + name + data


def _minimal_mp4_base() -> bytes:
    """Return bytes for the smallest valid MPEG-4 container mutagen can tag."""
    # ftyp: identifies the file as M4A
    ftyp = _mp4_atom(
        b'ftyp',
        b'M4A ' + struct.pack('>I', 0) + b'M4A ' + b'mp42' + b'isom',
    )
    # mvhd version 0: 100 bytes of content (ISO 14496-12 §8.2.2)
    mvhd = bytearray(100)
    mvhd[0:4] = b'\x00\x00\x00\x00'  # version + flags
    # bytes 4-19: creation, modification, timescale, duration
    struct.pack_into('>IIII', mvhd, 4, 0, 0, 1000, 0)
    # The fields below start right after duration, at byte 20. The earlier
    # offsets (16, 20, 32) were 4 bytes short and overwrote duration.
    struct.pack_into('>I', mvhd, 20, 0x00010000)  # rate = 1.0
    struct.pack_into('>H', mvhd, 24, 0x0100)  # volume = 1.0
    # bytes 26-35: reserved (10 bytes, already zero)
    struct.pack_into('>9i', mvhd, 36,  # unity matrix
                     0x00010000, 0, 0, 0, 0x00010000, 0, 0, 0, 0x40000000)
    # bytes 72-95: pre-defined (24 bytes, already zero)
    struct.pack_into('>I', mvhd, 96, 0xFFFFFFFF)  # next_track_ID
    return ftyp + _mp4_atom(b'moov', _mp4_atom(b'mvhd', bytes(mvhd)))


def _mp4_with_tags(tags: dict) -> bytes:
    """Return bytes for a minimal MP4 with the given mutagen tag dict."""
    import mutagen.mp4
    tmp = tempfile.mktemp(suffix='.mp4')
    try:
        # mutagen tags files on disk, so write the base container first.
        with open(tmp, 'wb') as fh:
            fh.write(_minimal_mp4_base())
        f = mutagen.mp4.MP4(tmp)
        f.add_tags()
        for key, value in tags.items():
            f.tags[key] = [value]
        f.save()
        with open(tmp, 'rb') as fh:
            return fh.read()
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)


# ── 14_audio_artist_pii.mp3 ───────────────────────────────────────────────────
def make_mp3_pii():
    from mutagen.easyid3 import EasyID3
    tmp = tempfile.mktemp(suffix='.mp3')
    try:
        t = EasyID3()
        t['artist'] = ['Emma Slot Henriksen']
        t['title'] = ['Fortrolig optagelse — personalemøde']
        t['date'] = ['2026-04-21']
        t.save(tmp)
        with open(tmp, 'rb') as fh:
            id3_bytes = fh.read()
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)
    # ID3 header first, then the two silent MPEG frames.
    out = HERE / '14_audio_artist_pii.mp3'
    out.write_bytes(id3_bytes + _MPEG_FRAMES)
    print(f"Written: {out.name}")


# ── 15_audio_artist_pii.flac ──────────────────────────────────────────────────
def make_flac_pii():
    out = HERE / '15_audio_artist_pii.flac'
    out.write_bytes(_minimal_flac({
        'ARTIST': 'Emma Slot Henriksen',
        'TITLE': 'Fortrolig optagelse — personalemøde',
        'DATE': '2026-04-21',
    }))
    print(f"Written: {out.name}")


# ── 16_audio_no_pii.mp3 ───────────────────────────────────────────────────────
def make_mp3_no_pii():
from mutagen.easyid3 import EasyID3
tmp = tempfile.mktemp(suffix='.mp3')
try:
EasyID3().save(tmp) # empty ID3 header, no tags
with open(tmp, 'rb') as fh:
id3_bytes = fh.read()
finally:
if os.path.exists(tmp):
os.unlink(tmp)
out = HERE / '16_audio_no_pii.mp3'
out.write_bytes(id3_bytes + _MPEG_FRAMES)
print(f"Written: {out.name}")
# ── 17_audio_no_pii.flac ──────────────────────────────────────────────────────
def make_flac_no_pii():
out = HERE / '17_audio_no_pii.flac'
out.write_bytes(_minimal_flac({})) # no Vorbis comment block
print(f"Written: {out.name}")
# ── 18_video_gps.mp4 ─────────────────────────────────────────────────────────
def make_mp4_gps():
out = HERE / '18_video_gps.mp4'
out.write_bytes(_mp4_with_tags({
'©xyz': '+55.6761+012.5683+000.000/', # Copenhagen
'©ART': 'Emma Slot Henriksen',
'©nam': 'Optagelse fra skolegården',
}))
print(f"Written: {out.name}")
# ── 19_video_no_pii.mp4 ──────────────────────────────────────────────────────
def make_mp4_no_pii():
out = HERE / '19_video_no_pii.mp4'
out.write_bytes(_minimal_mp4_base()) # moov holds only mvhd (no udta/meta/ilst), so no tags
print(f"Written: {out.name}")
if __name__ == "__main__":
make_docx()
make_xlsx()
make_mp3_pii()
make_flac_pii()
make_mp3_no_pii()
make_flac_no_pii()
make_mp4_gps()
make_mp4_no_pii()
print("Done.")
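
The byte values written to `si[10]` through `si[13]` in the FLAC helper earlier in this file come from packing three STREAMINFO fields into one 32-bit word. The following is a minimal sketch of that arithmetic, assuming the standard FLAC layout (20-bit sample rate, 3-bit channel count minus one, 5-bit bits-per-sample minus one, then the top 4 bits of total_samples, here zero). `pack_rate_fields` is a hypothetical helper for illustration and is not part of the generator script.

```python
def pack_rate_fields(sample_rate: int, channels: int, bits_per_sample: int) -> bytes:
    """Pack STREAMINFO bytes 10-13 (hypothetical helper, total_samples top bits = 0)."""
    if not 0 < sample_rate < (1 << 20):
        raise ValueError("sample_rate must fit in 20 bits")
    if not 1 <= channels <= 8:
        raise ValueError("channels must be 1..8")
    if not 4 <= bits_per_sample <= 32:
        raise ValueError("bits_per_sample must be 4..32")
    word = (
        (sample_rate << 12)               # 20 bits: sample rate in Hz
        | ((channels - 1) << 9)           # 3 bits: channel count minus one
        | ((bits_per_sample - 1) << 4)    # 5 bits: bits per sample minus one
    )                                     # low 4 bits: top of total_samples (zero)
    return word.to_bytes(4, "big")


if __name__ == "__main__":
    # 44.1 kHz, mono, 16-bit: the values hard-coded in the fixture generator
    print(pack_rate_fields(44100, 1, 16).hex())
```

For 44100 Hz, one channel and 16 bits per sample this reproduces the four bytes `0A C4 40 F0` that the generator hard-codes.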


@@ -252,3 +252,36 @@ class TestFernet:
def test_decrypt_empty_returns_empty(self):
result = app_config._decrypt_password("")
assert result == ""
class TestSmtpConfigLegacyKeys:
"""SMTP config saved by the older settings tab used `user`/`starttls`;
readers expect `username`/`use_tls`. _load_smtp_config must normalise them."""
def test_legacy_keys_normalised_on_load(self, tmp_path, monkeypatch):
import json
p = tmp_path / "smtp.json"
p.write_text(json.dumps({
"host": "smtp.gmail.com", "port": 587,
"user": "netadmin@adm.example.dk", # legacy key
"starttls": True, # legacy key
"from_addr": "netadmin@adm.example.dk",
"recipients": ["a@example.dk"],
}), encoding="utf-8")
monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
cfg = app_config._load_smtp_config()
assert cfg["username"] == "netadmin@adm.example.dk"
assert cfg["use_tls"] is True
def test_canonical_keys_take_precedence(self, tmp_path, monkeypatch):
import json
p = tmp_path / "smtp.json"
p.write_text(json.dumps({
"username": "canonical@example.dk",
"user": "legacy@example.dk",
}), encoding="utf-8")
monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
cfg = app_config._load_smtp_config()
assert cfg["username"] == "canonical@example.dk"
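
The two tests above pin down a contract: legacy keys are mapped to their canonical names on load, and a canonical key wins when both are present. A minimal sketch of one way to satisfy that contract follows. `normalise_smtp_keys` and `_LEGACY_KEY_MAP` are hypothetical names, and dropping the legacy key after mapping is an assumption; the real `_load_smtp_config` may behave differently in details the tests do not cover.

```python
# Hypothetical mapping from legacy settings-tab keys to the canonical names.
_LEGACY_KEY_MAP = {"user": "username", "starttls": "use_tls"}


def normalise_smtp_keys(raw: dict) -> dict:
    """Return a copy of raw with legacy keys renamed; canonical keys take precedence."""
    cfg = dict(raw)
    for legacy, canonical in _LEGACY_KEY_MAP.items():
        if legacy in cfg:
            value = cfg.pop(legacy)          # assumption: the legacy key is dropped
            cfg.setdefault(canonical, value)  # keep an existing canonical value
    return cfg


if __name__ == "__main__":
    print(normalise_smtp_keys({"user": "legacy@example.dk", "starttls": True}))
```

Using `setdefault` rather than plain assignment is what makes the canonical value survive when a file contains both spellings.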


@@ -22,7 +22,7 @@ import checkpoint
@pytest.fixture(autouse=True)
def _isolate(tmp_path, monkeypatch):
"""Redirect all disk writes to a temp dir for each test."""
monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")


@@ -265,3 +265,71 @@ class TestExportImport:
tgt.import_db(str(export_path), mode="replace")
results = tgt.lookup_data_subject("290472-1234")
assert len(results) >= 1
# ─────────────────────────────────────────────────────────────────────────────
# Orphan-scan recovery (crash / kill / mid-scan restart)
# ─────────────────────────────────────────────────────────────────────────────
class TestOrphanScanRecovery:
def _start_unfinished_scan(self, db, item_id):
"""Begin a scan and save an item but never call finish_scan."""
sid = db.begin_scan({"sources": ["email"], "user_ids": []})
db.save_item(sid, _make_card(item_id=item_id))
return sid
def test_unfinished_scan_items_hidden_until_recovery(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
# Not finalised → invisible to the open-items view
assert tmp_db.get_open_items() == []
def test_recovery_finalises_and_reveals_items(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
self._start_unfinished_scan(tmp_db, "orphan-2")
recovered = tmp_db.finalize_orphan_scans()
assert recovered == 2
ids = {row["id"] for row in tmp_db.get_open_items()}
assert ids == {"orphan-1", "orphan-2"}
def test_recovery_leaves_finished_scans_untouched(self, tmp_db):
sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
tmp_db.save_item(sid, _make_card(item_id="done-1"))
tmp_db.finish_scan(sid, total_scanned=1)
before = tmp_db._connect().execute(
"SELECT finished_at FROM scans WHERE id=?", (sid,)
).fetchone()[0]
assert tmp_db.finalize_orphan_scans() == 0 # nothing to recover
after = tmp_db._connect().execute(
"SELECT finished_at FROM scans WHERE id=?", (sid,)
).fetchone()[0]
assert after == before # finished_at not rewritten
def test_recovery_is_idempotent(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
assert tmp_db.finalize_orphan_scans() == 1
assert tmp_db.finalize_orphan_scans() == 0
# ─────────────────────────────────────────────────────────────────────────────
# account_name persistence (user/group badge data)
# ─────────────────────────────────────────────────────────────────────────────
class TestAccountNamePersistence:
def test_account_name_round_trips(self, tmp_db):
sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
tmp_db.save_item(sid, _make_card(item_id="an-1")) # account_name="Test User"
tmp_db.finish_scan(sid, total_scanned=1)
row = [r for r in tmp_db.get_open_items() if r["id"] == "an-1"][0]
assert row.get("account_name") == "Test User"
def test_account_name_column_exists(self, tmp_db):
cols = [r[1] for r in tmp_db._connect().execute(
"PRAGMA table_info(flagged_items)").fetchall()]
assert "account_name" in cols

tests/test_google_scan.py

@@ -0,0 +1,311 @@
"""
Route and engine tests for the Google Workspace scan module.
Covers:
- GET /api/google/scan/users: auth guard, user list, error propagation
- POST /api/google/scan/start: auth guard, concurrency lock, successful start, lock release
- POST /api/google/scan/cancel: abort signal
- _run_google_scan: no-connector broadcast, CPR hit flagging, source_type tagging
"""
from __future__ import annotations
import threading
import time
from unittest.mock import MagicMock
import pytest
# ── Fixtures ──────────────────────────────────────────────────────────────────
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
@pytest.fixture()
def mock_google_connector(monkeypatch):
from routes import state
conn = MagicMock()
conn.list_users.return_value = []
monkeypatch.setattr(state, "google_connector", conn)
return conn
@pytest.fixture(autouse=True)
def clean_google_state():
yield
from routes import state
# Probe the Google scan lock without blocking and hand it straight back if it was free;
# a lock that a test left held is not released here
acquired = state._google_scan_lock.acquire(blocking=False)
if acquired:
state._google_scan_lock.release()
state._google_scan_abort.clear()
# ── GET /api/google/scan/users ────────────────────────────────────────────────
class TestGoogleScanUsers:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.get("/api/google/scan/users")
assert r.status_code == 401
assert r.json["error"] == "not connected"
def test_returns_user_list(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = [
{"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"},
]
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert len(r.json["users"]) == 1
assert r.json["users"][0]["email"] == "alice@test.dk"
def test_returns_empty_list_when_no_users(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = []
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert r.json["users"] == []
def test_connector_error_returns_500(self, client, mock_google_connector):
mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable")
r = client.get("/api/google/scan/users")
assert r.status_code == 500
assert "error" in r.json
# ── POST /api/google/scan/start ───────────────────────────────────────────────
class TestGoogleScanStart:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 401
assert "not connected" in r.json["error"]
def test_already_running_returns_409(self, client, mock_google_connector):
from routes import state
state._google_scan_lock.acquire()
try:
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 409
assert "already running" in r.json["error"]
finally:
state._google_scan_lock.release()
def test_starts_successfully(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert r.json["status"] == "started"
def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
state._google_scan_abort.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
client.post("/api/google/scan/start", json={})
assert not state._google_scan_abort.is_set()
def test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _fake_scan(opts):
time.sleep(0.02)
done.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05) # allow finally block to run
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan completed"
state._google_scan_lock.release()
@pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning")
def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _failing_scan(opts):
done.set()
raise RuntimeError("simulated crash")
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05)
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan raised an exception"
state._google_scan_lock.release()
# ── POST /api/google/scan/cancel ─────────────────────────────────────────────
class TestGoogleScanCancel:
def test_sets_abort_event(self, client):
from routes import state
state._google_scan_abort.clear()
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
assert state._google_scan_abort.is_set()
def test_idempotent_when_not_running(self, client):
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
# ── _run_google_scan engine ───────────────────────────────────────────────────
class TestRunGoogleScan:
"""
Unit-tests for _run_google_scan() called synchronously with all heavy
dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O.
"""
def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None):
import gdpr_scanner
import checkpoint
import scan_engine
import gdpr_db
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", conn)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
monkeypatch.setattr(gdpr_scanner, "_scan_bytes",
lambda data, name, **kw: scan_bytes_result or {
"cprs": [], "pii_counts": None, "emails": [], "phones": []
})
monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {})
monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None)
monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card)
monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None)
gdpr_scanner.flagged_items.clear()
return events
def _run(self, monkeypatch, conn, options, scan_bytes_result=None):
import gdpr_scanner
import routes.google_scan as gs
events = self._setup_mocks(monkeypatch, conn, scan_bytes_result)
gs._run_google_scan(options)
gdpr_scanner.flagged_items.clear()
return events
def test_no_connector_broadcasts_error_and_done(self, monkeypatch):
import gdpr_scanner
import routes.google_scan as gs
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", None)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
gs._run_google_scan({"sources": ["gmail"], "user_emails": ["a@b.dk"], "options": {}})
assert any(evt == "scan_error" for evt, _ in events)
assert any(evt == "google_scan_done" for evt, _ in events)
def test_gmail_item_with_cpr_is_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"),
]
cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert len(flagged) == 1
def test_gmail_item_source_type_is_gmail(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert flagged[0]["source_type"] == "gmail"
def test_gmail_item_without_pii_not_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"),
]
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
assert not any(evt == "scan_file_flagged" for evt, _ in events)
def test_gdrive_item_source_type_is_gdrive(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
conn.iter_drive_files.return_value = [
({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"]
assert len(gdrive) == 1
def test_scan_done_always_broadcast(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
done = [d for evt, d in events if evt == "google_scan_done"]
assert len(done) == 1
assert "flagged_count" in done[0]
assert "total_scanned" in done[0]
def test_scan_done_counts_are_correct(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "m1", "name": "a.txt", "size": 100}, b"x"),
({"id": "m2", "name": "b.txt", "size": 100}, b"y"),
]
cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
done = next(d for evt, d in events if evt == "google_scan_done")
assert done["total_scanned"] == 2
assert done["flagged_count"] == 2


@@ -0,0 +1,663 @@
"""
Route integration tests: security-sensitive paths and data-correctness contracts.
Covers:
- Viewer token CRUD and scope validation
- GET /api/db/flagged: role and user scope enforcement
- POST /api/db/disposition/bulk: only updates selected items
- Viewer PIN: set / verify / rate-limit / clear
- Interface PIN: set / gate / clear
- Scan lock: always released (even when run_scan raises)
- GET /api/db/sessions: basic shape
- Profile routes: CRUD and rename
"""
from __future__ import annotations
import time
from unittest.mock import MagicMock
import pytest
# ---------------------------------------------------------------------------
# Module-level app fixture (shared with test_routes.py via flask_app)
# ---------------------------------------------------------------------------
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
@pytest.fixture()
def db_patch(tmp_path, monkeypatch):
from gdpr_db import ScanDB
import routes.database, routes.export
db = ScanDB(str(tmp_path / "test.db"))
monkeypatch.setattr(routes.database, "_get_db", lambda: db)
monkeypatch.setattr(routes.database, "DB_OK", True)
monkeypatch.setattr(routes.export, "_get_db", lambda: db)
monkeypatch.setattr(routes.export, "DB_OK", True)
return db
@pytest.fixture()
def mock_connector(monkeypatch):
from routes import state
conn = MagicMock()
monkeypatch.setattr(state, "connector", conn)
return conn
@pytest.fixture(autouse=True)
def clean_state():
    from routes import state
    yield
    state.flagged_items.clear()
    # Probe the scan lock without blocking; hand it straight back if it was free
    if state._scan_lock.acquire(blocking=False):
        state._scan_lock.release()
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _seed_scan(db, items: list[dict]) -> int:
"""Create a completed scan and persist items. Returns the scan_id."""
scan_id = db.begin_scan({"sources": ["email"], "user_ids": [], "options": {}})
for item in items:
db.save_item(scan_id, item)
db.finish_scan(scan_id, total_scanned=len(items))
return scan_id
def _item(item_id: str, role: str = "staff", account_id: str = "") -> dict:
return {
"id": item_id,
"name": f"{item_id}.docx",
"source": "email",
"source_type": "email",
"account_id": account_id or f"{item_id}@school.dk",
"user_role": role,
"cpr_count": 1,
"face_count": 0,
"size_kb": 10,
"modified": "2025-01-01T00:00:00",
}
def _clear_viewer_pins():
"""Remove both viewer and interface PINs between tests."""
from app_config import clear_viewer_pin, clear_interface_pin
clear_viewer_pin()
clear_interface_pin()
# ---------------------------------------------------------------------------
# Viewer token CRUD
# ---------------------------------------------------------------------------
class TestViewerTokenCRUD:
def test_create_and_list(self, client):
r = client.post("/api/viewer/tokens",
json={"label": "Test token", "expires_days": 7})
assert r.status_code == 201
data = r.get_json()
assert "token" in data
tok = data["token"]
r2 = client.get("/api/viewer/tokens")
assert r2.status_code == 200
tokens = r2.get_json()
assert any(t["token"] == tok for t in tokens)
def test_delete_existing_token(self, client):
r = client.post("/api/viewer/tokens", json={"label": "to-delete"})
tok = r.get_json()["token"]
r2 = client.delete(f"/api/viewer/tokens/{tok}")
assert r2.status_code == 200
assert r2.get_json()["ok"] is True
r3 = client.get("/api/viewer/tokens")
tokens = r3.get_json()
assert not any(t["token"] == tok for t in tokens)
def test_delete_nonexistent_token_returns_404(self, client):
r = client.delete("/api/viewer/tokens/doesnotexist123")
assert r.status_code == 404
def test_validate_valid_token(self, client):
tok = client.post("/api/viewer/tokens", json={}).get_json()["token"]
r = client.post("/api/viewer/tokens/validate", json={"token": tok})
assert r.status_code == 200
assert r.get_json()["valid"] is True
def test_validate_invalid_token(self, client):
r = client.post("/api/viewer/tokens/validate",
json={"token": "notarealtoken00000000"})
assert r.status_code == 401
assert r.get_json()["valid"] is False
class TestViewerTokenScopeValidation:
def test_role_and_user_mutually_exclusive(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"role": "student", "user": "alice@school.dk"}
})
assert r.status_code == 400
assert "mutually exclusive" in r.get_json()["error"]
def test_invalid_role_value(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"role": "teacher"}
})
assert r.status_code == 400
assert "role" in r.get_json()["error"]
def test_user_email_must_contain_at(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"user": "notanemail"}
})
assert r.status_code == 400
assert "email" in r.get_json()["error"].lower()
def test_valid_role_scope_stored(self, client):
r = client.post("/api/viewer/tokens",
json={"scope": {"role": "student"}})
assert r.status_code == 201
assert r.get_json()["scope"] == {"role": "student"}
def test_valid_user_scope_stored(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {
"user": ["alice@m365.dk", "alice@gws.dk"],
"display_name": "Alice Smith",
}
})
assert r.status_code == 201
scope = r.get_json()["scope"]
assert scope["user"] == ["alice@m365.dk", "alice@gws.dk"]
assert scope["display_name"] == "Alice Smith"
# ---------------------------------------------------------------------------
# GET /api/db/flagged — scope enforcement
# ---------------------------------------------------------------------------
class TestFlaggedScopeEnforcement:
def test_no_scope_returns_all_items(self, client, db_patch):
_seed_scan(db_patch, [
_item("s1", role="student"),
_item("s2", role="staff"),
])
r = client.get("/api/db/flagged")
assert r.status_code == 200
ids = {row["id"] for row in r.get_json()}
assert "s1" in ids
assert "s2" in ids
def test_role_scope_student_excludes_staff(self, client, db_patch):
_seed_scan(db_patch, [
_item("r1", role="student"),
_item("r2", role="staff"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"role": "student"}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "r1" in ids
assert "r2" not in ids
def test_role_scope_staff_excludes_students(self, client, db_patch):
_seed_scan(db_patch, [
_item("t1", role="student"),
_item("t2", role="staff"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"role": "staff"}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "t2" in ids
assert "t1" not in ids
def test_user_scope_returns_only_matching_account_id(self, client, db_patch):
_seed_scan(db_patch, [
_item("u1", account_id="alice@m365.dk"),
_item("u2", account_id="bob@m365.dk"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "u1" in ids
assert "u2" not in ids
def test_user_scope_matches_both_platform_emails(self, client, db_patch):
# Same person — M365 UPN and GWS email both in scope
_seed_scan(db_patch, [
_item("p1", account_id="alice@m365.dk"),
_item("p2", account_id="alice@gws.dk"),
_item("p3", account_id="bob@m365.dk"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk", "alice@gws.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "p1" in ids
assert "p2" in ids
assert "p3" not in ids
def test_user_scope_case_insensitive(self, client, db_patch):
_seed_scan(db_patch, [_item("ci1", account_id="Alice@M365.dk")])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "ci1" in ids
def test_no_ref_returns_open_items_across_all_sessions(self, client, db_patch):
# Two scans in separate session windows. The default (no-ref) view must
# surface unactioned items from BOTH, not just the latest session.
old_id = _seed_scan(db_patch, [_item("o1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("o2")])
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert ids == {"o1", "o2"}
def test_no_ref_excludes_items_with_a_disposition(self, client, db_patch):
_seed_scan(db_patch, [_item("d1"), _item("d2")])
db_patch.set_disposition("d1", "kept")
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "d2" in ids # untouched → still open
assert "d1" not in ids # action taken → hidden
def test_no_ref_unreviewed_disposition_stays_open(self, client, db_patch):
_seed_scan(db_patch, [_item("u1")])
db_patch.set_disposition("u1", "unreviewed")
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "u1" in ids # 'unreviewed' status is not an action
def test_no_ref_dedupes_rescanned_item_to_latest(self, client, db_patch):
# Same item flagged by two scans → appears once.
old_id = _seed_scan(db_patch, [_item("k1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("k1")])
rows = [row for row in client.get("/api/db/flagged").get_json() if row["id"] == "k1"]
assert len(rows) == 1
def test_ref_param_loads_historical_session(self, client, db_patch):
# Push first scan >300 s into the past so it occupies its own session window.
old_id = _seed_scan(db_patch, [_item("h1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("h2")])
r = client.get(f"/api/db/flagged?ref={old_id}")
ids = {row["id"] for row in r.get_json()}
assert "h1" in ids
# h2 belongs to a different (newer) session window — must not appear
assert "h2" not in ids
# ---------------------------------------------------------------------------
# POST /api/db/disposition/bulk
# ---------------------------------------------------------------------------
class TestBulkDisposition:
def test_updates_selected_items(self, client, db_patch):
_seed_scan(db_patch, [_item("b1"), _item("b2"), _item("b3")])
r = client.post("/api/db/disposition/bulk", json={
"item_ids": ["b1", "b2"],
"status": "retain-legal",
})
assert r.status_code == 200
assert r.get_json()["saved"] == 2
assert db_patch.get_disposition("b1")["status"] == "retain-legal"
assert db_patch.get_disposition("b2")["status"] == "retain-legal"
def test_unselected_item_unchanged(self, client, db_patch):
_seed_scan(db_patch, [_item("c1"), _item("c2")])
client.post("/api/db/disposition/bulk", json={
"item_ids": ["c1"],
"status": "delete-scheduled",
})
d = db_patch.get_disposition("c2")
# c2 was not in the bulk request — must remain unreviewed
assert d is None or d.get("status", "unreviewed") == "unreviewed"
def test_missing_item_ids_returns_400(self, client, db_patch):
r = client.post("/api/db/disposition/bulk",
json={"status": "retain-legal"})
assert r.status_code == 400
def test_missing_status_returns_400(self, client, db_patch):
r = client.post("/api/db/disposition/bulk",
json={"item_ids": ["x"]})
assert r.status_code == 400
def test_without_db_returns_503(self, client, monkeypatch):
import routes.database
monkeypatch.setattr(routes.database, "DB_OK", False)
r = client.post("/api/db/disposition/bulk",
json={"item_ids": ["x"], "status": "retain-legal"})
assert r.status_code == 503
# ---------------------------------------------------------------------------
# Viewer PIN
# ---------------------------------------------------------------------------
class TestViewerPin:
def setup_method(self):
_clear_viewer_pins()
def teardown_method(self):
_clear_viewer_pins()
def test_status_no_pin(self, client):
r = client.get("/api/viewer/pin")
assert r.status_code == 200
assert r.get_json()["pin_set"] is False
def test_set_and_status_reflects_set(self, client):
client.post("/api/viewer/pin", json={"pin": "1234"})
        r = client.get("/api/viewer/pin")
        assert r.get_json()["pin_set"] is True

    def test_set_too_short_rejected(self, client):
        r = client.post("/api/viewer/pin", json={"pin": "123"})
        assert r.status_code == 400

    def test_set_too_long_rejected(self, client):
        r = client.post("/api/viewer/pin", json={"pin": "123456789"})
        assert r.status_code == 400

    def test_set_non_digits_rejected(self, client):
        r = client.post("/api/viewer/pin", json={"pin": "abcd"})
        assert r.status_code == 400

    def test_verify_correct_pin_sets_session(self, client):
        client.post("/api/viewer/pin", json={"pin": "4321"})
        r = client.post("/api/viewer/pin/verify", json={"pin": "4321"})
        assert r.status_code == 200
        assert r.get_json()["ok"] is True

    def test_verify_wrong_pin_returns_401(self, client):
        client.post("/api/viewer/pin", json={"pin": "4321"})
        r = client.post("/api/viewer/pin/verify", json={"pin": "9999"})
        assert r.status_code == 401

    def test_verify_rate_limit_after_5_failures(self, client):
        client.post("/api/viewer/pin", json={"pin": "5678"})
        from routes.viewer import _pin_attempts
        _pin_attempts.clear()
        for _ in range(5):
            client.post("/api/viewer/pin/verify", json={"pin": "0000"})
        r = client.post("/api/viewer/pin/verify", json={"pin": "0000"})
        assert r.status_code == 429
        _pin_attempts.clear()

    def test_change_pin_requires_current(self, client):
        client.post("/api/viewer/pin", json={"pin": "1111"})
        r = client.post("/api/viewer/pin",
                        json={"pin": "2222", "current_pin": "9999"})
        assert r.status_code == 403

    def test_change_pin_with_correct_current(self, client):
        client.post("/api/viewer/pin", json={"pin": "1111"})
        r = client.post("/api/viewer/pin",
                        json={"pin": "2222", "current_pin": "1111"})
        assert r.status_code == 200
        # Old PIN no longer valid
        r2 = client.post("/api/viewer/pin/verify", json={"pin": "1111"})
        assert r2.status_code == 401

    def test_clear_pin_requires_current(self, client):
        client.post("/api/viewer/pin", json={"pin": "3333"})
        r = client.delete("/api/viewer/pin", json={"current_pin": "0000"})
        assert r.status_code == 403

    def test_clear_pin_with_correct_current(self, client):
        client.post("/api/viewer/pin", json={"pin": "3333"})
        r = client.delete("/api/viewer/pin", json={"current_pin": "3333"})
        assert r.status_code == 200
        assert client.get("/api/viewer/pin").get_json()["pin_set"] is False


# ---------------------------------------------------------------------------
# Interface PIN
# ---------------------------------------------------------------------------

class TestInterfacePin:
    def setup_method(self):
        _clear_viewer_pins()

    def teardown_method(self):
        _clear_viewer_pins()

    def test_status_no_pin(self, client):
        r = client.get("/api/interface/pin")
        assert r.get_json()["pin_set"] is False

    def test_set_and_verify(self, client):
        r = client.post("/api/interface/pin", json={"pin": "7777"})
        assert r.status_code == 200
        # Gate is now active — authenticate before the status check
        with client.session_transaction() as sess:
            sess["interface_ok"] = True
        assert client.get("/api/interface/pin").get_json()["pin_set"] is True

    def test_non_digit_rejected(self, client):
        r = client.post("/api/interface/pin", json={"pin": "abcd"})
        assert r.status_code == 400

    def test_set_requires_current_when_set(self, client):
        client.post("/api/interface/pin", json={"pin": "7777"})
        with client.session_transaction() as sess:
            sess["interface_ok"] = True
        r = client.post("/api/interface/pin",
                        json={"pin": "8888", "current_pin": "0000"})
        assert r.status_code == 403

    def test_clear_requires_current(self, client):
        client.post("/api/interface/pin", json={"pin": "7777"})
        with client.session_transaction() as sess:
            sess["interface_ok"] = True
        r = client.delete("/api/interface/pin", json={"current_pin": "0000"})
        assert r.status_code == 403

    def test_clear_with_correct_current(self, client):
        client.post("/api/interface/pin", json={"pin": "7777"})
        with client.session_transaction() as sess:
            sess["interface_ok"] = True
        r = client.delete("/api/interface/pin", json={"current_pin": "7777"})
        assert r.status_code == 200
        assert client.get("/api/interface/pin").get_json()["pin_set"] is False


# ---------------------------------------------------------------------------
# Scan lock released on run_scan() exception
# ---------------------------------------------------------------------------

class TestScanLockReleasedOnError:
    def test_lock_released_when_run_scan_raises(self, client, mock_connector,
                                                monkeypatch):
        import scan_engine
        from routes import state

        def _boom(opts):
            raise RuntimeError("simulated scan failure")

        monkeypatch.setattr(scan_engine, "run_scan", _boom)
        r = client.post("/api/scan/start", json={"sources": ["email"]})
        assert r.status_code == 200
        # Wait for the background thread to finish and release the lock
        deadline = time.time() + 2.0
        while True:
            acquired = state._scan_lock.acquire(blocking=False)
            if acquired:
                state._scan_lock.release()
                break
            assert time.time() < deadline, "scan lock was never released after exception"
            time.sleep(0.05)


# ---------------------------------------------------------------------------
# GET /api/db/sessions
# ---------------------------------------------------------------------------

class TestDbSessions:
    def test_returns_list(self, client, db_patch):
        r = client.get("/api/db/sessions")
        assert r.status_code == 200
        assert isinstance(r.get_json(), list)

    def test_completed_scan_appears_in_sessions(self, client, db_patch):
        _seed_scan(db_patch, [_item("sess1")])
        r = client.get("/api/db/sessions")
        sessions = r.get_json()
        assert len(sessions) >= 1
        s = sessions[0]
        assert "ref_scan_id" in s
        assert "flagged_count" in s
        assert s["flagged_count"] == 1

    def test_sessions_ordered_newest_first(self, client, db_patch):
        # Create two scans >300 s apart so each forms its own session window.
        old_id = _seed_scan(db_patch, [_item("old1")])
        db_patch._connect().execute(
            "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
        )
        db_patch._connect().commit()
        _seed_scan(db_patch, [_item("new1")])
        sessions = client.get("/api/db/sessions").get_json()
        assert len(sessions) == 2
        # Newest session (highest ref_scan_id) must be first
        assert sessions[0]["ref_scan_id"] > sessions[1]["ref_scan_id"]


# ---------------------------------------------------------------------------
# Profile routes
# ---------------------------------------------------------------------------

class TestProfileRoutes:
    """
    Tests for GET /api/profiles, POST /api/profiles/save,
    GET /api/profiles/get, and POST /api/profiles/delete.

    Each test monkeypatches the profile storage path to a tmp directory so
    tests are fully isolated from the real ~/.gdprscanner/settings.json.
    """

    @pytest.fixture(autouse=True)
    def _isolate(self, tmp_path, monkeypatch):
        import app_config
        monkeypatch.setattr(app_config, "_SETTINGS_PATH", tmp_path / "settings.json")

    def test_list_returns_empty_list_initially(self, client):
        r = client.get("/api/profiles")
        assert r.status_code == 200
        assert r.get_json()["profiles"] == []

    def test_save_missing_name_returns_400(self, client):
        r = client.post("/api/profiles/save", json={"sources": ["email"]})
        assert r.status_code == 400
        assert "error" in r.get_json()

    def test_save_creates_profile_and_returns_it(self, client):
        r = client.post("/api/profiles/save", json={
            "id": "", "name": "Alpha", "sources": ["email"], "options": {}
        })
        assert r.status_code == 200
        data = r.get_json()
        assert data["status"] == "saved"
        assert data["profile"]["name"] == "Alpha"
        assert data["profile"]["id"]  # server assigned a non-empty id

    def test_saved_profile_appears_in_list(self, client):
        client.post("/api/profiles/save", json={"name": "Beta", "sources": [], "options": {}})
        profiles = client.get("/api/profiles").get_json()["profiles"]
        assert any(p["name"] == "Beta" for p in profiles)

    def test_rename_updates_name_in_list(self, client):
        """Regression: _pmgmtSaveFullEdit renames the copy — the API must
        persist the new name so loadProfiles() returns fresh data for the
        left-column re-render."""
        r = client.post("/api/profiles/save", json={
            "id": "", "name": "LOCAL-TEST (copy)", "sources": [], "options": {}
        })
        profile_id = r.get_json()["profile"]["id"]
        # Simulate the user renaming the copy in the editor and clicking Save
        r2 = client.post("/api/profiles/save", json={
            "id": profile_id, "name": "LOCAL-TEST-2", "sources": [], "options": {}
        })
        assert r2.status_code == 200
        assert r2.get_json()["profile"]["name"] == "LOCAL-TEST-2"
        profiles = client.get("/api/profiles").get_json()["profiles"]
        names = [p["name"] for p in profiles]
        assert "LOCAL-TEST-2" in names
        assert "LOCAL-TEST (copy)" not in names

    def test_get_by_id(self, client):
        r = client.post("/api/profiles/save", json={
            "id": "fixed-id-1", "name": "Gamma", "sources": [], "options": {}
        })
        profile_id = r.get_json()["profile"]["id"]
        r2 = client.get(f"/api/profiles/get?id={profile_id}")
        assert r2.status_code == 200
        assert r2.get_json()["profile"]["name"] == "Gamma"

    def test_get_nonexistent_returns_404(self, client):
        r = client.get("/api/profiles/get?id=does-not-exist")
        assert r.status_code == 404

    def test_delete_removes_profile(self, client):
        client.post("/api/profiles/save", json={"name": "ToDelete", "sources": [], "options": {}})
        r = client.post("/api/profiles/delete", json={"name": "ToDelete"})
        assert r.status_code == 200
        assert r.get_json()["status"] == "deleted"
        profiles = client.get("/api/profiles").get_json()["profiles"]
        assert not any(p["name"] == "ToDelete" for p in profiles)

    def test_delete_nonexistent_returns_not_found(self, client):
        r = client.post("/api/profiles/delete", json={"name": "Ghost"})
        assert r.status_code == 200
        assert r.get_json()["status"] == "not_found"

    def test_delete_missing_key_returns_400(self, client):
        r = client.post("/api/profiles/delete", json={})
        assert r.status_code == 400


@ -97,6 +97,22 @@ class TestScanStatus:
        assert "scan_id" in data
        assert data["scan_id"] is None

    def test_idle_reports_google_not_running(self, client):
        # The refresh/restore path relies on google_running being reported
        # separately — running alone misses live Google scans.
        data = client.get("/api/scan/status").get_json()
        assert data["google_running"] is False

    def test_google_lock_held_reports_google_running(self, client):
        from routes import state
        assert state._google_scan_lock.acquire(blocking=False)
        try:
            data = client.get("/api/scan/status").get_json()
            assert data["google_running"] is True
            assert data["running"] is False  # M365/file lock still free
        finally:
            state._google_scan_lock.release()


# ---------------------------------------------------------------------------
# /api/scan/start

tests/test_updates.py Normal file

@ -0,0 +1,222 @@
"""
Tests for the software-update routes (routes/updates.py).
All git interaction is mocked: no test touches the real repository or
the network, and none restarts the process.
"""
from __future__ import annotations

import subprocess

import pytest


@pytest.fixture(scope="module")
def flask_app():
    import gdpr_scanner
    gdpr_scanner.app.config["TESTING"] = True
    return gdpr_scanner.app


@pytest.fixture()
def client(flask_app):
    with flask_app.test_client() as c:
        yield c


def _cp(returncode=0, stdout="", stderr=""):
    return subprocess.CompletedProcess(args=[], returncode=returncode,
                                       stdout=stdout, stderr=stderr)


def _fake_git(*, local="aaaaaaa1", remote="aaaaaaa1", branch="main",
              fetch_rc=0, dirty=False, reqs_changed=False, merge_rc=0,
              commits=""):
    """Build a _git() replacement dispatching on the git subcommand."""
    calls = []

    def fake(*args, timeout=None):
        calls.append(args)
        if args[:2] == ("rev-parse", "--abbrev-ref"):
            return _cp(stdout=branch + "\n")
        if args == ("rev-parse", "HEAD"):
            return _cp(stdout=local + "\n")
        if args[0] == "rev-parse":
            return _cp(stdout=remote + "\n")
        if args[0] == "fetch":
            return _cp(returncode=fetch_rc, stderr="fetch failed" if fetch_rc else "")
        if args[0] == "log":
            return _cp(stdout=commits)
        if args[0] == "diff-index":
            return _cp(returncode=1 if dirty else 0)
        if args[0] == "diff":
            return _cp(returncode=1 if reqs_changed else 0)
        if args[0] == "merge":
            return _cp(returncode=merge_rc, stderr="not a fast-forward" if merge_rc else "")
        if args[0] == "stash":
            return _cp()
        raise AssertionError(f"unexpected git call: {args}")

    fake.calls = calls
    return fake


@pytest.fixture(autouse=True)
def supported(monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_supported", lambda: True)


@pytest.fixture(autouse=True)
def no_audit(monkeypatch):
    import gdpr_db
    monkeypatch.setattr(gdpr_db, "log_audit_event", lambda *a, **k: None)


# ── /api/update/check ─────────────────────────────────────────────────────────

def test_check_unsupported(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_supported", lambda: False)
    r = client.get("/api/update/check")
    assert r.status_code == 200
    assert r.get_json() == {"supported": False}


def test_check_up_to_date(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git())
    d = client.get("/api/update/check").get_json()
    assert d["supported"] and d["up_to_date"]
    assert d["commits"] == []


def test_check_update_available(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(
        local="aaaaaaa1", remote="bbbbbbb2",
        commits="bbbbbbb2 Fix thing\nccccccc3 Add thing\n"))
    d = client.get("/api/update/check").get_json()
    assert d["up_to_date"] is False
    assert d["current"] == "aaaaaaa"
    assert d["latest"] == "bbbbbbb"
    assert len(d["commits"]) == 2


def test_check_fetch_failure(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(fetch_rc=1))
    d = client.get("/api/update/check").get_json()
    assert d["supported"] is True
    assert "fetch failed" in d["error"]


# ── /api/update/apply ─────────────────────────────────────────────────────────

def test_apply_up_to_date_is_noop(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git())
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    d = r.get_json()
    assert d["ok"] is True and d["updated"] is False


def test_apply_refused_while_scan_running(client, monkeypatch):
    import routes.updates as upd
    from routes import state
    monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2"))
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    assert state._scan_lock.acquire(blocking=False)
    try:
        r = client.post("/api/update/apply")
    finally:
        state._scan_lock.release()
    assert r.status_code == 409
    assert r.get_json()["code"] == "scan_running"


def test_apply_happy_path(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", commits="bbbbbbb2 Fix\n")
    monkeypatch.setattr(upd, "_git", fake)
    restarts = []
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: restarts.append(1))
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    d = r.get_json()
    assert d["ok"] and d["updated"] and d["restarting"]
    assert d["from"] == "aaaaaaa" and d["to"] == "bbbbbbb"
    assert restarts == [1]
    assert ("merge", "--ff-only", "origin/main") in fake.calls
    # tree was clean — no stash
    assert not any(c[0] == "stash" for c in fake.calls)


def test_apply_stashes_dirty_tree(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", dirty=True)
    monkeypatch.setattr(upd, "_git", fake)
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    assert any(c[0] == "stash" for c in fake.calls)


def test_apply_merge_failure(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2", merge_rc=1))
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    r = client.post("/api/update/apply")
    assert r.status_code == 409
    d = r.get_json()
    assert d["code"] == "merge_failed"
    assert "fast-forward" in d["error"]


def test_apply_installs_requirements_when_changed(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", reqs_changed=True)
    monkeypatch.setattr(upd, "_git", fake)
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
    pip_calls = []
    monkeypatch.setattr(upd.subprocess, "run",
                        lambda cmd, **kw: pip_calls.append(cmd) or _cp())
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    assert len(pip_calls) == 1
    assert "pip" in pip_calls[0] and "-r" in pip_calls[0]


# ── Restart fd hygiene ────────────────────────────────────────────────────────

def test_mark_fds_cloexec_unmarks_inheritable_socket():
    """Werkzeug sets the listening socket inheritable; the restart must undo
    that or the socket leaks through execv and squats on the port."""
    import socket
    import routes.updates as upd
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.set_inheritable(True)
        assert s.get_inheritable() is True
        upd._mark_fds_cloexec()
        assert s.get_inheritable() is False
    finally:
        s.close()


# ── /api/update/settings ──────────────────────────────────────────────────────

def test_settings_roundtrip(client, monkeypatch):
    import routes.updates as upd
    store = {"auto_update": False}
    monkeypatch.setattr(upd, "get_update_config", lambda: dict(store))
    monkeypatch.setattr(upd, "save_update_config",
                        lambda v: store.__setitem__("auto_update", bool(v)))
    d = client.get("/api/update/settings").get_json()
    assert d == {"supported": True, "auto_update": False}
    r = client.post("/api/update/settings", json={"auto_update": True})
    assert r.get_json() == {"ok": True}
    assert store["auto_update"] is True
    d = client.get("/api/update/settings").get_json()
    assert d["auto_update"] is True
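
The fd-hygiene test above only pins down the observable behaviour: after the helper runs, a socket that was inheritable no longer is. A minimal sketch of one way to get that behaviour follows. It is not the project's actual `_mark_fds_cloexec`; the name `mark_fds_cloexec`, the `first_fd`/`max_fd` parameters and the 1024 upper bound are assumptions made for illustration.

```python
import os


def mark_fds_cloexec(first_fd: int = 3, max_fd: int = 1024) -> None:
    """Clear the inheritable flag on every open descriptor in [first_fd, max_fd).

    Hypothetical helper: stdin/stdout/stderr (0-2) are skipped so the
    re-executed process keeps its standard streams.
    """
    for fd in range(first_fd, max_fd):
        try:
            os.set_inheritable(fd, False)
        except OSError:
            # Descriptor is not open; nothing to mark.
            continue
```

Called immediately before `os.execv`, a loop like this keeps the old listening socket from surviving into the new process image, which is the leak the test's docstring describes.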

update_gdpr.sh Executable file

@ -0,0 +1,83 @@
#!/usr/bin/env bash
# GDPRScanner self-update script.
#
# Fast-forwards this checkout to the tip of the tracked branch on origin,
# reinstalls dependencies if requirements.txt changed, and restarts the
# systemd service if one is installed.
# Safe to run from cron: logs one line and exits 0 when already up to date,
# and auto-stashes local hotfixes instead of aborting the merge.
#
# Usage:
#   ./update_gdpr.sh          # update if origin has new commits
#   ./update_gdpr.sh --check  # report status only (fetches from origin,
#                             # leaves the working tree untouched)
#
# Environment:
#   GDPR_BRANCH   branch to track (default: main)
#   GDPR_SERVICE  systemd unit to restart (default: gdprscanner, if it exists)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BRANCH="${GDPR_BRANCH:-main}"
SERVICE="${GDPR_SERVICE:-gdprscanner}"

log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"; }

cd "$SCRIPT_DIR"

if [ ! -d .git ]; then
    log "ERROR: $SCRIPT_DIR is not a git checkout — cannot self-update."
    exit 1
fi

git fetch origin "$BRANCH" --quiet
LOCAL="$(git rev-parse HEAD)"
REMOTE="$(git rev-parse "origin/$BRANCH")"

if [ "$LOCAL" = "$REMOTE" ]; then
    log "Already up to date ($(git describe --always HEAD))."
    exit 0
fi

log "Update available: $(git rev-parse --short HEAD) -> $(git rev-parse --short "$REMOTE")"
git log --oneline "HEAD..origin/$BRANCH" | sed 's/^/ /'

if [ "${1:-}" = "--check" ]; then
    exit 0
fi

# Local edits (e.g. a hotfix applied directly on the server) would make the
# merge abort. Stash them so the update proceeds; the stash is kept so
# nothing is lost.
if ! git diff-index --quiet HEAD --; then
    log "Local changes detected — stashing:"
    git diff --stat HEAD | sed 's/^/ /'
    git stash push --quiet -m "update_gdpr.sh auto-stash $(date '+%Y-%m-%d %H:%M:%S')"
    log "Recover later with: git stash show -p / git stash pop"
fi

REQS_CHANGED=false
if ! git diff --quiet "HEAD..origin/$BRANCH" -- requirements.txt; then
    REQS_CHANGED=true
fi

# Fast-forward only: the server checkout must never diverge from origin.
git merge --ff-only --quiet "origin/$BRANCH"
log "Updated to $(git rev-parse --short HEAD)."

if [ "$REQS_CHANGED" = true ]; then
    log "requirements.txt changed — updating dependencies..."
    "$SCRIPT_DIR/venv/bin/pip" install --quiet -r requirements.txt
    log "Dependencies updated."
fi

if command -v systemctl >/dev/null 2>&1 \
    && systemctl list-unit-files --type=service 2>/dev/null | grep -q "^$SERVICE\.service"; then
    log "Restarting $SERVICE.service..."
    systemctl restart "$SERVICE"
    log "Service restarted."
else
    log "No systemd unit '$SERVICE' found — restart GDPRScanner manually."
fi

log "Done."
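
The fetch, compare and log steps at the top of the script can be mirrored in Python roughly as follows. This is a sketch under stated assumptions, not the project's `routes/updates.py`: `run_git` and `check_for_update` are hypothetical names, and the dictionary keys and 7-character short hashes are modelled on what the `/api/update/check` tests assert. Passing the git runner in as a parameter is what lets a test substitute a fake, as `_fake_git` does in the test file.

```python
import subprocess
from typing import Callable


def run_git(*args: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Hypothetical thin wrapper: run git and capture its text output."""
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, timeout=timeout)


def check_for_update(git: Callable[..., subprocess.CompletedProcess] = run_git,
                     branch: str = "main") -> dict:
    """Report whether origin/<branch> is ahead of HEAD, changing nothing locally
    apart from the remote-tracking ref updated by the fetch."""
    fetch = git("fetch", "origin", branch, "--quiet")
    if fetch.returncode != 0:
        return {"up_to_date": None, "error": fetch.stderr.strip()}

    local = git("rev-parse", "HEAD").stdout.strip()
    remote = git("rev-parse", f"origin/{branch}").stdout.strip()
    if local == remote:
        return {"up_to_date": True, "commits": []}

    # One line per commit that origin has and the local checkout lacks.
    log = git("log", "--oneline", f"HEAD..origin/{branch}").stdout
    return {
        "up_to_date": False,
        "current": local[:7],   # assumed short-hash length
        "latest": remote[:7],
        "commits": log.splitlines(),
    }
```

Calling `check_for_update()` with no arguments would shell out to the real `git` in the current directory; the tests avoid that by injecting a stand-in runner.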