Compare commits

72 Commits
latest ... main

Author SHA1 Message Date
StyxX65
efbbeb7306 Restore M365Connector.delete_message (was an orphaned method body)
The def line for delete_message had been lost, leaving its body as
unreachable dead code at the end of _delete() and no delete_message
attribute on the connector. Deleting an Outlook message therefore failed
with "'M365Connector' object has no attribute 'delete_message'". Restored
the method (soft-delete: move to Deleted Items, fall back to DELETE).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:43:46 +02:00
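The soft-delete behaviour described in efbbeb7306 (move to Deleted Items, fall back to DELETE) could look roughly like the sketch below. It is not the repository's code: the `request` transport callable, the function signature, and the per-user URL layout are assumptions made for illustration. Only the Graph `move` action with the well-known folder name `deleteditems` and the plain `DELETE` on the message resource are taken from the public Graph API.

```python
from typing import Callable, Optional

GRAPH = "https://graph.microsoft.com/v1.0"

# Hypothetical transport: (method, url, json_body) -> HTTP status code.
Request = Callable[[str, str, Optional[dict]], int]


def delete_message(request: Request, user_id: str, message_id: str) -> str:
    """Soft-delete a message; fall back to a hard DELETE if the move fails."""
    base = f"{GRAPH}/users/{user_id}/messages/{message_id}"

    # Soft delete: move the message to the well-known Deleted Items folder.
    status = request("POST", f"{base}/move", {"destinationId": "deleteditems"})
    if 200 <= status < 300:
        return "moved"

    # Fallback: remove the message outright.
    status = request("DELETE", base, None)
    if 200 <= status < 300:
        return "deleted"

    raise RuntimeError(f"delete failed with HTTP {status}")
```

Passing the transport in as a callable keeps the sketch testable without network access; a real connector would presumably use its own authenticated session instead.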
StyxX65
54f8848e30 Document renderGrid landing-card hiding in static/js/CLAUDE.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:49:43 +02:00
StyxX65
8a446509c6 Hide landing/last-scan card whenever results render
The live scan_file_flagged handler showed the grid but never hid
#emptyState / #lastScanSummary, so when a scan ran with the landing
card visible, results appeared underneath it until a manual refresh
(which re-ran loadOpenItems and cleared it). Hide both panels in
renderGrid whenever files are present, covering every render path
(live SSE, open-items load, history, filters).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:45:50 +02:00
StyxX65
d55778ab35 Release 1.7.9: changelog + manual updates
Document this cycle's changes: open-items default results view,
interrupted-scan recovery, restored user/group badges, the SMTP
username-key fix, and the new "always send via SMTP" toggle. Stamp
manuals (EN/DA) to 1.7.9.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:36:41 +02:00
StyxX65
874c3ccec1 Add "prefer SMTP" toggle to skip Microsoft Graph for email
When the M365 connector is connected the app always tries Graph first,
and a Graph 202 ends the send — so report mail to recipients Exchange
silently drops (Google-hosted subdomains of the O365 domain) never
reaches them, even with working SMTP configured.

New prefer_smtp flag gates all three Graph branches (smtp_test,
send_report, _maybe_send_auto_email) so they go straight to SMTP. UI
toggle #st-smtpPreferSmtp in Settings → E-mailrapport, saved/loaded by
scheduler.js, with da/de/en strings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:30:45 +02:00
StyxX65
526e2b0b78 Fix SMTP auth: settings tab saved wrong config keys
The Settings → E-mailrapport tab (scheduler.js) saved the SMTP username
as `user` and TLS flag as `starttls`, but every backend reader expects
`username`/`use_tls` (routes/email.py). Result: username was always
empty, server.login() was skipped, and the SMTP server rejected the
send — surfacing as a misleading "authentication failed" message even
with a valid App Password. The bug was latent because Graph is preferred
whenever M365 is connected, so the SMTP path was rarely exercised.

- scheduler.js: send/load canonical keys (username, use_tls). The
  send-report modal (scan.js) already used these.
- _load_smtp_config(): normalise legacy user→username / starttls→use_tls
  so configs saved before the fix work without re-entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:25:15 +02:00
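The legacy-key normalisation that 526e2b0b78 adds to `_load_smtp_config()` can be illustrated with a small sketch. The function name `normalise_smtp_config` and the rule that a non-empty canonical value wins over a legacy one are assumptions; the key pairs (`user` to `username`, `starttls` to `use_tls`) come from the commit message.

```python
def normalise_smtp_config(cfg: dict) -> dict:
    """Return a copy of cfg that uses the canonical keys username / use_tls."""
    out = dict(cfg)
    # Legacy keys written by the old Settings tab -> canonical backend keys.
    for legacy, canonical in (("user", "username"), ("starttls", "use_tls")):
        if legacy in out:
            legacy_value = out.pop(legacy)
            # Assumption: an existing, non-empty canonical value takes priority.
            if out.get(canonical) in (None, ""):
                out[canonical] = legacy_value
    return out
```

Running the normalisation on every load, rather than once as a migration, means configs saved before the fix keep working without being rewritten on disk.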
StyxX65
b661a94f98 Restore user/group badges on DB-loaded result cards
The card badge only rendered when f.account_name was set, and the
group (role) badge was nested inside that same check. But save_item
never persisted account_name — only account_id (a GUID) and user_role.
Live SSE cards carried account_name so badges showed during a scan;
now that the grid loads finalized scans from the DB, the gap is exposed
and both badges vanish for earlier scans.

- Persist account_name (migration 11 + save_item) so future scans show
  the user badge. Both M365 and Google cards already carry it.
- _accountPill() in results.js drives the group badge off user_role
  alone (shows for legacy rows) and resolves a best-effort user label:
  account_name → S._allUsers (id/email) → email-style account_id → omit.
  Both card layouts share the one helper.

Legacy rows still lack account_name (never captured), but now show the
group badge and a resolved/email user label where possible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:15:19 +02:00
StyxX65
29d9168643 Recover unfinished scans so their items aren't stranded
get_session_items / get_open_items / latest_scan_id all require
finished_at IS NOT NULL, but the M365 and Google engines return early
on abort (skipping finish_scan) and a process kill mid-scan (deploy,
OOM, crash) never reaches it either. Result on prod: 41/42 scans had
finished_at NULL, so 291 already-saved flagged items were invisible —
the grid showed nothing.

- finalize_orphan_scans(): finalises every finished_at-NULL scan; runs
  once at startup before the scheduler (nothing is scanning at boot, so
  any unfinished scan is dead). Recovers existing stranded items and
  guards against future mid-scan restarts.
- run_scan: finalise the DB scan on the abort early-return too, so a
  stopped scan's items stay visible without waiting for a restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:51:22 +02:00
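The startup recovery described in 29d9168643 amounts to stamping `finished_at` on every scan row that never reached `finish_scan`. A minimal sketch, assuming a `scans` table with `id`, `started_at`, and `finished_at` columns (the real schema is not shown in this comparison and may differ):

```python
import sqlite3
import time


def finalize_orphan_scans(conn: sqlite3.Connection) -> int:
    """Stamp finished_at on every scan that was never finalised."""
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (time.time(),),
    )
    conn.commit()
    return cur.rowcount  # number of scans recovered


def demo() -> tuple:
    """Build a throwaway DB with two stranded scans and recover them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scans ("
        "id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
    )
    conn.executemany(
        "INSERT INTO scans (started_at, finished_at) VALUES (?, ?)",
        [(1.0, 2.0), (3.0, None), (5.0, None)],
    )
    recovered = finalize_orphan_scans(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM scans WHERE finished_at IS NULL"
    ).fetchone()[0]
    return recovered, remaining
```

This is only safe when nothing is scanning, which is why the commit runs it once at boot, before the scheduler starts.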
StyxX65
7bf589bf7a Update ZORAXY_SETUP.md 2026-06-22 09:21:08 +02:00
StyxX65
68076eba52 Show all open (unactioned) items by default, not just the last scan
The default results view loaded only the latest scan session (±300s
window), so items dropped out of sight once a newer scan started — and
a long scheduled scan could show little or nothing on browser open.

Add get_open_items(): every flagged item with no disposition (or status
'unreviewed') across all scans, deduped by id to the latest finished
scan. GET /api/db/flagged now serves it when no ?ref is given; ?ref=N
still loads a specific past session. Frontend loadHistorySession(null)
routes to a new loadOpenItems() loader. Rename the banner button to
"Open items" (da/de/en).

get_session_items() default is unchanged — export.py and
scan_scheduler.py still rely on latest-session for the current scan's
report/email.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:19:55 +02:00
StyxX65
67f66c8441 Document self-update system and related changes in CLAUDE.md
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 12:16:14 +02:00
StyxX65
8bb482925f Release 1.7.8
- CHANGELOG: cut the 1.7.8 release (dated 2026-06-16); reset Unreleased.
- VERSION: 1.7.7 -> 1.7.8.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 11:56:12 +02:00
StyxX65
f84c8516df Reliably restore last session on refresh after a server restart
The page-load restore was one-shot and bailed when a completed scan's
replayed scan_phase left a running flag set; sse_replay_done (the other
retry) only fires for a non-empty replay buffer, which is empty after a
restart — so refreshing post-update showed a blank grid despite the
results being in the DB. The watchdog now retries the restore on each
4s poll while nothing is shown and no scan runs, clearing stale flags
first. /api/scan/status also reports google_running separately so a
refresh during a live Google scan is no longer treated as idle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-16 11:53:07 +02:00
StyxX65
9fd1aa1f8a Manuals: describe new share-link create flow
After Create the form clears and the new link appears highlighted in
the Active links list, copied from there — not from a preview row.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:52:22 +02:00
StyxX65
da356fb310 Release 1.7.7
- CHANGELOG: cut the 1.7.7 release (dated 2026-06-15); reset Unreleased.
- VERSION: 1.7.6 -> 1.7.7.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:12:00 +02:00
StyxX65
bdba80e72d Remove stale link preview from share modal after create
The generated-link "Copy link:" row stayed visible after creating,
looking like the form hadn't reset — but the new link was already in
the Active links list with its own Copy button. Drop the redundant
preview row; on create, reset the form and briefly highlight the new
entry in the active list. Removes the now-dead shareNewLinkRow markup
and copyShareLink().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:11:03 +02:00
StyxX65
c26dd7d320 Add Zoraxy HTTPS setup guide, correct SECURITY.md bind address
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:20:33 +02:00
StyxX65
841311a6bd Release 1.7.6
- CHANGELOG: cut the 1.7.6 release (dated 2026-06-11); reset Unreleased.
- VERSION: 1.7.5 -> 1.7.6.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:02:44 +02:00
StyxX65
dd19be8bbf Close leaked listening socket on update restart
Werkzeug sets its server socket inheritable unconditionally, so the
os.execv restart carried it into the new process as a zombie listener:
one PID listening on both 5100 (never accepted) and 5101 (the real
server). Mark all fds above stderr close-on-exec before exec'ing so
the old socket dies and the new server rebinds the original port.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:01:17 +02:00
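The fix in dd19be8bbf (mark all descriptors above stderr close-on-exec before `os.execv`) can be sketched as follows. The helper names and the `max_fd` bound are assumptions; `os.set_inheritable(fd, False)` is the standard-library call that sets the close-on-exec flag on POSIX.

```python
import os
import sys


def mark_fds_close_on_exec(max_fd: int = 1024) -> None:
    """Make every open descriptor above stderr non-inheritable."""
    for fd in range(3, max_fd):
        try:
            os.set_inheritable(fd, False)
        except OSError:
            pass  # fd is not open; nothing to do


def restart_in_place() -> None:
    """Re-exec the current process without carrying old sockets along."""
    mark_fds_close_on_exec()
    os.execv(sys.executable, [sys.executable, *sys.argv])
```

Because the old listening socket is closed at exec time, the new process can bind the original port instead of hopping to the next one.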
StyxX65
c43725ca7f Release 1.7.5
- CHANGELOG: cut the 1.7.5 release (dated 2026-06-11); reset Unreleased.
- VERSION: 1.7.4 -> 1.7.5.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 14:42:06 +02:00
StyxX65
a1712ae178 Make static files revalidate so the UI is fresh after updates
No Cache-Control header meant browsers cached JS/CSS heuristically for
days; after a server update (including the in-app self-update reload)
the backend was new but the frontend stayed stale.
SEND_FILE_MAX_AGE_DEFAULT=0 forces ETag revalidation: 304 when unchanged, fresh file
immediately after an update.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 14:39:45 +02:00
StyxX65
c1cddb8ea7 Release 1.7.4
- CHANGELOG: cut the 1.7.4 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.3 -> 1.7.4.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:33:16 +02:00
StyxX65
9cbd93e1f5 Reset all share modal fields after creating a link
Create only cleared the label; scope type, user email, date range, and
expiry carried over, so the next link silently inherited the previous
link's scope. Extracted openShareModal's reset logic into
_resetShareForm() and call it after every successful create.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:32:03 +02:00
StyxX65
d4cf2db347 Release 1.7.3
- CHANGELOG: cut the 1.7.3 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.2 -> 1.7.3.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:24:27 +02:00
StyxX65
d6bf80a68a Keep the same port across app restarts
The port probe did a plain bind() without SO_REUSEADDR, so TIME_WAIT
connections left by the previous instance (e.g. the in-app update
restart) made the port look occupied and the app hopped to the next
one. Probe with SO_REUSEADDR like Werkzeug binds, and give the
requested port a 10-second grace period before auto-incrementing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:23:18 +02:00
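The probing strategy from d6bf80a68a (probe with `SO_REUSEADDR`, give the requested port a grace period, then auto-increment) might look like this sketch. The function names, the polling interval, and the `tries` limit are assumptions; the 10-second grace period is from the commit message. Note that `SO_REUSEADDR` has different semantics on Windows, so this describes POSIX behaviour.

```python
import socket
import time


def port_is_free(host: str, port: int) -> bool:
    """Probe by binding with SO_REUSEADDR so TIME_WAIT leftovers don't count."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
            return True
        except OSError:
            return False


def pick_port(host: str, wanted: int, grace: float = 10.0, tries: int = 20) -> int:
    """Prefer the requested port; only hop to the next one after the grace period."""
    deadline = time.monotonic() + grace
    while True:
        if port_is_free(host, wanted):
            return wanted
        if time.monotonic() >= deadline:
            break
        time.sleep(0.25)  # assumed polling interval

    # Grace period expired: fall back to the next free port.
    for port in range(wanted + 1, min(wanted + tries, 65536)):
        if port_is_free(host, port):
            return port
    raise OSError("no free port found")
```

A port held by a live listener still fails the probe, so the fallback only triggers when another process genuinely owns the port.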
StyxX65
679f91da2c Use page origin for share links except when browsing at localhost
The LAN-IP rewrite in _getShareBaseUrl() exists to fix unusable
127.0.0.1 links; applying it to every origin meant links copied behind
a reverse proxy pointed at http://<LAN-IP>:5100, bypassing TLS. HTTPS
and non-localhost origins are now used as-is.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:14:33 +02:00
StyxX65
c79e7097ea Release 1.7.2
- CHANGELOG: cut the 1.7.2 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.1 -> 1.7.2.
- Manuals (DA + EN): bump version stamps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:10:59 +02:00
StyxX65
35e767b506 Fix copy buttons doing nothing over plain HTTP
navigator.clipboard is undefined in non-secure contexts, so the direct
writeText() call threw synchronously and the execCommand fallback in its
.catch() never ran. _copyText() now feature-detects the API, falls back
to execCommand('copy'), then to a prompt() for manual copying. log.js
reuses the helper; _getShareBaseUrl() caches the LAN-IP lookup so token
Copy buttons stay within the click gesture execCommand requires.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:09:34 +02:00
StyxX65
652031b31d Release 1.7.1
- CHANGELOG: cut the 1.7.1 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.7.0 -> 1.7.1.
- Manuals (DA + EN): bump version stamps; document the new
  Settings -> General -> Software update group (check/install/auto-update,
  git-checkout-only, self-restart, refused during scans).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:50:31 +02:00
StyxX65
df54b20735 Document software updates in README, refresh test suite table
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:47:58 +02:00
StyxX65
a325349ecd Fix stale ~/.gdpr_scanner_* paths in help text, docs, and UI strings
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:41:23 +02:00
StyxX65
6a4b0e1706 Show delta token source count, add hint bubble, fix README data paths
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:27:14 +02:00
StyxX65
c0e45df440 Add software update from Settings GUI and update_gdpr.sh script
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:54:29 +02:00
StyxX65
fcf32f3751 Release 1.7.0
- CHANGELOG: cut the 1.7.0 release (dated 2026-06-10); reset Unreleased.
- VERSION: 1.6.28 → 1.7.0.
- Manuals (DA + EN): bump version stamps; correct the redaction section
  (cards are now kept/greyed until the next scan, not removed) and add the
  same keep-until-next-scan note to the deletion section, including the
  partial-failure behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:06:36 +02:00
StyxX65
95f1f39a1f Keep data-subject-deleted cards in grid until next scan
Apply the keep-until-next-scan behaviour to deleteSubjectItems: mark the
deleted items _deleted (using deleted_ids from the response) and keep them
greyed in the grid instead of filtering them out. Also fixes a latent bug
where renderGrid() was called with no argument and threw on files.forEach,
which the surrounding try/catch swallowed as a false "Delete failed" after a
successful erasure.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:47:52 +02:00
StyxX65
386831c423 Keep bulk-deleted cards in grid until next scan
Extend the keep-until-next-scan behaviour to the bulk delete modal: instead
of removing matched cards on success, mark them _deleted and keep them greyed
with a "🗑 Deleted" badge and hidden buttons. /api/delete_bulk now returns
deleted_ids so the grid marks exactly the items the server actually deleted —
partial failures stay active and re-deletable. Already-handled (_deleted /
_redacted) items are excluded from the bulk-delete match set so they aren't
re-counted or re-processed.

201 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:46:14 +02:00
StyxX65
ed3c3a80d6 Keep deleted cards in grid until next scan
Mirror the redact behaviour for the card delete button (🗑): instead of
removing the card on success, mark the item _deleted and keep it in the grid
— greyed via card-resolved, shown with a red "🗑 Deleted" badge, action
buttons hidden so it can't be re-processed. The grid is rebuilt on the next
scan run, clearing the markers. results.js only — no server change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:44:10 +02:00
StyxX65
7c1c2b390d Keep selected card in view when opening preview
Opening the preview panel narrows .grid-area and reflows the auto-fill grid
to fewer columns, moving the clicked card to a new row. The single-frame
scrollIntoView ran while the browser's scroll-anchoring re-adjusted scrollTop
mid-reflow, so the card scrolled out of view. Disable scroll anchoring on
.grid-area (overflow-anchor:none) and defer the scroll by two animation
frames against the settled layout, centring the card (block:'center').

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:35:04 +02:00
StyxX65
d82a0d6004 Keep redacted cards in grid until next scan
Redacting a card (✏) previously removed it from the grid and from
S.flaggedData/S.filteredData immediately. Now the item is marked _redacted
and kept: greyed via card-resolved styling, shown with a "✏ Redacted" badge,
and its delete/redact buttons hidden so it can't be re-processed. The grid is
rebuilt on the next scan run, which clears the markers. results.js only — no
server change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:30:41 +02:00
StyxX65
1b3d7f5698 Fix card action buttons clipped in grid view (missing position:relative)
The real cause behind the invisible redact/delete buttons: .card lacked
position:relative, so the position:absolute action buttons (delete, redact)
and the bulk-select checkbox anchored to the viewport instead of the card
and were clipped by .card overflow:hidden. They only showed in list view,
where those elements are position:static. Add position:relative to .card so
all three position within each card. Keep the 0.35 baseline opacity on the
redact button for discoverability.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:24:00 +02:00
StyxX65
39500edfbc Changelog: note redact button visibility fix
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:21:37 +02:00
StyxX65
35fd00437f Fix redact button invisible in grid view
.card-redact-btn had opacity:0 at rest (only opacity:1 on .card:hover), so
the ✏ redact button was completely invisible in the default grid/thumbnail
view — it only showed in list view, which forces opacity:1. Give it the same
0.35 baseline opacity as .card-delete-btn so it's discoverable at rest and
brightens on hover. The button was always rendered in the DOM; this is a
pure visibility fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:20:06 +02:00
StyxX65
c39d68ca19 Document XSS escaping + secret-encryption hardening
- CHANGELOG: add Unreleased ### Security section covering the stored XSS
  in the results grid, the reflected XSS in /api/thumb, and the Claude API
  key now being encrypted at rest.
- CLAUDE.md / static/js/CLAUDE.md: add the esc() / _html_esc escaping rule
  for scan-derived strings and the onclick-JSON &quot; pattern.
- CLAUDE.md / routes/CLAUDE.md: note that secret config fields use the
  machine-keyed Fernet and must be read via a decrypting accessor
  (get_claude_api_key()), never config.json directly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:15:39 +02:00
StyxX65
b6d2915d49 Harden XSS escaping and encrypt Claude API key at rest
- results.js: add esc() helper and apply to all scan-derived fields
  (name, account_name, folder, source, modified, label, img alt) across
  card/list/preview/subject-lookup/related views. Scan-derived strings can
  carry attacker-controlled markup (e.g. a OneDrive file named with HTML),
  so they must be escaped before innerHTML/attribute embedding. Also escape
  the related-docs onclick JSON to match the delete/redact &quot; pattern.
- cpr_detector._placeholder_svg: escape label/name before embedding — served
  as image/svg+xml via /api/thumb?name=, so an unescaped value was a
  reflected-XSS vector when the URL is opened directly.
- cpr_detector: remove 44-line unreachable duplicate of the face-detection
  body left inside _extract_audio_metadata after its return.
- app_config: encrypt claude_api_key at rest with the machine-keyed Fernet
  (same as the SMTP password); add get_claude_api_key() for decryption.
  Legacy plaintext keys still read and are re-encrypted on next save.
  Update readers in document_scanner.py and routes/app_routes.py.

201 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:06:36 +02:00
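The `_placeholder_svg` hardening in b6d2915d49 comes down to escaping scan-derived strings before embedding them in markup served as `image/svg+xml`. A minimal sketch using the standard-library `html.escape`; the SVG layout and the function signature here are illustrative assumptions, not the project's actual output:

```python
from html import escape


def placeholder_svg(label: str, name: str) -> str:
    """Build a placeholder SVG with label and name escaped for markup."""
    # Scan-derived values may contain attacker-controlled markup.
    safe_label = escape(label, quote=True)
    safe_name = escape(name, quote=True)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">'
        f"<title>{safe_name}</title>"
        f'<text x="60" y="64" text-anchor="middle">{safe_label}</text>'
        "</svg>"
    )
```

With this in place, a file named `<script>alert(1)</script>` ends up as inert text in the thumbnail rather than executable markup.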
StyxX65
1903115e02 CLAUDE.md restructured 2026-06-08 14:44:37 +02:00
StyxX65
f845a2f686 ### Fixed - **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the scan_phase events in the replay buffer temporarily set S._m365ScanRunning = true (all running flags start at false after a page reload). The watchdog's loadHistorySession call fired in this window and bailed on the stale flag; once scan_done cleared the flag, _initialStatusChecked was already true so loadHistorySession was never retried. Fixed by having the sse_replay_done handler retry loadHistorySession(null) when no scan is running and S._historyRefScanId is still null after replay. 2026-06-08 14:28:24 +02:00
StyxX65
79e589b525 Bugfix in Scheduler 2026-06-04 14:47:01 +02:00
StyxX65
fa6601ffdd Bugfixes 2026-06-01 15:15:43 +02:00
StyxX65
4e5a8934d7 Fix Google scan not stopping cleanly before a new scan starts 2026-05-29 04:53:42 +02:00
StyxX65
66986a16f9 Extended in-place CPR redaction to Google Drive, SFTP, SMB, and local PDFs, then updated CLAUDE.md and both manuals. Everything is committed and all 201 tests pass. 2026-05-28 17:53:53 +02:00
StyxX65
034ced943e Extended document redaction to Google Drive, SFTP, SMB, and local PDFs Extends the ✂ in-place redaction feature beyond local DOCX/XLSX/CSV/TXT files to cover all remaining file source types and adds PDF support for local files. 2026-05-28 17:47:02 +02:00
StyxX65
6ce7583b26 Added NER/AI integration 2026-05-28 11:50:10 +02:00
StyxX65
6e0dc8ee92 Minor changes to layout in Manuals 2026-05-28 11:23:20 +02:00
StyxX65
26c45165b9 v1.6.28 — Scheduled report-only jobs, compliance audit log, and documentation update
- Scheduled jobs can now run in report-only mode (skip scan, email latest DB results)
- Compliance audit log records all significant admin actions in an immutable DB table
- VERSION bumped to 1.6.28; CHANGELOG [Unreleased] sealed as [1.6.28] — 2026-05-28
- Both manuals updated: CPR-only mode, OCR language, file redaction, related documents,
  date-range token scoping, report-only jobs, audit log tab, two new FAQ entries
- TODO.md updated with all completed tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:08:52 +02:00
StyxX65
744813f4ac Add compliance audit log
Immutable audit_log table in the scanner DB records every significant
admin action (profile save/delete, token create/revoke, PIN changes,
source add/update/delete, scheduler job changes, scan start/stop, SMTP
save, dispositions, item delete/redact). GET /api/audit_log exposes
entries newest-first. New Audit Log tab in the Settings modal renders
the table on demand. Settings modal widened 540→640 px and tab labels
set to white-space:nowrap so the six-tab row fits on one line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:51:23 +02:00
StyxX65
4ef2dfb352 Date-range scoping for viewer tokens 2026-05-28 10:34:55 +02:00
StyxX65
c820d6f6db Two bugs in the abort mechanism:
1. POST /api/scan/stop only set state._scan_abort (M365/file abort event) but never touched state._google_scan_abort. Now sets both.
2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort (= state._scan_abort, the M365 event) instead of using the module-level _scan_abort alias (= state._google_scan_abort). This meant the dedicated /api/google/scan/cancel endpoint, which correctly sets _google_scan_abort, was silently ignored by the scan loop. Fixed to use the module-level alias consistently. Also aligned the end-of-scan checkpoint-clear check.
2026-05-28 10:20:22 +02:00
StyxX65
7ffd8370f4 Fix Stop button not halting Google Workspace scan
Two bugs in the abort mechanism:

1. POST /api/scan/stop only set state._scan_abort (M365/file abort event)
   but never touched state._google_scan_abort. Now sets both.

2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort
   (= state._scan_abort, the M365 event) instead of using the module-level
   _scan_abort alias (= state._google_scan_abort). This meant the dedicated
   /api/google/scan/cancel endpoint — which correctly sets _google_scan_abort
   — was silently ignored by the scan loop. Fixed to use the module-level
   alias consistently. Also aligned the end-of-scan checkpoint-clear check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:19:54 +02:00
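The two abort bugs fixed in 7ffd8370f4 can be pictured with a pair of `threading.Event` objects. This is a simplified sketch: the names below are stand-ins for `state._scan_abort` and `state._google_scan_abort`, and the scan loop is reduced to iterating over a list.

```python
import threading

scan_abort = threading.Event()         # stand-in for the M365/file abort event
google_scan_abort = threading.Event()  # stand-in for the Google abort event


def stop_all_scans() -> None:
    """What the Stop endpoint should do: signal every engine, not just one."""
    scan_abort.set()
    google_scan_abort.set()


def run_google_scan(items: list) -> list:
    """Process items until the Google-specific abort event is set."""
    processed = []
    for item in items:
        # Check the Google event, not the M365 one (the second bug).
        if google_scan_abort.is_set():
            break
        processed.append(item)
    return processed
```

Checking the wrong event is easy to miss because both events behave identically; the scan loop simply never sees the signal.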
StyxX65
2c5f5d3283 Add OCR language override setting
Operators can now choose Tesseract language pack(s) per profile via a
sidebar select (#optOcrLang) and profile editor (#peOptOcrLang). Presets:
dan+eng (default), dan, eng, dan+eng+deu, dan+eng+swe, dan+eng+fra. The
ocr_lang option flows from the UI through all three scan engines (M365
files/attachments, Google Drive, Gmail) down to document_scanner.scan_pdf
and scan_image — including the spawned PDF-OCR subprocess worker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:59:40 +02:00
StyxX65
23b9555dcf Built-in file redaction for local files 2026-05-27 14:49:06 +02:00
StyxX65
c490b3d76a Merge remote CHANGELOG entries and add Preview section to CLAUDE.md
Resolved conflict in CHANGELOG.md: combined the two bug fixes from the
remote branch (stale history results, selected card scroll) with the
local Gmail/Drive preview fix under a single [1.6.26] — 2026-04-29 entry.
Added Preview dispatch documentation to CLAUDE.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:43:59 +02:00
StyxX65
051a53ae85 Update CHANGELOG.md 2026-05-27 13:40:21 +02:00
Henrik Højmark
99157e6fd7 Update CHANGELOG for version 1.6.26
Updated release date for version 1.6.26 and added detailed fixes related to scan history, card visibility, and Google Drive/Gmail previews.
2026-05-27 13:38:40 +02:00
StyxX65
78fb406422 Fixed two bugs: selected cards staying visible after preview opens, and stale history results showing when a new scan starts. 2026-04-29 15:18:58 +02:00
StyxX65
a76df463e8 Changelog updated 2026-04-27 18:47:43 +02:00
StyxX65
ce5a5f1cbb Fixed Gmail and Google Drive preview: items were being sent to the Microsoft Graph API instead of handled correctly. 2026-04-26 11:04:05 +02:00
StyxX65
d84e57239a Add CPR cross-referencing (related documents)
Clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel,
listing other items from the same scan session that share at least one CPR number. Items are ordered by number of
shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and scan history mode.

Implementation
- GDPRDb.get_related_items(): SQL self-join on the existing cpr_index table using the same symmetric 300 s session
  window as get_session_items. No new data collection needed.
- GET /api/db/related/<item_id>?ref=N: new endpoint in routes/database.py, consistent with the ?ref convention used
  by /api/db/flagged.
- #previewRelated div injected between the metadata block and disposition row in the preview panel.
- _loadRelated(f) in results.js fetches and renders the list; window._openRelated() resolves items from the live
  grid or falls back to the API response for history-mode items.

Also
- Added keyword/FTS5 search as a deferred idea in SUGGESTIONS.md
- Updated CHANGELOG.md, README.md, and CLAUDE.md
2026-04-25 21:15:50 +02:00
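The self-join behind `get_related_items()` in d84e57239a can be sketched against a toy table. The column names `item_id` and `cpr_hash` are assumptions about `cpr_index`, and the 300 s session window mentioned in the commit is left out to keep the example short.

```python
import sqlite3


def get_related_items(conn: sqlite3.Connection, item_id: str) -> list:
    """Return (other_item_id, shared_cpr_count), most shared CPRs first."""
    return conn.execute(
        """
        SELECT other.item_id, COUNT(DISTINCT other.cpr_hash) AS shared
        FROM cpr_index AS mine
        JOIN cpr_index AS other
          ON other.cpr_hash = mine.cpr_hash
         AND other.item_id <> mine.item_id
        WHERE mine.item_id = ?
        GROUP BY other.item_id
        ORDER BY shared DESC, other.item_id
        """,
        (item_id,),
    ).fetchall()


def demo() -> list:
    """Four items; 'a' shares two CPRs with 'b', one with 'c', none with 'd'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT)")
    conn.executemany(
        "INSERT INTO cpr_index VALUES (?, ?)",
        [("a", "h1"), ("a", "h2"), ("b", "h1"), ("b", "h2"),
         ("c", "h2"), ("d", "h3")],
    )
    return get_related_items(conn, "a")
```

Because the join reuses the existing index table, no extra data has to be collected at scan time, which matches the commit's note.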
StyxX65
8b55e9d933 Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items. 2026-04-25 20:30:59 +02:00
StyxX65
2254e00481 recap: Added email and phone number detection as opt-in scan options across all three engines, plus translation fixes. Both CHANGELOG and SUGGESTIONS are updated — everything is committed and ready to test. 2026-04-25 19:33:28 +02:00
StyxX65
56a744d896 Fixed missing translation in Sources 2026-04-25 10:57:41 +02:00
StyxX65
9da4403bdf Update VERSION 2026-04-25 08:51:28 +02:00
StyxX65
e35bbe78a5 Added SFTP to sources 2026-04-25 08:48:54 +02:00
61 changed files with 5698 additions and 658 deletions

@@ -7,6 +7,232 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
---
## [Unreleased]
---
## [1.7.9] — 2026-06-22
### Added
- **"Always send via SMTP" option for email reports** — new toggle in **Settings → E-mailrapport**. When the scanner is signed in to Microsoft 365 it normally sends email through Microsoft Graph; Graph reports "accepted" the instant a message is queued, which hides the case where Exchange Online later silently drops it (e.g. a recipient on a Google-hosted subdomain of your Microsoft 365 domain — the message is treated as internal, finds no mailbox, and is discarded, with no delivery and no bounce). Enabling this option makes the manual report, the test email, and the after-scan auto-email all go straight through your configured SMTP server (e.g. Google Workspace `smtp.gmail.com` / `smtp-relay.gmail.com`), bypassing the Graph routing entirely.
### Changed
- **The results grid now shows every open item by default, not just the last scan** — when you open the app (or refresh after a scheduled or manual scan), the grid loads *all* flagged items that still need action — i.e. those with no disposition — across every scan, instead of only the most recent scan session. Items you have already tagged (kept, redacted, deleted, false positive, …) drop out of the view. Re-scans are de-duplicated so each item appears once, showing its most recent state. The session picker still loads any individual past scan, and the history banner button (formerly "Latest scan") is now **"Open items"** and returns to this default view.
### Fixed
- **Interrupted scans no longer lose their results** — a scan only became visible once it was *finalised*, but the Microsoft 365 and Google scan engines skipped finalisation when a scan was stopped, and any scan cut short by a server restart, crash, or out-of-memory kill never finalised at all. Its already-found items were then stranded in the database and invisible in the grid (this is what caused "scan finished but no results shown", especially after the in-app self-update restarts). Unfinished scans are now finalised automatically on startup (nothing is scanning at boot, so any unfinished scan is known to be dead), and a manually stopped Microsoft 365 scan finalises immediately so its partial results stay visible.
- **User and group badges were missing on result cards loaded from the database** — the reviewer's display name was shown live during a scan but never saved, so cards loaded from a past scan (now the default view) lost both the person badge and the Elev/Ansat group badge. The display name is now stored with each item, and the group badge is shown from the saved role even for older items that predate this fix (where a name can't be recovered, the group badge and a resolved e-mail still appear).
- **Email reports sent via SMTP failed with "authentication failed"** — the **Settings → E-mailrapport** tab saved the SMTP username under the wrong field name, so the username never reached the mail server and sign-in was skipped — the server then rejected the unauthenticated message, which surfaced as a misleading authentication error even with a correct password or app password. The setting is now saved correctly, and configurations saved before the fix are migrated automatically.
---
## [1.7.8] — 2026-06-16
### Fixed
- **Blank results grid after a browser refresh (especially after a server restart)** — restoring the last scan session on page load was one-shot: `_sseWatchdog()` called `loadHistorySession(null)` a single time, guarded by `_initialStatusChecked`. If that attempt was blocked — a completed scan's replayed `scan_phase` event leaves a `_*ScanRunning` flag set, and the `loadHistorySession` guard then bails — nothing retried, because `sse_replay_done` (the other retry path) only fires when the SSE replay buffer is non-empty, and the buffer is empty after a server restart (so refreshing after the in-app self-update reliably showed an empty grid even though the results were in the database). The watchdog now re-attempts the restore on every 4-second poll while nothing is shown and no scan is running, clearing stale running flags first (both scan locks are confirmed free at that point). Additionally, `/api/scan/status` now reports `google_running` separately from `running` (which only ever reflected the M365 + file lock), so a refresh during a live Google scan is detected instead of being treated as idle.
---
## [1.7.7] — 2026-06-15
### Changed
- **Share modal no longer leaves a stale link in the create box** — after clicking "Create", the generated-link preview row ("Copy link:") stayed visible at the top of the modal even though the new link was already listed under Active links with its own Copy button — so it looked like the form hadn't cleared. The redundant preview row is removed; creating a link now resets the form and briefly highlights the new entry in the Active links list, where it can be copied. (The 1.7.4 fix cleared the input fields but not this preview row.)
### Added
- **Reverse-proxy / HTTPS setup guide** — new `docs/setup/ZORAXY_SETUP.md` walks through putting the scanner behind Zoraxy with a Let's Encrypt certificate on a LAN-only deployment: DNS A-record to a private IP, ACME via DNS-01 challenge (HTTP-01 cannot reach a LAN-only host), proxy rule to `127.0.0.1:5100`, binding the app to loopback with `--host 127.0.0.1`, and scanner-specific verification (SSE streaming, HTTPS share links, self-update). Linked from the README (new "HTTPS / reverse proxy" section) and SECURITY.md.
### Fixed
- **SECURITY.md corrections** — the web UI binds to `0.0.0.0` by default, not `127.0.0.1` as claimed; the MSAL token cache path was still the pre-1.x `~/.gdpr_scanner_config.json` (actual: `~/.gdprscanner/token.json`).
---
## [1.7.6] — 2026-06-11
### Fixed
- **Update restart leaked the listening socket and hopped to port 5101** — Werkzeug marks its server socket inheritable (`srv.socket.set_inheritable(True)`, unconditionally, for its debug reloader), so the in-app update's `os.execv` restart carried the old listening socket into the new process as a zombie listener: same PID listening on both 5100 (never accepted — clients hang) and 5101 (the actual server). The 1.7.3 `SO_REUSEADDR`/grace-period fix couldn't help because the port genuinely was occupied — by the restarting process itself. `_restart_self()` now marks every fd above stderr close-on-exec before the exec (`_mark_fds_cloexec()`, enumerating `/proc/self/fd` on Linux), so the old socket dies with the exec and the new server rebinds 5100 immediately.
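
The close-on-exec step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `_mark_fds_cloexec()`: the function name, the `min_fd` parameter, and the fixed-range fallback for platforms without `/proc` are assumptions made for the example.

```python
import os


def mark_fds_cloexec(min_fd: int = 3) -> list[int]:
    """Clear the inheritable flag on every open fd >= min_fd.

    Hypothetical stand-in for the changelog's `_mark_fds_cloexec()`.
    Returns the descriptors that were marked.
    """
    try:
        # Linux: /proc/self/fd has one entry per open descriptor.
        candidates = [int(name) for name in os.listdir("/proc/self/fd")]
    except OSError:
        # Assumed fallback where /proc is unavailable: probe a fixed range.
        candidates = list(range(min_fd, 1024))

    marked = []
    for fd in sorted(candidates):
        if fd < min_fd:
            continue  # leave stdin/stdout/stderr inheritable
        try:
            # Non-inheritable means close-on-exec on POSIX systems.
            os.set_inheritable(fd, False)
        except OSError:
            continue  # fd already closed (e.g. the handle listdir used)
        marked.append(fd)
    return marked


# Intended call order before an in-place restart (not executed here):
#     mark_fds_cloexec()
#     os.execv(sys.executable, [sys.executable, *sys.argv])
```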
---
## [1.7.5] — 2026-06-11
### Fixed
- **Stale UI after updating the server** — Flask served `/static/` files with no `Cache-Control` header, so browsers cached JS/CSS heuristically (often for days). After a server update — including the new in-app self-update, whose post-install reload hit the cache — the backend was new but the frontend stayed old, and fixes appeared "not to work" until a hard refresh. `SEND_FILE_MAX_AGE_DEFAULT = 0` now makes every static file revalidate via ETag: unchanged files answer with a cheap 304, changed files are re-fetched immediately on the next normal page load.
---
## [1.7.4] — 2026-06-10
### Fixed
- **Share modal kept stale input after creating a link** — clicking "Create" only cleared the label field; scope type, user email, date range, and expiry kept their values, so the next link silently inherited the previous link's scope settings. The form-reset logic from `openShareModal()` is now a shared `_resetShareForm()` helper called after every successful create (the generated link row stays visible for copying).
---
## [1.7.3] — 2026-06-10
### Fixed
- **App restart no longer hops to a new port** — the in-app update restart (and any quick stop/start) left connections from the previous instance in TIME_WAIT, and the startup port probe did a plain `bind()` that treats TIME_WAIT as occupied — so the restarted app silently came up on 5101 and the browser's reload poll never found it. The probe now sets `SO_REUSEADDR` (matching how Werkzeug actually binds, so an actively listening port is still detected as occupied), and the requested port gets a 10-second grace period before the auto-increment fallback kicks in, covering the brief window where the old process hasn't fully released the socket.
- **Share links now respect a reverse proxy** — `_getShareBaseUrl()` rewrote every copied share link to `http://<LAN-IP>:5100` (via `/api/local_ip`), which would bypass TLS when the scanner sits behind a reverse proxy (Zoraxy, Caddy, nginx, …): a DPO opening the link would silently fall back to plain HTTP. The LAN-IP rewrite now only applies in the case it was built for: browsing the app at `localhost` over HTTP, where `window.location.origin` would produce links unusable from other machines. Any HTTPS or non-localhost origin is used as-is.
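
The port-probe behaviour described in the restart fix above could look roughly like this. It is a sketch under stated assumptions: `port_is_free` and `pick_port` are hypothetical names, the polling interval and hop limit are invented for illustration, and the "listening port still counts as occupied" behaviour of `SO_REUSEADDR` is the Linux one (Windows treats the option differently).

```python
import socket
import time


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if `port` can be bound right now.

    SO_REUSEADDR lets the bind succeed over sockets lingering in TIME_WAIT,
    while an actively listening socket still makes it fail (Linux semantics).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except (OSError, OverflowError):
            return False
    return True


def pick_port(requested: int, host: str = "0.0.0.0",
              grace_seconds: float = 10.0, poll_interval: float = 0.25,
              max_hops: int = 20) -> int:
    """Wait up to `grace_seconds` for `requested`, then try the next ports."""
    deadline = time.monotonic() + grace_seconds
    while True:
        if port_is_free(requested, host):
            return requested
        if time.monotonic() >= deadline:
            break  # grace period used up, fall back to auto-increment
        time.sleep(poll_interval)

    for candidate in range(requested + 1, requested + 1 + max_hops):
        if port_is_free(candidate, host):
            return candidate
    raise RuntimeError(f"no free port in {requested}..{requested + max_hops}")
```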
---
## [1.7.2] — 2026-06-10
### Fixed
- **Copy buttons did nothing over plain HTTP** — the share modal's "Copy" buttons (new link + active links) and the log panel's copy button called `navigator.clipboard.writeText()` directly. The Clipboard API only exists in secure contexts (HTTPS or localhost), so when the scanner is reached at `http://<LAN-IP>:5100` the call threw synchronously and the intended `execCommand` fallback never ran — the button silently did nothing. `_copyText()` in `viewer.js` now feature-detects the API, falls back to `document.execCommand('copy')`, and as a last resort shows the link in a `prompt()` for manual copying; `log.js` reuses the same helper via `window._copyText`. `_getShareBaseUrl()` now caches the LAN-IP lookup so the token-list Copy buttons copy synchronously within the click gesture (required for `execCommand`).
---
## [1.7.1] — 2026-06-10
### Added
- **Software update from the GUI** — a new **Settings → General → Software update** group lets the operator check for and install updates without touching the server shell. "Check for updates" fetches origin and shows either "You are running the latest version" or the list of pending commits; "Install update" fast-forwards the git checkout to `origin/<branch>`, reinstalls dependencies only if `requirements.txt` changed, writes an `app_update` audit-log entry, and restarts the app in place by re-exec'ing the process (`os.execv` — same PID, so it works both under systemd and when launched via `start_gdpr.sh`). The page polls until the server is back and reloads itself. Local server-side edits are auto-stashed (kept, never discarded) before the merge. Updating is refused with a clear message while any scan is running. An **"Install updates automatically"** toggle (stored in `config.json` under `auto_update`) enables a background thread that checks once a day and installs unattended, skipping (and retrying hourly) while a scan runs. The group is only shown when the app runs from a git checkout — the frozen desktop build hides it. New blueprint `routes/updates.py` with `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`; 11 new tests in `tests/test_updates.py` with fully mocked git.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the GUI update: fetch + fast-forward-only merge with auto-stash of local hotfixes, dependency reinstall only when `requirements.txt` changed, and a `systemctl restart` if a `gdprscanner.service` unit exists (override with `GDPR_SERVICE`). `./update_gdpr.sh --check` reports pending commits without changing anything; safe to run from cron (quiet no-op when already up to date).
### Fixed
- **Delta token status hid the source count** — the "Tokens saved" line under the Δ Delta scan toggle always showed the bare translation ("Tokens gemt") because the source count only existed in the JS fallback string, which is ignored whenever the lang key exists. The translations now carry a `{n}` placeholder ("Tokens gemt for {n} kilde(r)") substituted in `checkDeltaStatus()`, and the row gained a "?" hint bubble explaining what the saved change-tokens do and that "Clear tokens" forces the next scan to be a full scan.
- **Stale data-file paths in docs and UI text** — README, SECURITY.md, MAINTAINER.md, the `--headless` argparse help (`--settings`, `--reset-db`, epilog), the DB-import replace warning/confirm strings (all three languages), and two code comments still referenced the pre-1.x flat dotfile layout (`~/.gdpr_scanner_delta.json`, `~/.gdpr_scanner_smtp.json`, `~/.gdpr_scanner_machine_id`, `~/.gdpr_scanner.db`). All now point to the actual locations under `~/.gdprscanner/` (`delta.json`, `smtp.json`, `machine_id`, `scanner.db`). The legacy-migration rename tables in `gdpr_scanner.py` intentionally keep the old names.
---
## [1.7.0] — 2026-06-10
### Added
- **PDF redaction for local files** — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (`page.apply_redactions()`), which removes the underlying text data from the PDF stream rather than just painting over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract, then physically painted over and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.
- **Google Drive file redaction** — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via `files().update()`. Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the `drive` scope (not `drive.readonly`) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`.
- **SFTP file redaction** — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with `sftp.open(path, "wb")`. Source config is matched from `_load_file_sources()` by host + username; credentials are resolved from the keychain via `_resolve_sftp_credentials`. Requires the item to be in the current session's `state.flagged_items` (SFTP host info is not stored in the DB). New method: `SFTPScanner.write_file(remote_path, content)`.
- **SMB file redaction** — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from `full_path` (`//host/share/…`). File is downloaded and re-uploaded using smbprotocol with `CreateDisposition.FILE_SUPERSEDE` so the file is atomically replaced. New function: `file_scanner.write_smb_file(path, content, username, password, domain)`.
- **AI-enhanced NER via Claude** — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in **Settings → AI / NER**: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, `document_scanner.py` calls the Claude API (`claude-haiku-4-5-20251001`) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the `anthropic` package is not installed. API key stored in `config.json` under `claude_api_key`; toggle stored under `claude_ner`. Routes: `GET/POST /api/settings/claude`, `POST /api/settings/claude/test`.
### Changed
- **Redacted and deleted cards stay in the grid until the next scan** — previously redacting (✏) or deleting (🗑) a card — or running a bulk delete — removed the affected cards from the grid and from `S.flaggedData`/`S.filteredData` immediately. Now each item is kept and marked: the card is greyed (`card-resolved` styling), shows a `✏ Redacted` (green) or `🗑 Deleted` (red) badge, and its action buttons are hidden so it can't be re-processed. The operator can see what was handled during the session; the grid is rebuilt on the next scan run, which clears the markers. Implemented with `_redacted` / `_deleted` flags in `results.js` (`appendCard`, `redactItem`, `deleteItem`, `executeBulkDelete`, `deleteSubjectItems`); handled items are also excluded from the bulk-delete match set. `POST /api/delete_bulk` now returns `deleted_ids` so the grid marks exactly the items the server actually deleted (partial failures stay active). Also fixes a latent bug in the data-subject delete flow where `renderGrid()` was called with no argument and threw, falsely reporting "Delete failed" after a successful erasure.
### Fixed
- **Selected card scrolled out of view when opening the preview** — opening the preview panel narrows `.grid-area`, which reflows the `auto-fill` grid to fewer columns and moves every card to a new row. The single-frame `scrollIntoView` ran while the browser's scroll-anchoring re-adjusted `scrollTop` mid-reflow, fighting the scroll so the clicked card ended up off-screen. Fixed by disabling scroll anchoring on `.grid-area` (`overflow-anchor: none`) and deferring the scroll by two animation frames so it runs against the settled layout; the card is now centred (`block: 'center'`) instead of `'nearest'` so it stays clearly visible.
- **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay.
- **Settings modal too narrow for seven tabs** — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.
- **Card action buttons invisible in grid view** — `.card` was missing `position: relative`, so the `position:absolute` delete (🗑), redact (✏), and bulk-select checkbox elements anchored to the viewport instead of the card and were then clipped away by the card's `overflow:hidden`. They only appeared in list view, where those elements are `position:static` and flow inline. Added `position: relative` to `.card` so all three position correctly within each card. Also gave `.card-redact-btn` the same `0.35` baseline opacity as the delete button (it was `opacity:0` at rest) so it's discoverable without hovering.
### Security
- **Stored XSS in the results grid** — scan-derived strings (file name, account/display name, folder, source label, modified date, image `alt`) were interpolated straight into `innerHTML` and `title=` attributes across the card, list, preview, data-subject lookup, and related-documents views. Because these values come from scanned content (e.g. a OneDrive file deliberately named with markup), a crafted filename could execute script in a reviewer's session — including a shared read-only viewer/DPO session. A new `esc()` helper in `static/js/results.js` (escapes `& < > " '`) is now applied to every untrusted field before embedding. The related-documents `onclick` JSON is also escaped with `.replace(/"/g,'&quot;')` to match the delete/redact button pattern, closing an attribute-injection hole where a filename containing `"` could break out of the handler.
- **Reflected XSS in `/api/thumb`** — the `?name=` query parameter was embedded unescaped into the placeholder SVG served as `image/svg+xml`, so opening a crafted `/api/thumb?name=<script>…` URL directly executed script in the app origin. `cpr_detector._placeholder_svg` now HTML-escapes both the type label and the filename before embedding them in the SVG.
- **Claude API key now encrypted at rest** — the Anthropic API key was stored in plaintext in `config.json` while the SMTP password was already Fernet-encrypted. `save_claude_config()` now encrypts the key with the same machine-keyed Fernet (`_encrypt_password`); a new `get_claude_api_key()` decrypts it for use. Legacy plaintext keys are still read transparently and re-encrypted on the next save. Readers in `document_scanner.py` and `routes/app_routes.py` updated accordingly.
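
The `/api/thumb` fix above amounts to escaping untrusted text before it is embedded in SVG markup. A minimal sketch of that idea, assuming a simplified placeholder layout: the function name `placeholder_svg`, its dimensions, and its styling are invented here and are not the project's `cpr_detector._placeholder_svg`.

```python
from html import escape


def placeholder_svg(type_label: str, filename: str) -> str:
    """Build a small placeholder thumbnail with untrusted text escaped.

    Hypothetical, simplified layout; only the escaping step mirrors the fix.
    """
    # escape() converts & < > and, with quote=True, " and ' to entities,
    # so markup inside a file name is rendered as text instead of executed.
    safe_label = escape(type_label, quote=True)
    safe_name = escape(filename, quote=True)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="160" height="120">'
        '<rect width="100%" height="100%" fill="#eee"/>'
        f'<text x="80" y="55" text-anchor="middle">{safe_label}</text>'
        f'<text x="80" y="80" text-anchor="middle">{safe_name}</text>'
        "</svg>"
    )
```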
---
## [1.6.28] — 2026-05-28
### Added
- **Date-range scoping for viewer tokens** — tokens can now carry optional `valid_from` and `valid_to` scope fields (YYYY-MM-DD). When set, `GET /api/db/flagged` filters out items whose `modified` date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces `valid_from ≤ valid_to`. All three scope dimensions (role, user, date-range) are independent and combinable.
- **CPR-only mode** — a new `cpr_only` scan option (sidebar toggle `#optCprOnly`, profile editor `#peOptCprOnly`) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. With `cpr_only=false` (the default), such items are still flagged and shown on cards as before. The check is applied at four points across the three engines: the file-scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
- **OCR language override** — a new `ocr_lang` scan option (sidebar select `#optOcrLang`, profile editor `#peOptOcrLang`) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng` (default), `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. The setting flows from the UI through the profile, into all three scan engines (M365 `_scan_bytes_timeout`, M365 attachments `_scan_bytes`, M365 files `_scan_bytes`, Google `_scan_bytes` for both Gmail and Drive). The `lang` parameter is threaded through `cpr_detector._scan_bytes` into `document_scanner.scan_pdf` / `scan_image` and the spawned PDF-OCR subprocess worker. The OCR cache key already included `lang`, so per-language results are cached independently.
- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
- **`DELETE /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.
- **Scheduled report-only email job** — scheduled jobs can now be configured as "report only" (toggle `#schedReportOnly`). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via `get_session_items()`. M365 authentication is not required for report-only jobs — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting `report_only=True` in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).
- **Compliance audit log** — every significant admin action is now written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via `GET /api/audit_log` (returns newest-first, max 1000 entries; filterable by `?action=`). Visible in the Settings modal under a new **Audit Log** tab; the table refreshes whenever the tab is opened. The `log_audit_event()` module-level helper in `gdpr_db.py` silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.
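
The atomic rewrite described for local-file redaction above (temp file in the same directory, then a move over the original) can be sketched like this for a plain-text file. The CPR pattern is a deliberately simplified assumption and the function name is hypothetical; the project's own redact functions in `document_scanner` are not shown here.

```python
import os
import re
import tempfile

# Simplified, assumed CPR pattern: six digits, optional hyphen, four digits.
CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def redact_text_file(path: str, mask: str = "██████-████") -> int:
    """Replace CPR-like numbers in a UTF-8 text file; return the hit count."""
    with open(path, encoding="utf-8") as fh:
        original = fh.read()
    redacted, count = CPR_RE.subn(mask, original)

    # The temp file lives next to the target so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(redacted)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp_path, path)  # readers see the old or the new file
    except BaseException:
        # On any failure the original is untouched; drop the partial copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return count
```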
### Fixed
- **Stop button had no effect on Google Workspace scans** — `POST /api/scan/stop` only set `state._scan_abort` (the M365/file abort event) and never touched `state._google_scan_abort`. Separately, `_check_abort()` inside `_run_google_scan` was checking `gdpr_scanner._scan_abort` (the M365 event) instead of the module-level `_scan_abort` alias that points to `state._google_scan_abort`. Both bugs combined meant neither the Stop button nor `POST /api/google/scan/cancel` had any effect on a running Google scan. Fixed by having `scan_stop()` set both events and having `_check_abort()` use the correct module-level alias.
- **Settings tab labels wrapping to two lines** — adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry `white-space:nowrap`; `.settings-tabs` retains `flex-wrap:wrap` as a safety net on very small screens.
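
The Stop-button fix above comes down to two things: the stop handler must set both abort events, and the Google engine must check its own event. A self-contained sketch using `threading.Event`, where `ScanState`, `run_google_scan`, and the `on_item` callback are hypothetical stand-ins for the app's shared state and scan loop:

```python
import threading
from typing import Callable, Iterable, Optional


class ScanState:
    """Hypothetical stand-in for the app's shared `state` module."""

    def __init__(self) -> None:
        self._scan_abort = threading.Event()         # M365 + file scans
        self._google_scan_abort = threading.Event()  # Google Workspace scans


state = ScanState()


def scan_stop() -> None:
    """Signal every engine to stop, not only the M365/file one."""
    state._scan_abort.set()
    state._google_scan_abort.set()


def run_google_scan(items: Iterable[str],
                    on_item: Optional[Callable[[str], None]] = None) -> list[str]:
    """Process items until the Google abort event is set."""
    state._google_scan_abort.clear()  # a fresh scan starts un-aborted
    processed = []
    for item in items:
        # Check the Google event here; checking the M365 event was the bug.
        if state._google_scan_abort.is_set():
            break
        processed.append(item)
        if on_item is not None:
            on_item(item)  # e.g. a UI action arriving while the scan runs
    return processed
```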
---
## [1.6.27] — 2026-05-27
### Added
- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.
- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).
- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.
---
## [1.6.26] — 2026-04-29
### Fixed
- **Previous scan results visible when a new scan starts** — two async functions (`loadHistorySession` and `loadLastScanSummary`) could resolve after `startScan` had already cleared the grid. `loadHistorySession` would re-populate the grid with old history items; `loadLastScanSummary` would re-show the last-scan summary card. Both functions now bail early after each `await` if any of the three scan-running flags (`S._m365ScanRunning`, `S._googleScanRunning`, `S._fileScanRunning`) is set — those flags are written synchronously by `startScan` before any awaits, so the check is race-free.
- **Selected card scrolls out of view when preview panel opens** — clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. `openPreview()` now schedules a `requestAnimationFrame` after removing `.hidden` from the panel so the card is scrolled back into view (`scrollIntoView block: nearest`) once the layout has settled.
- **Gmail and Google Drive preview crashed with a 404 Graph API error** — `_source_type` was never set on Google items in `routes/google_scan.py`, so Gmail and Google Drive cards carried an empty `source_type`. The preview route in `routes/database.py` only checked for `"local"`, `"smb"`, and `"email"` before falling through to the M365 else-branch, which tried to call `https://graph.microsoft.com/.../drive/items/gmail:{id}/preview`, which always returned a 404. Fixed by tagging Gmail items as `_source_type = "gmail"` and Google Drive items as `"gdrive"` at scan time. The preview route now handles both: Google Drive files get an embeddable `https://drive.google.com/file/d/{id}/preview` iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The `state.connector` (M365 auth) guard was also moved inside the `email` and M365 `else` branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.
---
## [1.6.25] — 2026-04-25
### Added
- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
- **CPR cross-referencing (related documents)** — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects `?ref=N`). Powered by a self-join on the existing `cpr_index` table — no new data collection needed. New `GDPRDb.get_related_items(item_id, ref_scan_id)` method and `GET /api/db/related/<item_id>?ref=N` endpoint in `routes/database.py`. Frontend: `#previewRelated` div in the preview panel, `_loadRelated(f)` in `results.js`, `window._openRelated(id, itemData)` helper (looks up live `S.flaggedData` first, falls back to API response for history items).
- **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
- **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
- **`POST /api/file_sources/upload_key`** — new endpoint that validates and stores an SSH private key file, returning a `key_path` for use in the source definition.
- **SFTP entry in export SOURCE_MAP** — Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (`EDE9F7`), consistent with the existing per-source tab and summary table logic.
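
The "related documents" lookup described above is a self-join on the CPR index. The sketch below shows the shape of such a query against an in-memory SQLite database. The column names (`item_id`, `cpr_hash`, `scan_id`) and the function name are assumptions for illustration; the real `cpr_index` schema and `GDPRDb.get_related_items()` may differ.

```python
import sqlite3

# Assumed, simplified schema for the example only.
SCHEMA = "CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT, scan_id INTEGER)"

RELATED_SQL = """
SELECT b.item_id, COUNT(DISTINCT b.cpr_hash) AS shared
FROM cpr_index AS a
JOIN cpr_index AS b
  ON b.cpr_hash = a.cpr_hash
 AND b.scan_id = a.scan_id
 AND b.item_id != a.item_id
WHERE a.item_id = ? AND a.scan_id = ?
GROUP BY b.item_id
ORDER BY shared DESC, b.item_id
"""


def get_related_items(conn: sqlite3.Connection, item_id: str,
                      scan_id: int) -> list[tuple[str, int]]:
    """Return (item_id, shared_cpr_count) pairs, most shared CPRs first."""
    return conn.execute(RELATED_SQL, (item_id, scan_id)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory index: A and B share two CPRs, C shares one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("A", "c1", 1), ("A", "c2", 1),
        ("B", "c1", 1), ("B", "c2", 1),
        ("C", "c1", 1),
        ("D", "c3", 1),  # no overlap with A
        ("E", "c1", 2),  # same CPR, but a different scan session
    ]
    conn.executemany("INSERT INTO cpr_index VALUES (?, ?, ?)", rows)
    return conn
```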
### Fixed
- **File source form placeholders untranslated** — all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new `data-i18n-placeholder` keys added to `en.json`, `da.json`, and `de.json`; all 12 affected `<input>` elements now carry `data-i18n-placeholder` attributes.
- **"Name" and "Auth" labels untranslated in SFTP form** — the source-name label and the Auth toggle label in the add-source panel had no `data-i18n` attributes. Added keys `m365_fsrc_name` (DA: "Navn") and `m365_fsrc_sftp_auth` (same across languages). The name label used an inner `<span data-i18n>` to preserve the required-field `*` indicator, which would have been clobbered by a `data-i18n` on the outer `<label>` element. The same clobber bug was fixed for the `m365_fsrc_label` usage in the edit form.
- **Password field placeholder showed "Stored in OS keychain" in English** — added translation key `m365_fsrc_pw_keychain_placeholder` (DA: "Gemt i OS-nøglering") and applied `data-i18n-placeholder` to the three password inputs across both forms (SMB add, SFTP add, SMB edit).
---
## [1.6.24] — 2026-04-25
### Fixed
- **Scheduler UI showed untranslated English strings** — frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in `scheduler.js` instead of using `t()`. All six call sites in `schedLoad()`, `schedRenderJobs()`, and `schedLoadHistory()` now call `t()` with the appropriate key. Three new translation keys added to `en.json`, `da.json`, and `de.json`: `m365_sched_no_jobs`, `m365_sched_running`, `m365_sched_disabled`.
---
## [1.6.23] — 2026-04-21
### Added

CLAUDE.md

@ -16,19 +16,27 @@ python -m pytest tests/ -q
**Split modules:** `scan_engine.py` (M365 + file scan), `sse.py` (SSE broadcast), `checkpoint.py`, `app_config.py` (all persistence), `cpr_detector.py`
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`. **Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` and update the token. Invalid/expired tokens fall back to full scan automatically.
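
The load-fresh-then-merge step mentioned for token saving can be sketched as below. `save_delta_tokens` is a hypothetical helper name, and the temp-file rename is an added assumption rather than something the notes describe. Only the `{**current_tokens, **_new_drive_tokens}` merge order is taken from the text.

```python
import json
from pathlib import Path


def save_delta_tokens(path: Path, new_tokens: dict) -> dict:
    """Merge new tokens into the file's current contents, then write back."""
    try:
        # Re-read at save time so tokens written by another scan are kept.
        current = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        current = {}
    # Later keys win, so fresh tokens override stale ones with the same key.
    merged = {**current, **new_tokens}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename over the original file
    return merged
```

This narrows the window for lost updates but is not a lock; two writers that read at the same instant can still race.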
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code. **Google connector write-back** — `google_connector.py` exposes `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both connectors for in-place Drive redaction. These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403.
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned across all sources. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS` so local/SMB scans stay in sync automatically. `scan_engine.py` uses it to gate M365/SharePoint/Teams file downloads. Do not maintain a separate extension list anywhere else. **SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; the rest of the pipeline is source-agnostic. Auth: `"password"` via OS keychain; `"key"` from `~/.gdprscanner/sftp_keys/<uuid>`. `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Single-file I/O: `_ssh_connect()`, `read_file(remote_path)`, `write_file(remote_path, content)` — do not duplicate SSH setup outside these methods.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines a no-op stub for `_scan_bytes` / `_scan_bytes_timeout` at module level (avoids circular import). `gdpr_scanner.py` overwrites them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. This is intentional — do not try to import them directly in those modules. **Shared content processing** — all three scan engines funnel downloaded bytes through `cpr_detector._scan_bytes(content, filename)`. `scan_engine.py` uses `_scan_bytes_timeout` for PDFs (subprocess + hard timeout). Do not duplicate file-type handling in per-source code.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules. **`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS`. Do not maintain a separate extension list anywhere else.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines no-op stubs at module level (avoids circular import). `gdpr_scanner.py` overwrites them at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. Do not import them directly in those modules.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for SSE constraints, export, preview, scheduler, NER, audit log, viewer, software update, and other route-specific rules.
**Self-update (server only)** — `routes/updates.py` powers **Settings → General → Software update**: git fetch → ff-only merge → conditional `pip install` → `os.execv` restart (same PID; marks fds close-on-exec first so Werkzeug's inheritable listening socket doesn't leak and squat the port). Only enabled for git checkouts (`_supported()` is false for frozen desktop builds). `update_gdpr.sh` is the CLI/cron equivalent. Refused while a scan runs; optional daily auto-update thread (`config.json["auto_update"]`). Restart keeps port 5100 (the port probe uses `SO_REUSEADDR` + a 10s grace). See `routes/CLAUDE.md` → "Software update".
**Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json` **Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. Use `_cp_path(prefix)` — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json` (also holds `claude_api_key`/`claude_ner` and the `auto_update` flag), `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`. Static files are served with `SEND_FILE_MAX_AGE_DEFAULT=0` (ETag revalidation) so the UI is fresh after a self-update — do not re-add long static caching.
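
The per-engine checkpoint naming (`checkpoint_{prefix}.json`) and the glob-based clear can be sketched as follows. This is an illustration under assumptions: the real `_cp_path(prefix)` takes only a prefix, whereas the hypothetical `cp_path` here takes the data directory explicitly so the sketch stays self-contained.

```python
from pathlib import Path


def cp_path(prefix: str, data_dir: Path) -> Path:
    """Build the checkpoint path for one engine prefix (m365, google, file_<id>)."""
    return data_dir / f"checkpoint_{prefix}.json"


def clear_checkpoints(data_dir: Path) -> int:
    """Delete every checkpoint_*.json in data_dir and return how many were removed."""
    removed = 0
    for path in data_dir.glob("checkpoint_*.json"):
        path.unlink()
        removed += 1
    return removed
```

Routing every filename through one helper is what lets the clear step rely on a single glob pattern.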
## Non-obvious files
@ -38,127 +46,70 @@ python -m pytest tests/ -q
| `routes/state.py` | Shared mutable state + scan locks (not a typical Flask state file) |
| `routes/google_scan.py` | Google scan execution lives here, not in `google_connector.py` |
| `routes/viewer.py` | Viewer token + PIN API; also owns brute-force rate-limit state |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI. Also defines `window._copyText` (HTTP-safe clipboard helper reused by `log.js`) |
| `lang/da.json` | Primary language — source of truth is `en.json` |
| `build_gdpr.py` | Desktop app builder; contains embedded `LAUNCHER_CODE` for PyInstaller |
| `routes/updates.py` | Self-update routes + `os.execv` restart with fd-cleanup; git-checkout only |
| `update_gdpr.sh` | CLI/cron self-update (fetch, ff-merge, deps, service restart) |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy (LAN-only, Let's Encrypt DNS-01) |
## Tests
182 tests in `tests/`. No integration tests for live M365/Google connections. 215 tests in `tests/`. No integration tests for live M365/Google connections.
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`. **`tests/test_updates.py`** — 12 tests for the software-update routes (`routes/updates.py`). All git interaction goes through a mocked `_git()`; `_schedule_restart` is patched so no test re-execs the process, and `gdpr_db.log_audit_event` is patched so no test writes the real database. Includes `_mark_fds_cloexec` (the socket-leak guard for the restart).
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen` — all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag. **`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
**`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` builds a single string by concatenating all run texts with no separators between paragraphs. If a CPR value run is immediately followed by text from the next paragraph without a word boundary, `\b` in `CPR_PATTERN` fails and the number is silently missed. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries after concatenation. Do not remove this trailing space — the detection will silently regress. **`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename. Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database.
## Viewer mode (#33) — routes/viewer.py + static/js/viewer.js **Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files (14 flagged, 5 true negatives). `generate_fixtures.py` regenerates the binary files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block.
Read-only access for DPOs and reviewers. Key invariants: **`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` concatenates all run texts with no separators. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries. Do not remove this trailing space — the detection will silently regress.
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- **`window.VIEWER_MODE`** — injected by Jinja2 in `index.html`. `auth.js` reads it at startup; adds `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks — except `delBtn` in the card builder which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. Contains the scope dict from the token (e.g. `{"role": "student"}`). Empty object `{}` means unrestricted. `auth.js` reads it at startup; if `VIEWER_SCOPE.role` is set, it pre-sets `#filterRole` to that value and hides the dropdown so the viewer cannot change it.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `user_role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard. **Column name is `user_role`** — do not use `role`; the DB row has no such key and the filter silently returns nothing.
- **`session["viewer_scope"]`** — set when a token is validated at `/view`. Persists for the browser session alongside `session["viewer_ok"]`. Reads from `session.get("viewer_scope", {})` in `/api/db/flagged` — defaults to `{}` (unrestricted) for PIN-authenticated sessions and legacy tokens without a scope key.
- **`viewer_tokens.json` format** — stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Token dicts now include `"scope": {}`. The old bare-list format and tokens without a `scope` key are handled transparently (`t.get("scope", {})`). Do not write the file as a bare list.
- **`app.secret_key`** — derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- **`GET /api/db/flagged`** — returns `get_session_items()` (last completed scan session, joined with dispositions), filtered by `session["viewer_scope"].role` when set. Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- **Rate-limit state** (`_pin_attempts` dict in `routes/viewer.py`) — in-memory only, resets on server restart. Intentional — a restart clears lockouts without a persistent store.
- **User-scoped tokens (#34)** — scope `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}` filters `GET /api/db/flagged` by `account_id IN (list)`, covering both M365 and GWS items for the same person. `scope.user` is always stored as a list; a legacy single-string value is coerced to `[string]` on read. `scope.display_name` is used for UI only (badge, viewer header) — not for filtering. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400. Share modal: scope-type `<select>` (`#shareScopeType`) reveals either the role dropdown (`#shareScopeRoleWrap`) or a name-search autocomplete (`#shareScopeUserWrap`). Autocomplete reads `S._allUsers`; selecting a row stores `{ emails, display_name }` in module-level `_selectedScopeUser`; editing the input manually clears it (free-text email fallback). In viewer mode, `auth.js` shows `#viewerIdentityBadge` with `VIEWER_SCOPE.display_name`.
- **Token onclick attributes** — Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- **`stClearViewerPin` guard** — validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- **Share link base URL** — `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin` — that produces `127.0.0.1` links useless to remote viewers.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1` — do not change those to `0.0.0.0`.
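
The server-side scope rules in the list above (role filter on `user_role`, user filter on `account_id`, legacy string coercion, no combined scope) can be sketched roughly as follows. `normalise_scope` and `filter_by_scope` are hypothetical names, and rows are modelled as plain dicts rather than real database rows.

```python
def normalise_scope(scope):
    """Return a copy of the scope dict with a legacy string `user` coerced to a list."""
    scope = dict(scope or {})
    user = scope.get("user")
    if isinstance(user, str):
        scope["user"] = [user]
    return scope


def filter_by_scope(rows, scope):
    """Apply a viewer scope to flagged rows; an empty scope means unrestricted."""
    scope = normalise_scope(scope)
    role = scope.get("role")
    users = scope.get("user")
    if role and users:
        # Mirrors the rule that combined role+user scopes are rejected.
        raise ValueError("scope may contain role or user, not both")
    if role:
        # The column is user_role; a key named "role" would match nothing.
        return [r for r in rows if r.get("user_role") == role]
    if users:
        allowed = set(users)
        # File-scan rows carry account_id == "" and therefore never match.
        return [r for r in rows if r.get("account_id") in allowed]
    return list(rows)
```

The client-side dropdown lock is cosmetic by comparison; this filter is the part that actually restricts what a viewer receives.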
## Sources panel resize — static/js/log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height is controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered** — `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
## Scan filter options — scan_engine.py
Both options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan). All options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan).
- **`skip_gps_images` (bool, default `false`)** — When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii` which rechecks `pii_fields` and `author` when GPS is skipped. - **`skip_gps_images` (bool, default `false`)** — images whose only PII is GPS coordinates are not flagged. GPS data still stored in `exif` field if flagged by another signal.
- **`min_cpr_count` (int, default `1`)** — Minimum number of **distinct** CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — `cprs` is a list of dicts from `extract_matches`, not strings. Do not revert to `dict.fromkeys(cprs)` — that raises `TypeError: unhashable type: 'dict'` on every file with CPR hits. Files with faces or EXIF PII are still flagged regardless of CPR count — the threshold gates only CPR-based hits. - **`min_cpr_count` (int, default `1`)** — minimum distinct CPR numbers before flagging. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — do not revert to `dict.fromkeys(cprs)` (raises `TypeError: unhashable type: 'dict'`). Files with faces or EXIF PII are still flagged regardless.
- **File scan** reads both from `source` dict keys (passed directly from the `/api/file_scan/start` payload). **M365 scan** reads both from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate. - **`cpr_only` (bool, default `false`)** — skip items whose only hits are email addresses, phone numbers, faces, or EXIF/GPS metadata.
- **UI:** sidebar controls `#optSkipGps` (toggle) and `#optMinCpr` (number); profile editor controls `#peOptSkipGps` and `#peOptMinCpr`. Both are saved/loaded by `profiles.js`. - **`ocr_lang` (str, default `"dan+eng"`)** — Tesseract language packs. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner` and the PDF-OCR subprocess worker. Cache key already includes `lang`.
- **File scan** reads options from `source` dict keys directly. **M365 scan** reads from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic.
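
The `min_cpr_count` deduplication above is easy to get wrong, so here is a small sketch of it. `distinct_cprs` and `cpr_qualifies` are hypothetical names (the real `_cpr_qualifies` signature is not shown in these notes), and the sample values are placeholders, not valid CPR numbers.

```python
def distinct_cprs(cprs):
    """Order-preserving dedup of match dicts by their 'formatted' value."""
    # dict.fromkeys keeps first-seen order and needs hashable keys, so the
    # string field is extracted instead of passing the dicts themselves.
    return list(dict.fromkeys(c["formatted"] for c in cprs))


def cpr_qualifies(cprs, min_cpr_count=1):
    """True when the number of distinct CPR strings meets the threshold."""
    return len(distinct_cprs(cprs)) >= min_cpr_count


matches = [
    {"formatted": "111111-1111", "page": 1},
    {"formatted": "222222-2222", "page": 1},
    {"formatted": "111111-1111", "page": 3},  # repeat of the first value
]
```

With these three matches there are two distinct values, so a threshold of 2 passes and a threshold of 3 does not. Calling `dict.fromkeys(matches)` directly would fail because dicts are unhashable.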
## M365 connector exceptions — m365_connector.py - **UI:** sidebar `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All saved/loaded by `profiles.js`.
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler in `_scan_user_onedrive` and broadcast as a red `scan_error`. The full-scan path (`_iter_drive_folder_for`) silently swallowed the same 404 via `except Exception: return`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped without an error card. Common causes: no OneDrive licence, service plan disabled, drive never initialised (account never signed in), account suspended.
**Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs elsewhere. Raising `M365DriveNotFound` keeps the error visible to callers that need to act on it.
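
The exception hierarchy and the "raise, do not return a fallback" rule can be sketched as below. The four class names come from the table; `raise_for_graph_status` and `scan_user_onedrive` are simplified hypothetical stand-ins, and the status mapping ignores the detail that 410 only applies on the delta endpoint.

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365PermissionError(M365Error):
    """403 Forbidden."""


class M365DeltaTokenExpired(M365Error):
    """410 Gone on a delta endpoint."""


class M365DriveNotFound(M365Error):
    """404 Not Found."""


def raise_for_graph_status(status_code, url):
    """Map selected HTTP statuses to specific exceptions (simplified)."""
    if status_code == 403:
        raise M365PermissionError(url)
    if status_code == 410:
        raise M365DeltaTokenExpired(url)
    if status_code == 404:
        # Raise instead of returning a fallback so genuine path bugs stay visible.
        raise M365DriveNotFound(url)


def scan_user_onedrive(fetch, broadcast, user):
    """Skip users without a provisioned drive; let every other error propagate."""
    try:
        return fetch(user)
    except M365DriveNotFound:
        broadcast("scan_phase", f"{user}: OneDrive not provisioned, skipped")
        return []
```

Catching only `M365DriveNotFound` at the call site keeps the skip behaviour local while permission and token errors still reach their own handlers.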
## Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve: - **Email body stripped at collection time** — `_scan_user_email` stores body as `msg["_precomputed_body"]`, deletes `msg["body"]` and `msg["bodyPreview"]`. Processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to `$select` without also stripping it.
- **`body_excerpt`** — 500-char plain-text preview stored per flagged email; flows into `flagged_items`, checkpoint JSON, and DB. Do not remove before broadcasting — needed for preview on checkpoint resume.
- **`work_items` → `deque` before processing** — drained via `popleft()` so each item's memory is released immediately. Do not convert back to a list.
- **`del content` / `del body_text`** — raw bytes and body text deleted immediately after use. Both hit and no-hit paths have explicit deletes.
- **PDF OCR rendered page-by-page** — `convert_from_path(first_page=N, last_page=N)` inside the loop; only one page image in memory at a time. Do NOT revert to a bulk call — triggers OOM on large PDFs.
- **OCR memory guard** — `_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; skips if < 300 MB free.
- **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here. ## Scan history browser — gdpr_db.py
- **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
- **PDF OCR rendered page-by-page** — `document_scanner.scan_pdf` (and the redact paths) call `convert_from_path(first_page=N, last_page=N)` inside the loop, so only one page image is in memory at a time. Do NOT move back to a bulk `convert_from_path()` call — that allocates all pages at once and triggers OOM kills on large PDFs.
- **OCR memory guard** — `_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render. Pages that would exceed this threshold are skipped with a printed warning and recorded as `"skipped"` in `page_methods`.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; scan skips the file if < 300 MB free.
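
The drain-by-`popleft()` rule in the memory notes above can be sketched like this. `drain` and its `process` callback are hypothetical; only `deque`, `popleft()` and the `_precomputed_body` key come from the text. The `psutil` guards are left out so the sketch has no third-party dependency.

```python
from collections import deque


def drain(work_items, process):
    """Process items one at a time, releasing each as soon as it is handled."""
    # Rebind to a deque; the caller should not keep its own list alive,
    # otherwise the items stay referenced and nothing is freed early.
    work_items = deque(work_items)
    flagged = 0
    while work_items:
        meta = work_items.popleft()  # the deque no longer references this item
        body_text = meta.pop("_precomputed_body", "")
        if process(meta, body_text):
            flagged += 1
        del body_text  # drop the large string before the next iteration
    return flagged
```

Iterating with `enumerate()` over a list would keep every item referenced until the loop ends, which is the behaviour the notes warn against.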
## Export — routes/export.py - **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings (e.g. `{"gmail", "gdrive", "email"}`) for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items. - **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
- **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value. - **`account_name` (display name) is persisted** (migration 11) so DB-loaded cards show the user badge. Legacy rows predating it have `account_name=''` — the frontend `_accountPill` resolves a fallback and still shows the group badge from `user_role`. `save_item` must keep writing `card["account_name"]` (both M365 and Google cards carry it).
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing. - **Scans must be finalised or their items are invisible** — `get_session_items`, `get_open_items`, and `latest_scan_id` all filter on `finished_at IS NOT NULL`. The file scan finalises in a `finally`; M365 (`run_scan`) and Google (`_run_google_scan`) `return` early on abort, so each now calls `finish_scan` before that abort-return. A process kill (deploy/OOM/crash) mid-scan still strands a scan → **`finalize_orphan_scans()`** runs once at server startup (`gdpr_scanner.py` `__main__`, before the scheduler) and finalises every `finished_at IS NULL` scan (safe because nothing is scanning at boot). Do not add a scan-results query that ignores `finished_at` instead of fixing finalisation.
- **Role-filtered exports**`_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them. - **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
- **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py - See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
Allows reviewing results from any past scan session without running a new scan. Key invariants:
- **`S._historyRefScanId`** — `null` = live/SSE mode; positive int = viewing a past session (the highest `scan_id` in that session's 300 s window). Set by `loadHistorySession()`; cleared to `null` by `exitHistoryMode()`.
- **`GET /api/db/sessions`** (`routes/database.py`) — calls `_get_db().get_sessions()`. Returns newest-first list; each entry has `ref_scan_id`, `started_at`, `finished_at`, `sources` (list of source-key strings), `flagged_count`, `total_scanned`, `delta` (bool). No auth restriction — viewer tokens share this endpoint.
- **`get_sessions(limit=50, window_seconds=300)`** (`gdpr_db.py`) — groups `scans` rows by 300 s window (same window logic as `get_session_items`). Groups are built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change the window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** (`gdpr_db.py`) — when `ref_scan_id` is given, anchors the 300 s window to that scan's `started_at`. Falls back to latest scan when `ref_scan_id=None`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300` — do not revert to a one-sided lower bound or historical sessions will include all newer scans.
- **`GET /api/db/flagged?ref=N`** — passes `ref_scan_id` to `get_session_items`; viewer scope enforcement (role/user filters) still applies. Used by both history mode and the normal post-scan viewer path.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation** — `_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion.
- **Auto-load on page load** — `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
- **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
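The symmetric-window invariant for `get_session_items` is easiest to see next to the one-sided bound it warns against. The sketch below is illustrative only: it assumes `started_at` is stored as epoch seconds and uses a hypothetical helper name, `session_scan_ids`, rather than the real method.

```python
import sqlite3

WINDOW_SECONDS = 300  # window size stated in the notes above


def session_scan_ids(conn, ref_scan_id, symmetric=True):
    """Return scan_ids belonging to the session anchored at ref_scan_id."""
    (ref_start,) = conn.execute(
        "SELECT started_at FROM scans WHERE scan_id = ?", (ref_scan_id,)
    ).fetchone()
    lower = ref_start - WINDOW_SECONDS
    if symmetric:
        sql = ("SELECT scan_id FROM scans "
               "WHERE started_at BETWEEN ? AND ? ORDER BY scan_id")
        params = (lower, ref_start + WINDOW_SECONDS)
    else:
        # One-sided lower bound: every newer scan leaks into the session.
        sql = "SELECT scan_id FROM scans WHERE started_at >= ? ORDER BY scan_id"
        params = (lower,)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (scan_id INTEGER PRIMARY KEY, started_at REAL)")
conn.executemany("INSERT INTO scans VALUES (?, ?)",
                 [(1, 1000.0), (2, 1060.0), (3, 5000.0)])

print(session_scan_ids(conn, 2))                   # [1, 2]
print(session_scan_ids(conn, 2, symmetric=False))  # [1, 2, 3]
```

Scans 1 and 2 start 60 s apart and form one session; scan 3 starts much later and only appears when the upper bound is missing.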
## SSE teardown — static/js/scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
- **Rule:** close `S.es` (and reset `S._userStartedScan`) only inside the branch where *all* concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — `m365_connector._post()` returns `r.json() if r.content else {}`. The Graph `sendMail` endpoint returns HTTP 202 with **no body** on success; calling `r.json()` on an empty response raises `JSONDecodeError`. Do not change this back to an unconditional `r.json()` — it would falsely report every successful email send as an error.
- **Graph preferred over SMTP** — `smtp_test` and `send_report` both try `_send_email_graph()` first when `state.connector` is authenticated. Only falls back to SMTP if Graph raises. If Graph fails and no SMTP host is saved, the Graph exception is surfaced directly (not swallowed by the "No SMTP host" message).
- **Auto-email after manual scan** — `_maybe_send_auto_email()` in `routes/scan.py` is called from the `_run()` thread immediately after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")` from `smtp.json`; no-ops if the flag is false, no flagged items, or no recipients. Same Graph-first → SMTP-fallback pattern as the scheduler. Toggle: **Settings → Email report → Email report after manual scan** (`#st-smtpAutoEmail`), saved by `stSmtpSave()` in `scheduler.js`.
- **Gmail vs Google Workspace detection** — auth error handlers check whether the SMTP username ends in `@gmail.com` / `@googlemail.com`. If not, the account is treated as Google Workspace (custom domain) and the error message points to the Workspace admin console rather than the user's personal security settings.
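The empty-body rule for `_post()` can be shown without any network access. In the sketch below `FakeResponse` is a hypothetical stand-in for a `requests.Response`, and `parse_body` is an illustrative name for the one-line guard quoted above, not a function from `m365_connector.py`.

```python
import json


class FakeResponse:
    """Stand-in exposing only the two members the guard relies on."""

    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content

    def json(self):
        # Like a real response, decoding an empty body fails.
        return json.loads(self.content)


def parse_body(r):
    """Return the decoded JSON body, or {} when the body is empty."""
    return r.json() if r.content else {}


send_mail_ok = FakeResponse(202)                    # sendMail success: no body
create_ok = FakeResponse(200, b'{"id": "abc"}')     # ordinary JSON reply

print(parse_body(send_mail_ok))  # {}
print(parse_body(create_ok))     # {'id': 'abc'}
```

Calling `send_mail_ok.json()` directly raises `json.JSONDecodeError`, which is the failure mode the guard exists to avoid.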
## Global gotchas
- **Pattern matching in Python** — when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Use `in` check first and print if not found.
- **`__getattr__` on modules** — only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- **`JSON.stringify` inside `onclick="…"` attributes** — produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler. When the object is embedded as an `onclick` payload, also `.replace(/"/g,'&quot;')` it (matches the delete/redact button pattern) so a `"` in a filename can't break out.
- **Escape scan-derived strings before `innerHTML`** — file names, account/display names, folders, and source labels come from scanned content and may contain markup. Pass them through `esc()` (in `results.js`) before embedding in `innerHTML` or `title=`/`alt=` attributes. Server-side SVG/HTML built from request params (e.g. `_placeholder_svg` for `/api/thumb`) must use `_html_esc`. Skipping either re-introduces stored/reflected XSS.
- **Secrets at rest use the machine-keyed Fernet** — the SMTP password and Claude API key are encrypted via `app_config._encrypt_password` / `_decrypt_password`. New secret-bearing config fields must follow the same pattern; read them through a decrypting accessor (e.g. `get_claude_api_key()`), never `_load_config().get(...)` directly.
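The module-level `__getattr__` gotcha can be reproduced in a few lines. The sketch builds a throwaway module in memory; the module name `demo_state` and the attribute `connector` are made up for the illustration.

```python
import types

SOURCE = '''
def __getattr__(name):
    # PEP 562 hook: consulted only for attribute access on the module object.
    if name == "connector":
        return "lazy-connector"
    raise AttributeError(name)


def read_via_bare_name():
    # Plain global lookup: checks module globals, then builtins.
    # The __getattr__ hook above is never consulted here.
    return connector
'''

demo = types.ModuleType("demo_state")  # hypothetical module
exec(SOURCE, demo.__dict__)

via_attribute = demo.connector  # resolved through the hook

try:
    demo.read_via_bare_name()
    bare_name_failed = False
except NameError:
    bare_name_failed = True

print(via_attribute, bare_name_failed)
```

Access from outside succeeds, while the bare name inside the function raises `NameError`, which is why the rule above says to import names directly.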
## Directory-scoped rules
- `routes/CLAUDE.md` — SSE constraints, M365 exceptions, export, preview, audit log, email, scheduler, Claude NER, viewer route, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar, SSE teardown, history browser, CPR cross-referencing, sources panel resize, viewer JS, JS gotchas
- `templates/CLAUDE.md` — CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md` — i18n conventions


@ -102,7 +102,7 @@ tests/ pytest test suite — 112 tests, all should pass.
**Settings stats show 0 (Scanned / Flagged / Scans)**
`routes/database.py` → `db_stats()` — queries `flagged_items` and `scans` directly
→ Stats populate from existing DB on app start — no re-scan needed
→ If still 0 after a completed scan: check `~/.gdprscanner/scanner.db` exists and is not empty
**File scan results not persisting to DB**
`scan_engine.py` → `run_file_scan()` — must call `_db.begin_scan()` not `start_scan()`

OSS_LANDSCAPE.md Normal file

@ -0,0 +1,67 @@
# Open Source Landscape — GDPR / PII Document Scanners
An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
---
## Summary
No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
---
## Existing open source tools
### [Microsoft Presidio](https://github.com/microsoft/presidio)
A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
### [Octopii](https://github.com/redhuntlabs/Octopii)
Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
### "GDPR scanners" on GitHub
Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
### CPR libraries
Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
---
## Commercial products that do cover it
| Product | M365 | GWS | CPR | Article 30 | Open source |
|---|---|---|---|---|---|
| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
---
## Capability comparison
| Capability | GDPRScanner | Presidio | Octopii | Commercial |
|---|---|---|---|---|
| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ | ✅ | ✅ | ❌ |
---
## What makes GDPRScanner unique
The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
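To make "modulus-11 validation, date sanity checks" concrete, here is a minimal sketch of the classic CPR check. It is not the project's `is_valid_cpr`; the function names are invented for the illustration, and the weights are the standard published ones. Note the assumption that every number must pass the modulus check: some CPR numbers issued since 2007 are understood not to satisfy it, so a real scanner may need a looser rule.

```python
import re
from datetime import date

# Standard modulus-11 weights for the ten CPR digits (DDMMYY-SSSS).
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def has_valid_date(ddmmyy):
    """True if DDMMYY is a real calendar date in the 1900s or 2000s."""
    day, month, yy = int(ddmmyy[:2]), int(ddmmyy[2:4]), int(ddmmyy[4:6])
    for century in (1900, 2000):
        try:
            date(century + yy, month, day)
            return True
        except ValueError:
            continue
    return False


def passes_modulus_11(digits):
    """True if the weighted digit sum is divisible by 11."""
    return sum(int(d) * w for d, w in zip(digits, WEIGHTS)) % 11 == 0


def looks_like_cpr(text):
    """Format check, then date sanity check, then modulus-11."""
    m = re.fullmatch(r"(\d{6})-?(\d{4})", text.strip())
    if not m:
        return False
    return has_valid_date(m.group(1)) and passes_modulus_11(m.group(1) + m.group(2))


print(looks_like_cpr("070761-4285"))  # True: weighted sum is 154 = 11 * 14
print(looks_like_cpr("070761-4286"))  # False: weighted sum is 155
print(looks_like_cpr("310261-4285"))  # False: 31 February does not exist
```

Running the date check before the modulus check rejects most random ten-digit strings cheaply, which is one way such validation reduces false positives.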


@ -1,8 +1,8 @@
# GDPRScanner
Scans Microsoft 365, Google Workspace, local/network file systems, and SFTP servers
for Danish CPR numbers and personal data (PII). Produces GDPR compliance reports and
supports Article 30 record-keeping obligations.
---
@ -32,7 +32,7 @@ an IDE with intelligent completion. The result is the author's work.
- **Folder path in results** — each email result shows its full folder path (e.g. `Inbox / Ansøgninger pædagog SFO`) in the card and in Excel export
- **Delete items** — flagged results can be deleted directly from the UI, individually or in bulk
- **CPR false-positive reduction** — strict CPR validation
- **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, SMB/network shares, and SFTP. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
- **Progressive streaming** — results stream card-by-card via Server-Sent Events as the scan runs
- **Token auto-refresh** — expired tokens are detected and silently refreshed mid-scan without interrupting the UI
- **Incremental / resumable scans** — interrupted scans save a checkpoint; the next run resumes from where it stopped rather than starting over
@ -46,11 +46,13 @@ an IDE with intelligent completion. The result is the author's work.
- **Account name on cards** — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable
- **Retention policy enforcement** — flag items older than a configurable retention period with an Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via `--retention-years`
- **Data subject lookup** — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext
- **CPR cross-referencing** — clicking any flagged card with CPR hits shows a "Related documents" section listing other items from the same scan session that share at least one CPR number, ordered by number of shared CPRs. Clicking any entry opens it in the preview panel. Works in live mode and history mode. Powered by a SQL self-join on the `cpr_index` table — no new data collection required
- **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel; **bulk disposition tagging** lets you select multiple cards with checkboxes and apply a disposition to all of them at once. A stats bar above the grid shows total · unreviewed · retain · delete counts and the percentage reviewed
- **Interface PIN** — optional session-level PIN that gates the main scanner interface (`/`). Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; unauthenticated visitors are redirected to `/login`. The `/view` viewer route and all viewer API endpoints are exempt — reviewers are unaffected. Salted SHA-256 hash; brute-force protection (5 attempts / 5 min per IP)
- **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings. Tokens can be **role-scoped** (Ansatte / Elever) so a recipient only sees items for their group, or **user-scoped** so an individual employee only sees their own flagged files (supports dual M365 + Google Workspace identity)
- **Article 30 report** — one-click export of a structured Word document (`.docx`) satisfying the GDPR Article 30 register of processing activities obligation
- **SQLite results database** — scan results, CPR index, PII breakdown, disposition decisions, and scan history are persisted to `~/.gdprscanner/scanner.db` alongside the JSON cache, enabling cross-scan queries and trend tracking
- **Software updates from the UI** — check for and install new versions from **Settings → General → Software update**, or enable automatic daily updates; the app restarts itself in place (see [Software updates](#software-updates) below)
- **Built-in user manual** — click the **?** button in the top bar to open the manual in a dedicated window. Available in Danish and English. Printable via the browser's print function. Served from `MANUAL-DA.md` / `MANUAL-EN.md` at `/manual?lang=da|en` — always in sync with the installed version, no internet required. In the packaged desktop app the manual opens as a native pywebview window; in the browser it opens as a popup.
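The "Related documents" lookup described in the list above (a self-join on `cpr_index`, ordered by the number of shared CPRs) might look roughly like the sketch below. The two-column `cpr_index` layout, the unsalted SHA-256 helper, and the sample CPR strings are assumptions for illustration, and the restriction to a single scan session is left out for brevity.

```python
import hashlib
import sqlite3


def cpr_hash(cpr):
    """Hash a CPR so only the digest is ever stored (unsalted here by assumption)."""
    return hashlib.sha256(cpr.encode("utf-8")).hexdigest()


RELATED_SQL = """
SELECT other.item_id, COUNT(DISTINCT other.cpr_hash) AS shared
FROM cpr_index AS mine
JOIN cpr_index AS other
  ON other.cpr_hash = mine.cpr_hash
 AND other.item_id <> mine.item_id
WHERE mine.item_id = ?
GROUP BY other.item_id
ORDER BY shared DESC, other.item_id
"""


def related_items(conn, item_id):
    """Return (item_id, shared_cpr_count) pairs, most overlap first."""
    return conn.execute(RELATED_SQL, (item_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT, "
             "PRIMARY KEY (item_id, cpr_hash))")
# Made-up ten-digit strings standing in for CPR numbers.
first, second, third = "0101010001", "0202020002", "0303030003"
conn.executemany("INSERT INTO cpr_index VALUES (?, ?)", [
    ("A", cpr_hash(first)), ("A", cpr_hash(second)),
    ("B", cpr_hash(first)), ("B", cpr_hash(second)),
    ("C", cpr_hash(second)),
    ("D", cpr_hash(third)),
])

print(related_items(conn, "A"))  # [('B', 2), ('C', 1)]
```

Because the join runs on digests that are already stored, the lookup needs no plaintext CPR and no extra scan pass.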
---
@ -79,7 +81,7 @@ The sidebar sources panel lists all configured scan sources. Click **Sources** t
**Google Workspace tab** — Two authentication modes: **Workspace** (service account with domain-wide delegation — scans all users) and **Personal account** (OAuth 2.0 device-code flow — scans the signed-in account only). Once connected, per-source toggles control whether Gmail and/or Google Drive appear in the sidebar panel and are included in scans. See [GOOGLE_SETUP.md](docs/setup/GOOGLE_SETUP.md) for setup instructions.
**File sources tab** — Add local folder paths, SMB/CIFS network shares, or SFTP servers. A pill selector (Local / Network / SFTP) switches the form fields. SFTP sources require host, port, username, remote path, and auth type (password or private key). SSH private keys are uploaded via the UI, validated with paramiko, and stored in `~/.gdprscanner/sftp_keys/` with `600` permissions; passwords and passphrases are stored in the OS keychain. Each saved source appears as a checkbox in the sidebar panel. Use the **Edit** button on each row to update credentials or rename a source without deleting it.
**Skipped automatically:** `.recycle`, `.sync`, `.btsync`, `.trash`, `.git`, `node_modules`, `System Volume Information`, and other system/sync folders. Hidden directories (`.` prefix) are skipped too.
@ -207,6 +209,11 @@ The **⬇ Excel** button exports all current results to a `.xlsx` file (`m365_sc
| OneDrive | Flagged OneDrive files |
| SharePoint | Flagged SharePoint files |
| Teams | Flagged Teams files |
| Gmail | Flagged Gmail messages |
| Google Drive | Flagged Google Drive files |
| Local | Flagged local-folder files |
| Network | Flagged SMB/NAS files |
| SFTP | Flagged SFTP server files |
In macOS app builds, the export opens a native Save dialog instead of a browser download.
@ -221,7 +228,7 @@ Configure email delivery in **Settings → Email report**. Click **Save** to sto
| SMTP host | e.g. `smtp.office365.com`, `smtp.gmail.com` |
| Port | `587` for STARTTLS (default), `465` for SMTPS/SSL |
| Username | SMTP login — usually your sender email address |
| Password | Saved to `~/.gdprscanner/smtp.json` (permissions 600). Encrypted at rest using Fernet — key in `~/.gdprscanner/machine_id` (chmod 0o600, never share) |
| Graph API | When connected to M365, email is sent via `/me/sendMail` (delegated) or `/users/{sender}/sendMail` (app mode) — no SMTP password needed. Requires `Mail.Send` Graph permission with admin consent. |
| From address | Sender address (defaults to username if blank) |
| STARTTLS | Enable STARTTLS on port 587 (recommended) |
@ -267,7 +274,7 @@ Delta scan uses the Microsoft Graph `/delta` API (M365) and the Google Drive **C
1. Run one **full scan** first (Delta checkbox off) — this establishes baseline delta tokens
2. Tick **Δ Delta scan** and run again — only items added, modified, or deleted since the previous scan are fetched and CPR-scanned
3. Delta tokens are saved automatically to `~/.gdprscanner/delta.json` after each successful scan
4. To force a full rescan, click **Clear tokens** under the checkbox (or delete the file)
Delta tokens are stored **per-source**:
@ -492,6 +499,49 @@ python gdpr_scanner.py --import-db ~/compliance/gdpr_export_2026.zip --import-mo
---
### Software updates
When the app runs from a git checkout (the normal server install), it can update itself. The **Settings → General → Software update** group offers:
- **Check for updates** — fetches the upstream repository and shows either "You are running the latest version" or the list of pending commits
- **Install update** — fast-forwards the checkout, reinstalls dependencies if `requirements.txt` changed, and restarts the app in place; the browser waits for the server to come back and reloads automatically
- **Install updates automatically** — optional toggle; a background thread checks once a day and installs unattended
Safety guarantees:
- Updating is **refused while any scan is running** — manual attempts get a clear message, and the auto-updater simply retries on its next hourly tick, so a scheduled scan is never killed mid-run
- Local edits on the server are **auto-stashed** (kept, never discarded) before the merge; the merge is fast-forward-only, so a diverged checkout stops the update instead of creating a merge mess
- Every applied update is recorded in the **compliance audit log** (`app_update`, old → new commit)
- The restart re-execs the process with the same PID, so it works identically under systemd and when launched via `start_gdpr.sh`
The Settings group is hidden in the packaged desktop app (no git checkout to update) — desktop users update by installing a new build.
**CLI / cron equivalent** — `update_gdpr.sh` performs the same update from a shell:
```bash
./update_gdpr.sh # update if upstream has new commits, restart service
./update_gdpr.sh --check # report pending commits, change nothing
```
It restarts a `gdprscanner.service` systemd unit if one exists (override the name with `GDPR_SERVICE=…`) and is quiet when already up to date, so it is safe to run from cron:
```bash
# /etc/cron.d/gdprscanner-update — nightly at 04:00
0 4 * * * root /opt/gdprscanner/update_gdpr.sh >> /var/log/gdpr_update.log 2>&1
```
API endpoints: `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`.
---
### HTTPS / reverse proxy
The scanner itself serves plain HTTP. For encrypted transport on a LAN — recommended, since scan results contain CPR numbers — put it behind a TLS-terminating reverse proxy and bind the app to loopback (`--host 127.0.0.1`) so the proxy is the only way in. Share links automatically follow the HTTPS hostname, and the browser Clipboard API (Copy buttons) works natively in a secure context.
See [ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md) for a complete walkthrough: Zoraxy, Let's Encrypt via DNS-01 challenge (required when the hostname resolves to a private IP), proxy rule, and the scanner-specific verification steps.
---
### Article 30 report
The **Art.30** button in the filter bar generates a GDPR **Article 30 Register of Processing Activities** as a Word document (`.docx`).
@ -601,15 +651,18 @@ pip install pytest
pytest tests/ pytest tests/
``` ```
**212 tests across 8 modules — all expected to pass.**

| Module | Tests | Covers |
|---|---|---|
| `tests/test_document_scanner.py` | 37 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary crash safety |
| `tests/test_app_config.py` | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption |
| `tests/test_checkpoint.py` | 18 | Checkpoint key stability, save/load/clear, wrong-key isolation, delta token round-trip |
| `tests/test_db.py` | 23 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle |
| `tests/test_routes.py` | 16 | Core route behaviour — scan status/start/stop, DB stats, dispositions, Excel and Article 30 export |
| `tests/test_route_integration.py` | 54 | Viewer token CRUD, role/user scope enforcement, bulk disposition isolation, viewer PIN, interface PIN gate, scan lock release on failure, session history ordering, profile routes CRUD and rename |
| `tests/test_google_scan.py` | 19 | Google scan routes (users/start/cancel) and `_run_google_scan` engine with mocked connector, checkpoints, and DB |
| `tests/test_updates.py` | 11 | Software-update routes — check/apply with mocked git, scan-running refusal, dirty-tree auto-stash, requirements reinstall, settings round-trip |

Each unit-test module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.
@ -654,7 +707,7 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| File | Description | | File | Description |
|---|---| |---|---|
| `gdpr_scanner.py` | Flask entry point — scan orchestration, SSE route (`/api/scan/stream`), root route | | `gdpr_scanner.py` | Flask entry point — scan orchestration, SSE route (`/api/scan/stream`), root route |
| `scan_engine.py` | M365 and local/SMB scan logic — `run_scan()`, `run_file_scan()` | | `scan_engine.py` | M365 and local/SMB/SFTP scan logic — `run_scan()`, `run_file_scan()` |
| `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption | | `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
| `sse.py` | SSE broadcast queue and `_current_scan_id` | | `sse.py` | SSE broadcast queue and `_current_scan_id` |
| `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` | | `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
@ -664,6 +717,7 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods | | `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
| `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK | | `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. | | `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. |
| `sftp_connector.py` | SFTP file iterator — `SFTPScanner.iter_files()` yields the same `(path, bytes, metadata)` tuple as `FileScanner`. Uses paramiko (`AutoAddPolicy`); supports password auth and private-key auth (RSA / Ed25519 / ECDSA / DSS). Passwords and key passphrases are stored in the OS keychain; key files live in `~/.gdprscanner/sftp_keys/`. Gracefully degrades when paramiko is not installed (`SFTP_OK` flag). |
| `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine | | `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
| `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. | | `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
| `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes | | `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
@ -685,10 +739,13 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| `routes/export.py` | `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk` | | `routes/export.py` | `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk` |
| `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management, role-scoped and user-scoped tokens | | `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management, role-scoped and user-scoped tokens |
| `routes/app_routes.py` | `/api/about`, `/api/langs`, `/api/lang`, `/manual` | | `routes/app_routes.py` | `/api/about`, `/api/langs`, `/api/lang`, `/manual` |
| `routes/updates.py` | `/api/update/*` — software update check/apply, auto-update background thread |
| `update_gdpr.sh` | CLI/cron self-update script — fetch, fast-forward merge, dependency reinstall, service restart |
| `docs/manuals/MANUAL-EN.md` | End-user manual in English (15 sections) — served at `/manual?lang=en` | | `docs/manuals/MANUAL-EN.md` | End-user manual in English (15 sections) — served at `/manual?lang=en` |
| `docs/manuals/MANUAL-DA.md` | End-user manual in Danish (15 sections) — served at `/manual?lang=da` | | `docs/manuals/MANUAL-DA.md` | End-user manual in Danish (15 sections) — served at `/manual?lang=da` |
| `docs/setup/M365_SETUP.md` | Step-by-step Microsoft 365 setup guide | | `docs/setup/M365_SETUP.md` | Step-by-step Microsoft 365 setup guide |
| `docs/setup/GOOGLE_SETUP.md` | Step-by-step Google Workspace setup guide | | `docs/setup/GOOGLE_SETUP.md` | Step-by-step Google Workspace setup guide |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy — LAN-only deployment with Let's Encrypt DNS-01 |
| `build_gdpr.py` | PyInstaller build script — generates `m365_launcher.py`, packages desktop app | | `build_gdpr.py` | PyInstaller build script — generates `m365_launcher.py`, packages desktop app |
| `lang/en.json` | English translations (source of truth) | | `lang/en.json` | English translations (source of truth) |
| `lang/da.json` | Danish translations (primary language) | | `lang/da.json` | Danish translations (primary language) |


@ -54,10 +54,10 @@ Out of scope:
## Data Handling Notes for Security Researchers ## Data Handling Notes for Security Researchers
- CPR numbers are stored in the SQLite database as **SHA-256 hashes only** — never in plaintext - CPR numbers are stored in the SQLite database as **SHA-256 hashes only** — never in plaintext
- SMTP passwords are stored in `~/.gdpr_scanner_smtp.json` with chmod 600 - SMTP passwords are stored in `~/.gdprscanner/smtp.json` with chmod 600
- Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdpr_scanner_config.json` - Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdprscanner/token.json`
- Scan results are stored locally in `~/.gdpr_scanner.db` — never transmitted externally - Scan results are stored locally in `~/.gdprscanner/scanner.db` — never transmitted externally
- The web UI binds to `127.0.0.1` by default — it is not designed to be exposed to the internet - The web UI binds to `0.0.0.0` by default so reviewers on the LAN can reach it — it is not designed to be exposed to the internet. For encrypted transport, put it behind a TLS-terminating reverse proxy and bind the app to loopback with `--host 127.0.0.1` — see [docs/setup/ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md)
--- ---


@ -350,3 +350,31 @@ Write redacted copies of flagged files with CPR numbers replaced by `XXX XXXX-XX
### Email notification on scan completion (non-scheduled) ✅ ### Email notification on scan completion (non-scheduled) ✅
Auto-email now fires on manual scans when **Email report after manual scan** is enabled in Settings → Email report. Toggle stored as `auto_email_manual` in `smtp.json`. Implemented in `routes/scan.py`: `_maybe_send_auto_email()` is called from the `_run()` thread after `run_scan()` returns. Same Graph-first → SMTP-fallback pattern as scheduled scans. Only fires when there are flagged items and at least one recipient is configured. Auto-email now fires on manual scans when **Email report after manual scan** is enabled in Settings → Email report. Toggle stored as `auto_email_manual` in `smtp.json`. Implemented in `routes/scan.py`: `_maybe_send_auto_email()` is called from the `_run()` thread after `run_scan()` returns. Same Graph-first → SMTP-fallback pattern as scheduled scans. Only fires when there are flagged items and at least one recipient is configured.
### Keyword / name search across flagged document content
Allow a DPO to type a name (or any keyword) into a search box and find every flagged document whose extracted text contains that string. Complements CPR cross-referencing (see above) for cases where the person's CPR is not present but their name is.
**Implementation outline:**
1. **Store text snippets at scan time**: `_scan_bytes` already extracts plain text for CPR matching; store a 24 KB prefix of that text per item in a new `text_snippet TEXT` column on `flagged_items`, or in a separate `content_index` table. Truncation avoids bloating the DB; the snippet covers most short documents in full.
2. **SQLite FTS5 virtual table**: `CREATE VIRTUAL TABLE content_fts USING fts5(item_id UNINDEXED, snippet)`. Populated at scan time alongside `cpr_index`. FTS5 is bundled with SQLite ≥ 3.9 (macOS ships ≥ 3.37), so there is no external dependency.
3. **`GET /api/db/search?q=<term>&ref=N`**: queries `content_fts` with `MATCH ?`, joins back to `flagged_items` within the session window, returns matching items. SQLite FTS5 supports phrase queries, prefix wildcards (`name*`), and Boolean operators automatically.
4. **Search bar in the filter strip**: a plain `<input type="search">` next to the existing role/source filters. Debounced 300 ms. Results replace the grid (with a "Clear search" pill to return to full view). No new UI paradigm needed.
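
The FTS5 steps in the outline above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's implementation: the `content_fts` definition is taken from the outline, while the sample rows, the in-memory database, and the `search()` helper are hypothetical. It also assumes the Python build links an SQLite compiled with FTS5.

```python
import sqlite3

# In-memory DB for illustration only; the real scanner DB lives on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE content_fts USING fts5(item_id UNINDEXED, snippet)"
)

# Hypothetical snippets that would be stored at scan time.
conn.executemany(
    "INSERT INTO content_fts (item_id, snippet) VALUES (?, ?)",
    [
        ("item-1", "Meeting notes for Anna Jensen, class 7B"),
        ("item-2", "Invoice for cafeteria supplies"),
        ("item-3", "Absence report: Anna Jensen and Peter Holm"),
    ],
)


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return item ids whose snippet matches an FTS5 query string."""
    cur = conn.execute(
        "SELECT item_id FROM content_fts WHERE content_fts MATCH ? "
        "ORDER BY item_id",
        (term,),
    )
    return [row[0] for row in cur]


phrase_hits = search(conn, '"anna jensen"')    # phrase query
prefix_hits = search(conn, "cafe*")            # prefix wildcard
boolean_hits = search(conn, "anna NOT peter")  # Boolean operator
```

Passing the user's term as a bound parameter keeps SQL injection out of the picture, but the term is still parsed as FTS5 query syntax, so a real endpoint would probably need to quote or validate it first.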
**Why deferred:** requires a DB migration + storing text at scan time (increases DB size). The CPR cross-reference (already implemented) covers the most common "find all data about this person" use case without storing any raw text. Implement if a school requests free-text search.
**Size:** Medium · **Priority:** Low
---
### Phase 2 PII: name-based roster lookup
Flag documents containing the full names of students or staff — even when no CPR is present. Implementation outline:
1. **Roster source** — pull names from the M365 directory (`/users?$select=displayName`), the GWS directory (`admin.list_users`), or a user-uploaded CSV. Store as a flat list of `(first, last)` pairs, minimum length threshold (~5 chars per part) to suppress common first-name noise.
2. **Multi-pattern search** — build an Aho-Corasick automaton from the roster at scan start (`pyahocorasick`, ~50 KB, optional dep). Run each extracted text through the automaton; a hit qualifies only when the match falls on a word boundary and both first + last name appear within a configurable window (e.g. 100 characters apart).
3. **Integration** — same `_find_emails_phones`-style helper in `cpr_detector.py`; roster loaded once per scan run and passed as a parameter. New `name_count` column in `flagged_items` (DB migration). New `name-badge` in the UI. Opt-in profile toggle like `scan_emails`.
4. **NER fallback** — optionally run `spaCy` `da_core_news_sm` (~200 MB) when no roster is available to detect PERSON entities. Much higher false-positive rate; only useful as a discovery tool.
**Why deferred:** requires a roster-management UI (upload CSV, choose directory source, refresh cadence), and false-positive rate depends heavily on roster quality. Name-only matches also carry lower legal weight than CPR hits. Implement after a school explicitly requests it.
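
The word-boundary and proximity-window rule from step 2 can be sketched without any optional dependency. Note the swap: this version uses plain `re` scans per roster entry instead of an Aho-Corasick automaton, so it shows the matching rule only and would be slow on a large roster. The function name, parameters, and defaults are hypothetical.

```python
import re


def find_roster_names(text, roster, window=100, min_len=5):
    """Return (first, last) pairs whose two parts both occur in `text`
    on word boundaries, within `window` characters of each other.

    Name parts shorter than `min_len` are skipped to suppress noise
    from very common short names.
    """
    hits = []
    for first, last in roster:
        if len(first) < min_len or len(last) < min_len:
            continue
        first_pos = [
            m.start()
            for m in re.finditer(rf"\b{re.escape(first)}\b", text, re.IGNORECASE)
        ]
        last_pos = [
            m.start()
            for m in re.finditer(rf"\b{re.escape(last)}\b", text, re.IGNORECASE)
        ]
        # Qualify only when some occurrence of each part is close enough.
        if any(abs(f - l) <= window for f in first_pos for l in last_pos):
            hits.append((first, last))
    return hits


sample = "Pupil: Frederik was absent today. Please contact the Andersen family."
roster = [("Frederik", "Andersen"), ("Anna", "Jensen"), ("Mathilde", "Nielsen")]
wide = find_roster_names(sample, roster)
narrow = find_roster_names(sample, roster, window=10)
```

Here `("Anna", "Jensen")` is dropped by the length threshold before any search runs, which is the trade-off the outline accepts for fewer false positives.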

89
TODO.md

@ -111,6 +111,95 @@ Optional session-level authentication gate for the main scanner interface. Set i
--- ---
### OCR language override ✅
Tesseract language pack(s) used for scanned PDFs and images are now configurable per profile. Option `ocr_lang` (default `dan+eng`). Presets: `dan+eng`, `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner.scan_pdf`/`scan_image` and the spawned PDF-OCR subprocess. OCR result cache keys include `lang` so per-language results are cached independently. Sidebar select `#optOcrLang`; profile editor `#peOptOcrLang`.
---
### CPR-only mode ✅
New scan option `cpr_only` (default `false`). When enabled, items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped — only items with at least one qualifying CPR number are flagged. Implemented as a compact short-circuit at each engine's flagging gate. Sidebar toggle `#optCprOnly`; profile editor `#peOptCprOnly`.
Also added `min_cpr_count` (default `1`) — minimum number of **distinct** CPR numbers required before a file is flagged. Files with faces or EXIF PII are still flagged regardless of this threshold.
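
As a rough illustration of how the two options might combine at a flagging gate, here is a small sketch. The function name and signature are hypothetical, and two behaviours are assumptions that the notes above do not spell out: that `cpr_only` takes precedence over the face/EXIF exemption, and that email or phone hits alone still flag a file when `cpr_only` is off.

```python
def should_flag(cprs, emails=0, phones=0, faces=0, exif_pii=False,
                *, cpr_only=False, min_cpr_count=1):
    """Decide whether an item is flagged (illustrative sketch only)."""
    # Count distinct CPR values, as min_cpr_count is defined on distinct numbers.
    distinct = len(set(cprs))
    cpr_hit = distinct >= max(1, min_cpr_count)

    if cpr_only:
        # Assumption: in CPR-only mode nothing but a qualifying CPR hit counts.
        return cpr_hit
    if faces > 0 or exif_pii:
        # Faces / EXIF PII bypass the min_cpr_count threshold.
        return True
    # Assumption: other PII signals still flag when cpr_only is off.
    return cpr_hit or emails > 0 or phones > 0
```

The values in `cprs` can be any hashable stand-ins (for example hashes), since only distinctness matters here.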
---
### Skip GPS images ✅
Scan option `skip_gps_images` (default `false`). When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still stored in the card `exif` field if the item is flagged by another signal. Sidebar toggle `#optSkipGps`; profile editor `#peOptSkipGps`.
---
### CPR cross-referencing (related documents) ✅
The preview panel now shows a "Related documents" section listing other items in the same scan session that share ≥1 CPR number. Clicking any related item opens its preview. Implemented as a query-time self-join on the existing `cpr_index` table — no new data collection needed. `GET /api/db/related/<item_id>?ref=N` returns rows ordered by shared CPR count descending.
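
A query-time self-join of this kind might look like the sketch below. The column names (`item_id`, `cpr_hash`, `scan_id`) are assumptions about the `cpr_index` schema made for illustration, and `related_items()` is a hypothetical helper, not the route's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real cpr_index table may differ.
conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT, scan_id INTEGER)")
conn.executemany(
    "INSERT INTO cpr_index VALUES (?, ?, ?)",
    [
        ("A", "h1", 1), ("A", "h2", 1),
        ("B", "h1", 1), ("B", "h2", 1),
        ("C", "h2", 1),
        ("D", "h3", 1),
    ],
)


def related_items(conn, item_id):
    """Items in the same scan sharing at least one CPR hash with `item_id`,
    ordered by the number of shared hashes, descending."""
    cur = conn.execute(
        """
        SELECT b.item_id, COUNT(DISTINCT b.cpr_hash) AS shared
        FROM cpr_index AS a
        JOIN cpr_index AS b
          ON b.cpr_hash = a.cpr_hash
         AND b.scan_id  = a.scan_id
         AND b.item_id != a.item_id
        WHERE a.item_id = ?
        GROUP BY b.item_id
        ORDER BY shared DESC, b.item_id
        """,
        (item_id,),
    )
    return cur.fetchall()


related_to_a = related_items(conn, "A")
related_to_d = related_items(conn, "D")
```

Because the join works on stored hashes only, no plaintext CPR number is needed to compute the relation.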
---
### Email preview on checkpoint resume ✅
A 500-character plain-text body excerpt (`body_excerpt`) is now stored per flagged email at broadcast time and persisted in the DB. When the preview modal opens for an email item, this excerpt is shown immediately without requiring a live Graph/Gmail connection. Enables email preview to work correctly after a server restart and checkpoint resume.
---
### Built-in file redaction ✅
Local files (`.docx`, `.xlsx`, `.csv`, `.txt`) can be redacted in-place: CPR numbers are replaced by `██████-████` / `█` blocks, the card is removed from the grid, and a `"redacted"` disposition is logged. The ✂ button appears on redactable local file cards (hidden in viewer mode and for resolved items). File is written to a temp path in the same directory before `shutil.move` to avoid cross-device rename failures.
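
The write-then-move pattern described above can be sketched for a plain-text file as follows. The helper name and the naive `DDMMYY-XXXX` regex are illustrative assumptions; the real scanner validates CPR numbers and handles several file formats.

```python
import os
import re
import shutil
import tempfile
from pathlib import Path

# Naive pattern for illustration; it does not validate dates or check digits.
CPR_PATTERN = re.compile(r"\b\d{6}-\d{4}\b")


def redact_text_file(path: Path) -> None:
    """Replace CPR-like numbers in a UTF-8 text file, writing via a temp file
    in the same directory so the final move stays on one filesystem."""
    original = path.read_text(encoding="utf-8")
    redacted = CPR_PATTERN.sub("██████-████", original)

    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(redacted)
        shutil.move(tmp_name, str(path))
    except Exception:
        # Leave no stray temp file behind if anything fails.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file next to the target means the move is normally a same-filesystem rename, so a reader should see either the old or the new file rather than a partial write.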
---
### Date-range scoping for viewer tokens ✅
Viewer tokens can now carry `valid_from` and/or `valid_to` fields (YYYY-MM-DD). `GET /api/db/flagged` filters out items whose `modified` date falls outside the range. All three scope dimensions (role, user, date-range) are independent and combinable. The share modal exposes `#shareValidFrom` / `#shareValidTo` date inputs. Token list shows a green date-range badge when a range is present.
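
A date-range check of this shape could be as small as the sketch below. It assumes `modified` is an ISO-8601 string beginning with `YYYY-MM-DD` and that both bounds are inclusive; neither detail is confirmed by the notes above, and the function name is hypothetical.

```python
from datetime import date


def in_token_range(modified, valid_from=None, valid_to=None):
    """True when the item's modified date lies inside the token's range.

    A missing bound means that side of the range is open.
    """
    day = date.fromisoformat(modified[:10])
    if valid_from and day < date.fromisoformat(valid_from):
        return False
    if valid_to and day > date.fromisoformat(valid_to):
        return False
    return True
```

Because each bound is optional, the same helper covers tokens with only `valid_from`, only `valid_to`, both, or neither.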
---
### Re-scan diff ✅
When viewing a history session, items present in the immediately preceding session but absent from the current one are shown below a `.resolved-divider` separator with a green ✓ Resolved badge (opacity dimmed). These resolved items are grid-only — they are not added to `S.flaggedData` and cannot be bulk-selected or exported. The history banner shows a resolved count when applicable.
---
### Tests for Google Workspace scan engine ✅
19 tests added in `tests/test_google_scan.py` covering: `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`, and `_run_google_scan` engine internals. Uses synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
---
### Compliance audit log ✅
Every significant admin action is written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, action key, human-readable detail, and client IP. `GET /api/audit_log` returns newest-first (max 1000; filterable by `?action=`). Visible in Settings → **Audit Log** tab; refreshes when the tab is opened. `log_audit_event()` helper in `gdpr_db.py` silently no-ops if the DB is unavailable.
---
### Scheduled report-only email job ✅
Scheduler jobs can now be configured as "report only" (toggle `#schedReportOnly`). The job skips the scan entirely and emails the latest results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from DB via `get_session_items()`. M365 auth is not required — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Enabling report-only automatically checks "Email report automatically" and dims the Profile field (unused for report-only runs).
---
### SFTP as a 4th file connector ✅
Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()` and everything downstream (SSE, DB, export, scheduling) is unchanged. Auth supports password and SSH private key (+ optional passphrase). Key files stored in `~/.gdprscanner/sftp_keys/`. SFTP sources appear in the file sources panel with a 🔒 icon, are profile-aware, and are included in scheduled scans automatically.
**Files changed:** `sftp_connector.py` (new), `scan_engine.py`, `routes/sources.py`, `app_config.py`, `static/js/sources.js`, `templates/index.html`, `lang/en|da|de.json`, `routes/export.py`, `requirements.txt`
---
### Checkpoint / resume for Google and File scans ✅
Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
---
### Extended document anonymisation (redaction beyond local DOCX/XLSX/CSV/TXT)
Currently the ✂ redact button only works for local files with extensions `.docx`, `.xlsx`, `.csv`, `.txt`. Several valuable cases are not yet covered:
**1. PDF redaction for local files** ✅ — `redact_pdf_secure` (PyMuPDF physical redaction) wired to `_REDACT_EXTS` and the ✂ button. Falls back to reportlab overlay if PyMuPDF is absent.
**2. OneDrive / SharePoint / Teams file redaction** ✅ — `put_drive_item_content()` added to `m365_connector.py`; `redact_item()` in `routes/export.py` extended with a cloud branch: download via Graph, redact to a local temp file, re-upload via PUT. Supports DOCX, XLSX, PDF. ✂ button shown on cloud cards with supported extensions.
**3. Google Drive file redaction** ✅ — `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` added to both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`. `redact_item()` extended with a `gdrive` branch: check MIME type (rejects Google Docs/Sheets), download bytes, redact locally, upload back via `files().update()`. Requires `drive` scope (not `drive.readonly`) on the service-account delegation. ✂ button shown on Drive cards with DOCX/XLSX/PDF extension.
**4. SMB / SFTP file redaction** ✅ — `write_file(remote_path, content)` added to `SFTPScanner`; `write_smb_file(path, content, user, password, domain)` added to `file_scanner.py`. `redact_item()` extended with `sftp` and `smb` branches: download via native protocol, redact locally, write back. Source config matched from `_load_file_sources()`. SFTP requires the item to still be in `state.flagged_items` (in-session only). ✂ button shown on SMB/SFTP cards with DOCX/XLSX/CSV/TXT/PDF extension.
**5. Email body redaction (Exchange / Gmail)** — overwrite the message body via Graph `PATCH /messages/{id}` or Gmail API. High effort and high risk: HTML formatting must be preserved, inline images handled, and a mistake permanently corrupts the email. **Recommendation: skip** — deleting the email is a safer and simpler GDPR response for emails containing CPR numbers.
**Priority order:** PDF (1) first since it reuses existing code. Cloud files (2-4) on demand.
**Size:** Small (PDF) · Medium (cloud/SMB/SFTP) · **Priority:** Medium
---
### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed. The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.


@ -1 +1 @@
1.6.22 1.7.9


@ -329,6 +329,43 @@ def _save_config(cfg: dict):
pass pass
# ── Claude NER config ─────────────────────────────────────────────────────────
def get_claude_config() -> dict:
cfg = _load_config()
return {
"enabled": bool(cfg.get("claude_ner", False)),
"api_key_set": bool(cfg.get("claude_api_key", "")),
}
def save_claude_config(enabled: bool, api_key: "str | None" = None) -> None:
cfg = _load_config()
cfg["claude_ner"] = bool(enabled)
if api_key is not None:
# Encrypt at rest with the machine-keyed Fernet (same as the SMTP
# password). Falls back to plaintext only if cryptography is missing.
cfg["claude_api_key"] = _encrypt_password(api_key) if api_key else ""
_save_config(cfg)
def get_claude_api_key() -> str:
"""Return the decrypted Claude API key (handles legacy plaintext)."""
return _decrypt_password(_load_config().get("claude_api_key", ""))
# ── Software update config ────────────────────────────────────────────────────
def get_update_config() -> dict:
return {"auto_update": bool(_load_config().get("auto_update", False))}
def save_update_config(auto_update: bool) -> None:
cfg = _load_config()
cfg["auto_update"] = bool(auto_update)
_save_config(cfg)
# ── Profile storage (15a) ───────────────────────────────────────────────────── # ── Profile storage (15a) ─────────────────────────────────────────────────────
_SETTINGS_PATH = _DATA_DIR / "settings.json" _SETTINGS_PATH = _DATA_DIR / "settings.json"
_SRC_TOGGLES_PATH = _DATA_DIR / "src_toggles.json" _SRC_TOGGLES_PATH = _DATA_DIR / "src_toggles.json"
@ -544,6 +581,8 @@ def _save_role_overrides(overrides: dict) -> None:
# ── File source settings (#8) ───────────────────────────────────────────────── # ── File source settings (#8) ─────────────────────────────────────────────────
_FILE_SOURCES_PATH = _DATA_DIR / "file_sources.json" _FILE_SOURCES_PATH = _DATA_DIR / "file_sources.json"
_SFTP_KEYS_DIR = _DATA_DIR / "sftp_keys"
_SFTP_KEYS_DIR.mkdir(exist_ok=True)
def _load_file_sources() -> list: def _load_file_sources() -> list:
@ -568,6 +607,32 @@ def _save_file_sources(sources: list) -> None:
except Exception as e: except Exception as e:
logger.error("[file_sources] write failed: %s", e) logger.error("[file_sources] write failed: %s", e)
def _resolve_sftp_credentials(source: dict) -> dict:
"""Return a copy of source with password/passphrase resolved from keychain.
Callers (run_file_scan, upload_key endpoint) should use this rather than
reading keychain credentials themselves, so the lookup logic stays in one place.
"""
try:
from sftp_connector import get_sftp_password
except ImportError:
return source
resolved = dict(source)
keychain_key = source.get("keychain_key") or None
host = source.get("sftp_host", "")
user = source.get("sftp_user", "")
if not resolved.get("sftp_password"):
resolved["sftp_password"] = get_sftp_password(host, user, keychain_key)
if not resolved.get("sftp_passphrase"):
# Passphrase stored under a distinct account name
passphrase_key = (keychain_key + ":passphrase") if keychain_key else None
resolved["sftp_passphrase"] = get_sftp_password(host, user, passphrase_key)
return resolved
# ── Viewer tokens ──────────────────────────────────────────────────────────── # ── Viewer tokens ────────────────────────────────────────────────────────────
# Read-only viewer tokens allow sharing scan results with a DPO or compliance # Read-only viewer tokens allow sharing scan results with a DPO or compliance
# officer without exposing scan controls or credentials. Each token is a # officer without exposing scan controls or credentials. Each token is a
@ -748,7 +813,7 @@ def clear_viewer_pin() -> None:
# ── SMTP password encryption ───────────────────────────────────────────────── # ── SMTP password encryption ─────────────────────────────────────────────────
# The SMTP password is encrypted at rest using Fernet symmetric encryption. # The SMTP password is encrypted at rest using Fernet symmetric encryption.
# The encryption key is derived from a stable machine-specific UUID stored in # The encryption key is derived from a stable machine-specific UUID stored in
# ~/.gdpr_scanner_machine_id. This key is only usable on the same machine — # ~/.gdprscanner/machine_id. This key is only usable on the same machine —
# the encrypted password cannot be decrypted if the config file is copied to # the encrypted password cannot be decrypted if the config file is copied to
# another host. # another host.
@ -813,6 +878,13 @@ def _load_smtp_config() -> dict:
cfg = json.loads(_SMTP_CONFIG_PATH.read_text(encoding="utf-8")) cfg = json.loads(_SMTP_CONFIG_PATH.read_text(encoding="utf-8"))
if cfg.get("password"): if cfg.get("password"):
cfg["password"] = _decrypt_password(cfg["password"]) cfg["password"] = _decrypt_password(cfg["password"])
# Normalise legacy key names written by an older settings-tab UI
# (`user`/`starttls`) to the canonical keys every reader expects
# (`username`/`use_tls`), so configs saved before the fix still work.
if "username" not in cfg and "user" in cfg:
cfg["username"] = cfg["user"]
if "use_tls" not in cfg and "starttls" in cfg:
cfg["use_tls"] = cfg["starttls"]
return cfg return cfg
except Exception: except Exception:
pass pass


@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)
_DATA_DIR = Path.home() / ".gdprscanner" _DATA_DIR = Path.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True) _DATA_DIR.mkdir(exist_ok=True)
_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
def _cp_path(prefix: str) -> Path:
return _DATA_DIR / f"checkpoint_{prefix}.json"
def _checkpoint_key(options: dict) -> str: def _checkpoint_key(options: dict) -> str:
"""Stable hash of the scan options — used to detect when a checkpoint """Stable hash of the scan options — used to detect when a checkpoint
@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
}, sort_keys=True) }, sort_keys=True)
return hashlib.sha256(sig.encode()).hexdigest()[:16] return hashlib.sha256(sig.encode()).hexdigest()[:16]
def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None: def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
"""Write checkpoint to disk. Called periodically during scanning.""" """Write checkpoint to disk. Called periodically during scanning."""
try: try:
payload = { payload = {
@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
"flagged": flagged, "flagged": flagged,
"meta": {k: v for k, v in meta.items() if k != "options"}, "meta": {k: v for k, v in meta.items() if k != "options"},
} }
tmp = _CHECKPOINT_PATH.with_suffix(".tmp") path = _cp_path(prefix)
tmp = path.with_suffix(".tmp")
tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8") tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
tmp.replace(_CHECKPOINT_PATH) tmp.replace(path)
except Exception as e: except Exception as e:
logger.error("[checkpoint] save failed: %s", e) logger.error("[checkpoint] save failed: %s", e)
def _load_checkpoint(key: str) -> dict | None: def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
"""Load checkpoint if it matches the current scan key. Returns None on mismatch or error.""" """Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
try: try:
if not _CHECKPOINT_PATH.exists(): path = _cp_path(prefix)
if not path.exists():
return None return None
payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8")) payload = json.loads(path.read_text(encoding="utf-8"))
if payload.get("key") != key: if payload.get("key") != key:
return None return None
return payload return payload
except Exception: except Exception:
return None return None
def _clear_checkpoint() -> None: def _clear_checkpoint(*, prefix: str = "m365") -> None:
try: try:
if _CHECKPOINT_PATH.exists(): path = _cp_path(prefix)
_CHECKPOINT_PATH.unlink() if path.exists():
path.unlink()
except Exception: except Exception:
pass pass


@ -22,6 +22,7 @@ from __future__ import annotations
import base64 import base64
import hashlib import hashlib
import io import io
import re
import tempfile import tempfile
import threading import threading
from pathlib import Path from pathlib import Path
@ -419,49 +420,6 @@ def _extract_audio_metadata(content: bytes, filename: str) -> dict:
return result return result
"""Detect faces in an image file using OpenCV Haar cascades.
Returns the number of faces detected, or 0 if cv2 is unavailable,
the file is not a supported image format, or decoding fails.
Face detection is intentionally strict (minNeighbors=8, min_size=80px) to
reduce false positives on background textures, labels, and artwork.
Haar cascades are tuned for compliance flagging, not exhaustive detection. (#9)
"""
if not SCANNER_OK:
return 0
try:
cv2_mod = getattr(ds, "_get_cv2", None)
if cv2_mod is None:
return 0
cv2, np = ds._get_cv2()
if cv2 is None or np is None:
return 0
except Exception:
return 0
try:
# Decode image bytes → cv2 BGR array
arr = np.frombuffer(content, dtype=np.uint8)
img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
if img is None:
# imdecode failed (e.g. HEIC without codec) — try PIL fallback
if PIL_OK:
try:
from PIL import Image as _PILImg
import io as _io
pil_img = _PILImg.open(_io.BytesIO(content)).convert("RGB")
pil_arr = np.array(pil_img)
img = cv2.cvtColor(pil_arr, cv2.COLOR_RGB2BGR)
except Exception:
return 0
else:
return 0
faces = ds.detect_faces_cv2(img, min_size=80, neighbors=8)
return len(faces)
except Exception:
return 0
def _detect_photo_faces(content: bytes, filename: str) -> int: def _detect_photo_faces(content: bytes, filename: str) -> int:
"""Detect faces in an image file using OpenCV Haar cascades. """Detect faces in an image file using OpenCV Haar cascades.
@ -505,67 +463,151 @@ def _detect_photo_faces(content: bytes, filename: str) -> int:
return 0 return 0
def _scan_bytes(content: bytes, filename: str, poppler_path=None) -> dict: _EMAIL_RE = re.compile(
"""Scan raw bytes for CPRs. Returns scanner result dict.""" r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
_PHONE_RE = re.compile(
r'(?:'
r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}' # +45/0045 DDDD DDDD
r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}' # +45/0045 DD DD DD DD
r'|\b[2-9]\d{7}\b' # 8 consecutive digits
r'|\b[2-9]\d{3}[\s\-]\d{4}\b' # DDDD DDDD
r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b' # DD DD DD DD
r')'
)
def _extract_text_from_bytes(content: bytes, filename: str) -> str:
"""Extract plain text from file bytes for email/phone pattern matching.
Returns an empty string for binary media files (photos, video, audio) and
on any parse error; this function must never raise.
"""
ext = Path(filename).suffix.lower()
try:
if ext in {".txt", ".csv", ".eml", ".msg"}:
return content.decode("utf-8", errors="replace")
if ext in {".docx", ".doc"}:
from docx import Document as _Doc
doc = _Doc(io.BytesIO(content))
parts = [p.text for p in doc.paragraphs]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.append(cell.text)
return "\n".join(parts)
if ext in {".xlsx", ".xlsm"}:
import openpyxl as _xl
wb = _xl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
parts = [
str(cell.value)
for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row
if cell.value is not None
]
wb.close()
return " ".join(parts)
if ext == ".pdf":
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as pdf:
parts = [p.extract_text() or "" for p in pdf.pages]
return "\n".join(parts)
except Exception:
pass
if ext not in PHOTO_EXTS | VIDEO_EXTS | AUDIO_EXTS:
try:
return content.decode("utf-8", errors="replace")
except Exception:
pass
return ""
def _find_emails_phones(text: str) -> dict:
"""Extract unique email addresses and Danish phone numbers from text.
Returns {"emails": [{"formatted": str}, ...], "phones": [{"formatted": str}, ...]}.
Phones are normalised to digit-only strings (preserving a leading '+').
"""
if not text:
return {"emails": [], "phones": []}
emails = list(dict.fromkeys(m.group(0).lower() for m in _EMAIL_RE.finditer(text)))
phones = list(dict.fromkeys(
('+' + re.sub(r'[\s\-]', '', m.group(0)[1:]) if m.group(0).lstrip().startswith('+')
else re.sub(r'[\s\-]', '', m.group(0)))
for m in _PHONE_RE.finditer(text)
))
return {
"emails": [{"formatted": e} for e in emails],
"phones": [{"formatted": p} for p in phones],
}
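To see what the two patterns above actually return, here is a small self-contained sketch that copies `_EMAIL_RE` and `_PHONE_RE` from this hunk and runs them on a made-up sample string. The function name `find_emails_phones` and the sample text are illustrative only, and the normalisation is written as a single `re.sub` over the whole match, which should be equivalent to the two-branch version above because `+` is neither whitespace nor a hyphen.

```python
import re

# Patterns copied from the hunk above.
_EMAIL_RE = re.compile(
    r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
_PHONE_RE = re.compile(
    r'(?:'
    r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}'       # +45/0045 DDDD DDDD
    r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}'   # +45/0045 DD DD DD DD
    r'|\b[2-9]\d{7}\b'                                  # 8 consecutive digits
    r'|\b[2-9]\d{3}[\s\-]\d{4}\b'                       # DDDD DDDD
    r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b'                   # DD DD DD DD
    r')'
)


def find_emails_phones(text: str) -> dict:
    """Return unique, order-preserving emails and normalised phone numbers."""
    if not text:
        return {"emails": [], "phones": []}
    # dict.fromkeys de-duplicates while keeping first-seen order.
    emails = list(dict.fromkeys(
        m.group(0).lower() for m in _EMAIL_RE.finditer(text)
    ))
    phones = list(dict.fromkeys(
        re.sub(r'[\s\-]', '', m.group(0)) for m in _PHONE_RE.finditer(text)
    ))
    return {
        "emails": [{"formatted": e} for e in emails],
        "phones": [{"formatted": p} for p in phones],
    }


sample = (
    "Skriv til Anna.Jensen@Example.dk eller anna.jensen@example.dk, "
    "tlf. +45 20 12 34 56 / 2012 3456."
)
found = find_emails_phones(sample)
```

On this sample the two spellings of the email collapse into one entry after lowercasing, while the phone number is reported twice (`+4520123456` and `20123456`), because the prefixed and unprefixed forms normalise to different strings.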
def _scan_bytes(content: bytes, filename: str, poppler_path=None, lang: str = "dan+eng") -> dict:
"""Scan raw bytes for CPRs, emails, and phone numbers. Returns result dict."""
if not SCANNER_OK: if not SCANNER_OK:
return {"cprs": [], "dates": [], "error": "scanner not available"} return {"cprs": [], "dates": [], "emails": [], "phones": [], "error": "scanner not available"}
ext = Path(filename).suffix.lower() ext = Path(filename).suffix.lower()
with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp: with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(content) tmp.write(content)
tmp_path = Path(tmp.name) tmp_path = Path(tmp.name)
result: dict = {"cprs": [], "dates": []}
try: try:
if ext == ".pdf": if ext == ".pdf":
# Check if the PDF has a text layer before running full scan_pdf. # Check if the PDF has a text layer before running full scan_pdf.
# Image-only PDFs (scanned documents) have no text and would trigger # Image-only PDFs (scanned documents) have no text and would trigger
# Tesseract OCR subprocesses that hang indefinitely on some files. # Tesseract OCR subprocesses that hang indefinitely on some files.
try: try:
import pdfplumber as _pp, io as _io import pdfplumber as _pp
with _pp.open(_io.BytesIO(content)) as _pdf: with _pp.open(io.BytesIO(content)) as _pdf:
has_text = any(ds.is_text_page(p) for p in _pdf.pages) has_text = any(ds.is_text_page(p) for p in _pdf.pages)
if not has_text: if not has_text:
return {"cprs": [], "dates": []} # image-only PDF — no CPRs possible return {"cprs": [], "dates": [], "emails": [], "phones": []}
except Exception: except Exception:
pass # if pdfplumber fails, fall through to full scan_pdf pass # if pdfplumber fails, fall through to full scan_pdf
return ds.scan_pdf(tmp_path, poppler_path=poppler_path) result = ds.scan_pdf(tmp_path, poppler_path=poppler_path, lang=lang)
elif ext in {".docx", ".doc"}: elif ext in {".docx", ".doc"}:
return ds.scan_docx(tmp_path) result = ds.scan_docx(tmp_path)
elif ext in {".xlsx", ".xlsm"}: elif ext in {".xlsx", ".xlsm"}:
return ds.scan_xlsx(tmp_path) result = ds.scan_xlsx(tmp_path)
elif ext == ".csv": elif ext == ".csv":
return ds.scan_csv(tmp_path) result = ds.scan_csv(tmp_path)
elif ext == ".txt": elif ext == ".txt":
text = content.decode("utf-8", errors="replace") text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text") cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates} result = {"cprs": cprs, "dates": dates}
elif ext in {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}: elif ext in {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}:
return ds.scan_image(tmp_path) result = ds.scan_image(tmp_path, lang=lang)
else: else:
# Try plain text
try: try:
text = content.decode("utf-8", errors="replace") text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text") cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates} result = {"cprs": cprs, "dates": dates}
except Exception: except Exception:
return {"cprs": [], "dates": []} pass
except Exception as e: except Exception as e:
return {"cprs": [], "dates": [], "error": str(e)} result = {"cprs": [], "dates": [], "error": str(e)}
finally: finally:
try: try:
tmp_path.unlink() tmp_path.unlink()
except Exception: except Exception:
pass pass
ep = _find_emails_phones(_extract_text_from_bytes(content, filename))
result["emails"] = ep["emails"]
result["phones"] = ep["phones"]
return result
def _worker_scan_pdf(pdf_path_str: str, result_q) -> None: def _worker_scan_pdf(pdf_path_str: str, result_q, lang: str = "dan+eng") -> None:
"""Worker executed in a spawned subprocess — must be a module-level function.""" """Worker executed in a spawned subprocess — must be a module-level function."""
try: try:
import document_scanner as _ds import document_scanner as _ds
from pathlib import Path as _Path from pathlib import Path as _Path
result_q.put(_ds.scan_pdf(_Path(pdf_path_str))) result_q.put(_ds.scan_pdf(_Path(pdf_path_str), lang=lang))
except Exception as e: except Exception as e:
result_q.put({"cprs": [], "dates": [], "error": str(e)}) result_q.put({"cprs": [], "dates": [], "error": str(e)})
def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dict: def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang: str = "dan+eng") -> dict:
"""Like _scan_bytes but runs PDF scanning in a spawned subprocess with a hard timeout. """Like _scan_bytes but runs PDF scanning in a spawned subprocess with a hard timeout.
For non-PDF files delegates straight to _scan_bytes. For PDFs it writes the For non-PDF files delegates straight to _scan_bytes. For PDFs it writes the
@ -575,7 +617,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
""" """
ext = Path(filename).suffix.lower() ext = Path(filename).suffix.lower()
if ext != ".pdf": if ext != ".pdf":
return _scan_bytes(content, filename) return _scan_bytes(content, filename, lang=lang)
import multiprocessing import multiprocessing
ctx = multiprocessing.get_context("spawn") ctx = multiprocessing.get_context("spawn")
@ -588,7 +630,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
try: try:
with _pdf_subprocess_sem: with _pdf_subprocess_sem:
q = ctx.Queue() q = ctx.Queue()
p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q)) p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q, lang))
p.start() p.start()
p.join(timeout) p.join(timeout)
if p.is_alive(): if p.is_alive():
@ -607,19 +649,22 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
def _scan_text_direct(text: str) -> dict: def _scan_text_direct(text: str) -> dict:
"""Scan a plain text string for CPRs using extract_matches. """Scan a plain text string for CPRs, emails, and phone numbers.
Uses ds.extract_matches() directly rather than ds.scan_text() because Uses ds.extract_matches() directly rather than ds.scan_text() because
scan_text() calls extract_cpr_and_dates() which is not defined in scan_text() calls extract_cpr_and_dates() which is not defined in
document_scanner.py (pre-existing bug). document_scanner.py (pre-existing bug).
""" """
if not SCANNER_OK or not text: if not text:
return {"cprs": [], "dates": []} return {"cprs": [], "dates": [], "emails": [], "phones": []}
ep = _find_emails_phones(text)
if not SCANNER_OK:
return {"cprs": [], "dates": [], **ep}
try: try:
cprs, dates = ds.extract_matches(text, 1, "text") cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates} return {"cprs": cprs, "dates": dates, **ep}
except Exception: except Exception:
return {"cprs": [], "dates": []} return {"cprs": [], "dates": [], **ep}
def _html_esc(s: str) -> str: def _html_esc(s: str) -> str:
"""HTML-escape a string for safe inline embedding.""" """HTML-escape a string for safe inline embedding."""
@ -661,6 +706,11 @@ def _placeholder_svg(ext: str, name: str) -> str:
} }
bg, label = colors.get(ext, ("#9CA3AF", ext.upper().lstrip("."))) bg, label = colors.get(ext, ("#9CA3AF", ext.upper().lstrip(".")))
short = name[:22] + "" if len(name) > 22 else name short = name[:22] + "" if len(name) > 22 else name
# Escape label/name before embedding — served as image/svg+xml, so an
# unescaped value (from the ?name= query param via /api/thumb) would be a
# reflected-XSS vector when the URL is opened directly.
label = _html_esc(label)
short = _html_esc(short)
svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="280" height="360"> svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="280" height="360">
<rect width="280" height="360" fill="{bg}"/> <rect width="280" height="360" fill="{bg}"/>
<rect x="20" y="20" width="240" height="280" rx="8" fill="rgba(255,255,255,0.12)"/> <rect x="20" y="20" width="240" height="280" rx="8" fill="rgba(255,255,255,0.12)"/>


@ -1,6 +1,6 @@
# GDPR Scanner — Brugermanual # GDPR Scanner — Brugermanual
Version 1.6.20 Version 1.7.9
--- ---
@ -33,7 +33,7 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
**Hvad scanneren gennemgår:** **Hvad scanneren gennemgår:**
- Microsoft 365: Exchange e-mail, OneDrive, SharePoint, Teams - Microsoft 365: Exchange e-mail, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drev - Google Workspace: Gmail, Google Drev
- Lokale og netværksbaserede filmapper (herunder SMB/NAS-drev) - Lokale og netværksbaserede filmapper (herunder SMB/NAS-drev og SFTP-servere)
**Hvad den finder:** **Hvad den finder:**
- CPR-numre - CPR-numre
@ -50,16 +50,16 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
Når du åbner scanneren, er skærmen inddelt i tre områder: Når du åbner scanneren, er skærmen inddelt i tre områder:
``` ```
┌─────────────────┬──────────────────────────────────────────┐ ┌───────────────────────────────────────────────────────────────┐
│ │ Topbjælke: Scan-knap, profiler, handlinger │ │ │ Topbjælke: Scan-knap, profiler, handlinger │
│ Venstre panel ├──────────────────────────────────────────┤ │ Venstre panel ──────────────────────────────────────────────┤
│ │ │ │ │ │
│ - Kilder │ Resultater / scanningsforløb │ │ - Kilder │ Resultater / scanningsforløb │
│ - Indstillinger │ │ │ - Indstillinger │ │
│ - Konti │ │ │ - Konti │ │
│ - Statistik ├──────────────────────────────────────────┤ │ - Statistik ──────────────────────────────────────────────┤
│ │ Aktivitetslog │ │ │ Aktivitetslog │
└─────────────────┴──────────────────────────────────────────┘ └───────────────────────────────────────────────────────────────┘
``` ```
**Venstre panel** — vælg hvad der skal scannes og hvordan. **Venstre panel** — vælg hvad der skal scannes og hvordan.
@ -104,17 +104,33 @@ Fanen Google Workspace lader dig forbinde en Google Workspace-konto (tidligere G
| Gmail | Alle e-mails i den enkelte brugers indbakke og labels | | Gmail | Alle e-mails i den enkelte brugers indbakke og labels |
| Google Drev | Alle filer ejet af eller delt med den enkelte bruger | | Google Drev | Alle filer ejet af eller delt med den enkelte bruger |
### 3.3 Lokale og netværksbaserede filer ### 3.3 Lokale, netværksbaserede og SFTP-filkilder
Fanen **Filkilder** viser de lokale mapper og netværksdrev, du har konfigureret. Fanen **Filkilder** viser de lokale mapper, netværksdrev og SFTP-servere, du har konfigureret.
**Sådan tilføjer du en ny filkilde:** **Sådan tilføjer du en ny filkilde:**
1. Indtast en **Betegnelse** — et navn du kan genkende (f.eks. "Skolens Fællesmappe"). 1. Indtast en **Betegnelse** — et navn du kan genkende (f.eks. "Skolens Fællesmappe").
2. Indtast **Stien**: 2. Vælg **kildetype** med pillerne øverst i formularen:
- Lokal mappe: `~/Dokumenter` eller `/Volumes/Drev`
- Netværksdrev: `//nas-server/delt` eller `\\server\delt` **Lokal**
3. Hvis det er et netværksdrev, udfyldes felterne **SMB-vært**, **Brugernavn** og **Adgangskode** automatisk. Adgangskoden gemmes sikkert i systemets nøglering. - Indtast **Stien** til mappen: `~/Dokumenter` eller `/Volumes/Drev`.
4. Klik på **Tilføj**. - Klik på **Tilføj**.
**Netværk (SMB)**
- Indtast **Stien** i UNC-format: `//nas-server/delt` eller `\\server\delt`.
- Udfyld **SMB-vært**, **Brugernavn** og **Adgangskode**. Adgangskoden gemmes sikkert i systemets nøglering.
- Klik på **Tilføj**.
**SFTP**
- Indtast **Vært** (værtsnavn eller IP-adresse på SSH/SFTP-serveren).
- Indtast **Port** (standard 22).
- Indtast **Brugernavn**.
- Indtast **Fjernsti**, der skal scannes (f.eks. `/home/delt` eller `/`).
- Vælg **Godkendelsestype**:
- **Adgangskode** — indtast adgangskoden. Den gemmes sikkert i systemets nøglering.
- **Privat nøgle** — klik på **Upload nøglefil** og vælg din SSH-privatnøgle (OpenSSH- eller PEM-format). Hvis nøglen er beskyttet med en adgangssætning, skal du indtaste den. Nøglefilen gemmes i scannerens datamappe med `600`-rettigheder.
- Klik på **Tilføj**.
Du kan tilføje så mange filkilder, du har brug for. De vil fremgå som valgbare kilder i venstre panel, når du er klar til at scanne. Du kan tilføje så mange filkilder, du har brug for. De vil fremgå som valgbare kilder i venstre panel, når du er klar til at scanne.
@ -154,6 +170,10 @@ Scan kun elementer ændret efter en bestemt dato. Hurtige forudindstillinger —
**Maks. e-mails pr. bruger** — stop efter at have scannet dette antal e-mails per person (standard 2.000). Øg det, hvis du har brug for fuld dækning. **Maks. e-mails pr. bruger** — stop efter at have scannet dette antal e-mails per person (standard 2.000). Øg det, hvis du har brug for fuld dækning.
**Kun CPR-tilstand** — når aktiveret, flagges kun elementer, der indeholder mindst ét kvalificerende CPR-nummer. Elementer, hvis eneste fund er e-mailadresser, telefonnumre, ansigter eller GPS/EXIF-metadata, springes over. Nyttigt, når du ønsker en fokuseret rapport udelukkende om CPR-eksponering.
**OCR-sprog** — vælg den sprogpakke, Tesseract bruger, når der læses tekst fra scannede PDF-filer og billeder. Standard er `Dansk + Engelsk`, som dækker langt de fleste dokumenter. Skift til en anden forudindstilling, hvis dine dokumenter overvejende er på et andet sprog.
### 4.4 Start scanningen ### 4.4 Start scanningen
Klik på den blå **Scan**-knap i topbjælken. Klik på den blå **Scan**-knap i topbjælken.
@ -180,6 +200,8 @@ Klik på **▶ Genoptag** for at fortsætte fra det sted, scanningen slap. Klik
## 5. Forstå resultaterne ## 5. Forstå resultaterne
Når du åbner appen, viser gitteret **alle åbne fund** — alle markerede elementer, der stadig kræver handling (dvs. uden disposition), på tværs af alle dine scanninger og ikke kun den seneste. Efterhånden som du mærker elementer (behold, anonymisér, slet, falsk positiv …), forsvinder de fra denne visning, så det, der står tilbage, er dit udestående arbejde. Hvert element vises én gang med sin nyeste tilstand. Vil du i stedet se en enkelt tidligere scanning, så brug sessionsvælgeren (se *Gennemse tidligere scanningssessioner* nedenfor).
Hvert fundet element vises som et kort. Her er forklaringen på mærker og labels: Hvert fundet element vises som et kort. Her er forklaringen på mærker og labels:
### Kildemærker ### Kildemærker
@ -192,7 +214,8 @@ Hvert fundet element vises som et kort. Her er forklaringen på mærker og label
| Teams | Fundet i en Teams-kanal | | Teams | Fundet i en Teams-kanal |
| Gmail | Fundet i en Gmail-postkasse | | Gmail | Fundet i en Gmail-postkasse |
| Google Drev | Fundet i Google Drev | | Google Drev | Fundet i Google Drev |
| Lokal / Netværk | Fundet på et filshare | | Lokal / Netværk | Fundet på et lokalt eller SMB-filshare |
| 🔒 SFTP | Fundet på en SFTP-server |
### Risikoniveau ### Risikoniveau
@ -235,7 +258,7 @@ Når en scanning er afsluttet, kan du gennemse resultaterne fra en tidligere sca
- Klik på **Sessioner**-knappen i historikbanneret (der vises over resultatgitteret, når en scanning er afsluttet) for at åbne sessionsvælgeren. - Klik på **Sessioner**-knappen i historikbanneret (der vises over resultatgitteret, når en scanning er afsluttet) for at åbne sessionsvælgeren.
- Hver række viser dato og tidspunkt, hvilke kilder der blev scannet, og hvor mange elementer der blev fundet. Et **Δ**-mærkat angiver delta-scanninger; **Seneste** markerer den nyeste session. - Hver række viser dato og tidspunkt, hvilke kilder der blev scannet, og hvor mange elementer der blev fundet. Et **Δ**-mærkat angiver delta-scanninger; **Seneste** markerer den nyeste session.
- Klik på en række for at indlæse den pågældende sessions resultater i gitteret. Et historikbanner erstatter statuslinjen med sessionens oplysninger. - Klik på en række for at indlæse den pågældende sessions resultater i gitteret. Et historikbanner erstatter statuslinjen med sessionens oplysninger.
- Klik på **Seneste scanning** i banneret for at vende tilbage til den nyeste session. - Klik på **Åbne fund** i banneret for at forlade den tidligere session og vende tilbage til standardvisningen med alle elementer, der stadig kræver handling.
- Start af en ny scanning afslutter automatisk historiktilstanden og skifter til live-resultater. - Start af en ny scanning afslutter automatisk historiktilstanden og skifter til live-resultater.
Alle filtre, eksporter og dispositionsmærkning fungerer normalt, mens du gennemser tidligere sessioner. Alle filtre, eksporter og dispositionsmærkning fungerer normalt, mens du gennemser tidligere sessioner.
@ -253,6 +276,7 @@ Forhåndsvisningen viser:
- Alle fundne CPR-numre og deres kontekst - Alle fundne CPR-numre og deres kontekst
- Øvrige personoplysninger registreret (telefon, e-mailadresse, IBAN mv.) - Øvrige personoplysninger registreret (telefon, e-mailadresse, IBAN mv.)
- Deling og ekstern adgangsinformation - Deling og ekstern adgangsinformation
- **Relaterede dokumenter** — hvis andre elementer i samme scanningssession indeholder ét eller flere af de samme CPR-numre, vises de i et "Relaterede dokumenter"-afsnit. Klik på et element for at åbne dets forhåndsvisning. Det gør det nemmere at spore en persons data på tværs af flere filer eller e-mails.
### Angiv en disposition ### Angiv en disposition
@ -270,6 +294,30 @@ Hvert element har en **Disposition**-rullemenu i forhåndsvisningspanelet. Vælg
Klik på **Gem** efter valget. En lille **✓ Gemt**-bekræftelse vises. Klik på **Gem** efter valget. En lille **✓ Gemt**-bekræftelse vises.
### Redigér en fil på stedet
En **✂**-knap vises på resultatkort, hvor scanneren kan overskrive filen direkte. Klikker du på den, erstattes alle CPR-numre med `██████-████`-blokke, og handlingen registreres som en `"redacted"`-disposition. Kortet **bevares i gitteret indtil din næste scanning** — det vises nedtonet med et grønt **✏ Redigeret**-mærke, og dets handlingsknapper skjules, så det ikke kan behandles igen. På den måde kan du let se, hvad du har håndteret i sessionen; gitteret genopbygges, næste gang du scanner. Brug denne mulighed, når du ønsker at anonymisere en fil frem for at slette den helt.
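For plain-text formats (TXT, CSV), the replacement described above can be sketched as a single regex substitution. This is a minimal illustration, not the scanner's implementation: the pattern `_CPR_LIKE` is an assumed, shape-only match (six digits, optional hyphen, four digits) without the date or checksum validation a real CPR matcher would likely apply, and `redact_cprs` is a hypothetical name.

```python
import re
from typing import Tuple

# Assumption: shape-only CPR pattern, DDMMYY-SSSS with an optional hyphen.
_CPR_LIKE = re.compile(r"\b\d{6}-?\d{4}\b")
_MASK = "██████-████"


def redact_cprs(text: str) -> Tuple[str, int]:
    """Replace every CPR-shaped token with the mask.

    Returns the redacted text and the number of replacements made.
    """
    return _CPR_LIKE.subn(_MASK, text)


redacted, count = redact_cprs("CPR: 010190-1234 og 0101901234.")
```

The replacement count can serve as a guard: if it is zero, nothing needs to be written back and no `"redacted"` disposition needs to be recorded. PDF and Office formats need format-aware handling and are not covered by this sketch.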
Knappen er tilgængelig for følgende kildetyper og formater:
| Kilde | Understøttede formater |
|---|---|
| Lokale filer | DOCX, XLSX, CSV, TXT, PDF |
| Netværksdrev (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drev | DOCX, XLSX, PDF |
Knappen er **ikke** tilgængelig for e-mail-elementer (Exchange/Gmail) eller i visningsmode. Google Docs og Sheets, der er eksporteret som DOCX/XLSX under scanning, kan ikke redigeres på stedet — eksportér filen manuelt fra Google først og redigér derefter den hentede kopi.
> **PDF-sikkerhedsnote:** PDF-redigering sker fysisk — CPR-nummerteksten slettes fra PDF-datastrømmen og er ikke blot dækket over med en sort boks. En læser kan ikke gendanne den oprindelige tekst ved at markere under redigeringen eller ved programmatisk inspektion af filen. Billedbaserede (scannede) PDF-filer understøttes også: scanneren lokaliserer CPR-nummeret på sidebilledet via OCR og overskriver det pågældende område fysisk.
> **OneDrive / SharePoint / Teams-note:** Redigering skriver den ændrede fil tilbage via Microsoft Graph API og kræver tilladelsen `Files.ReadWrite.All`. Scanneren anmoder nu automatisk om denne tilladelse ved login. Hvis du har godkendt før denne opdatering, skal du logge ud og logge ind igen (Indstillinger → Microsoft 365 → Log ud), så scanneren henter et nyt token med skriveadgang. Ved app-only-opsætninger (serviceprincipal) skal en Global Administrator tildele applikationstilladelsen `Files.ReadWrite.All` i Azure → App-registreringer → API-tilladelser → Giv administratorsamtykke.
> **Google Drev-note:** Redigering i Google Drev kræver `drive`-scopet på servicekontoens domain-wide delegation (ikke blot `drive.readonly`). Hvis redigeringen fejler med en rettighedsfejl, bedes du kontakte din Google Workspace-administrator for at tilføje scopet `https://www.googleapis.com/auth/drive` til servicekontoens delegation i Admin Console.
> **SFTP-note:** SFTP-redigering er kun tilgængelig for elementer fundet i den aktuelle scansession. Gennemfør en ny scanning, hvis du gennemser historiske resultater.
### Massemarkering af flere elementer på én gang ### Massemarkering af flere elementer på én gang
Hvis du skal anvende den samme disposition på mange elementer, kan du bruge **Vælg-tilstand** i stedet for at åbne hvert kort enkeltvis. Hvis du skal anvende den samme disposition på mange elementer, kan du bruge **Vælg-tilstand** i stedet for at åbne hvert kort enkeltvis.
@ -316,6 +364,8 @@ Klik på **Slet**-knappen i filterbjælken for at åbne massesletningsvinduet.
4. En statuslinje viser sletningerne i realtid. E-mails flyttes til **Slettet post**; filer flyttes til **papirkurven**. 4. En statuslinje viser sletningerne i realtid. E-mails flyttes til **Slettet post**; filer flyttes til **papirkurven**.
Slettede elementer (uanset om det er en enkelt sletning, en massesletning eller en sletning efter anmodning fra en registreret) **bevares i gitteret indtil din næste scanning** — nedtonet med et rødt **🗑 Slettet**-mærke og med skjulte handlingsknapper — så du kan se, hvad der blev fjernet i sessionen. Hvis en massesletning delvist mislykkes, markeres kun de elementer, serveren faktisk slettede; de, der fejlede, forbliver aktive, så du kan forsøge igen. Gitteret genopbygges, næste gang du scanner.
En fuldstændig revisionslog over alle sletninger (hvad der er slettet, hvornår og hvorfor) medtages i artikel 30-rapporten. En fuldstændig revisionslog over alle sletninger (hvad der er slettet, hvornår og hvorfor) medtages i artikel 30-rapporten.
--- ---
@ -352,7 +402,7 @@ Klik på **Profiler** for at åbne profil­administrations­panelet. Her kan du:
Klik på **Excel** i filterbjælken for at downloade de aktuelle resultater som en Excel-projektmappe. Projektmappen indeholder: Klik på **Excel** i filterbjælken for at downloade de aktuelle resultater som en Excel-projektmappe. Projektmappen indeholder:
- Et oversigtsfaneblad med scanningsdato, antal elementer og kildefordeling. - Et oversigtsfaneblad med scanningsdato, antal elementer og kildefordeling.
- Et separat faneblad for hver kildetype (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Lokal, Netværk). - Et separat faneblad for hver kildetype (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Lokal, Netværk, SFTP).
- Alle fundne elementer, herunder kilde, konto, CPR-antal, risikoniveau, delingsstatus og disposition. - Alle fundne elementer, herunder kilde, konto, CPR-antal, risikoniveau, delingsstatus og disposition.
Knapperne **Excel** og **Art.30** er altid tilgængelige — også efter genstart af programmet — og eksporterer resultaterne fra den seneste afsluttede scanningssession uden at kræve en ny scanning. Knapperne **Excel** og **Art.30** er altid tilgængelige — også efter genstart af programmet — og eksporterer resultaterne fra den seneste afsluttede scanningssession uden at kræve en ny scanning.
@ -391,9 +441,10 @@ Klik på **🔗**-knappen øverst til højre i topbjælken for at åbne delingsp
- **Alle roller** — modtageren ser alle fundne elementer. - **Alle roller** — modtageren ser alle fundne elementer.
- **Ansatte** / **Elever** — modtageren ser kun elementer tilhørende den valgte rollegruppe. Rollefilteret er låst i deres visning. - **Ansatte** / **Elever** — modtageren ser kun elementer tilhørende den valgte rollegruppe. Rollefilteret er låst i deres visning.
- **Bruger** — modtageren ser kun elementer tilhørende en bestemt medarbejder. Vælg personen fra søgefeltet; scanneren matcher automatisk både deres M365- og Google Workspace-e-mailadresser. Brug denne mulighed, når du vil give en enkelt medarbejder adgang til sine egne scanningsresultater. - **Bruger** — modtageren ser kun elementer tilhørende en bestemt medarbejder. Vælg personen fra søgefeltet; scanneren matcher automatisk både deres M365- og Google Workspace-e-mailadresser. Brug denne mulighed, når du vil give en enkelt medarbejder adgang til sine egne scanningsresultater.
3. Vælg en **Udløbsdato** — 7 dage, 30 dage, 90 dage, 1 år eller Aldrig. 3. Angiv eventuelt et **Datointerval** — brug felterne "Elementer fra" og "Elementer til" for at begrænse modtagerens visning til elementer ændret inden for en bestemt periode. Lad begge felter stå tomme for ingen datobegrænsning.
4. Klik på **Opret**. Der genereres et unikt link: `http://host:5100/view?token=…` 4. Vælg en **Udløbsdato** — 7 dage, 30 dage, 90 dage, 1 år eller Aldrig.
5. Klik på **Kopiér** for at kopiere linket til udklipsholderen, og send det til gennemgangeren. 5. Klik på **Opret**. Formularen ryddes, og det nye link vises øverst i listen **Aktive links** nedenfor, kortvarigt fremhævet.
6. Klik på **Kopiér** i linkets række for at kopiere det til udklipsholderen, og send det til gennemgangeren.
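The link options in the steps above (scope, optional date range, expiry) can be modelled as a small record with an unguessable token. This is a sketch under stated assumptions: the `ShareLink` class, its field names and `create_link` are hypothetical and not taken from the scanner's code; only the `/view?token=` URL shape comes from the manual.

```python
import secrets
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ShareLink:
    token: str
    scope: str                            # e.g. "all", "staff", "student", "user"
    user: Optional[str] = None            # only meaningful for scope "user"
    items_from: Optional[date] = None     # None means no lower date bound
    items_to: Optional[date] = None       # None means no upper date bound
    expires_at: Optional[datetime] = None  # None means the link never expires

    def url(self, base: str = "http://host:5100") -> str:
        return f"{base}/view?token={self.token}"

    def is_active(self, now: datetime) -> bool:
        return self.expires_at is None or now < self.expires_at

    def allows(self, modified: date) -> bool:
        """Check an item's modification date against the optional range."""
        if self.items_from is not None and modified < self.items_from:
            return False
        if self.items_to is not None and modified > self.items_to:
            return False
        return True


def create_link(scope: str, *, days: Optional[int] = None,
                now: Optional[datetime] = None, **fields) -> ShareLink:
    """Create a link; days=None corresponds to the 'never expires' option."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=days) if days is not None else None
    return ShareLink(token=secrets.token_urlsafe(32), scope=scope,
                     expires_at=expires, **fields)


created = datetime(2026, 6, 22, tzinfo=timezone.utc)
link = create_link("staff", days=7, now=created, items_from=date(2026, 1, 1))
```

`secrets.token_urlsafe` draws from the operating system's CSPRNG, which is the property that matters here, since the token is the only credential the reviewer presents.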
Gennemgangeren åbner linket i en browser. De kan se resultatgitteret (afgrænset til det tilladte rolleomfang) og mærke dispositioner, men kan ikke starte scanninger, ændre indstillinger, se loginoplysninger eller slette elementer. Gennemgangeren åbner linket i en browser. De kan se resultatgitteret (afgrænset til det tilladte rolleomfang) og mærke dispositioner, men kan ikke starte scanninger, ændre indstillinger, se loginoplysninger eller slette elementer.
@ -445,6 +496,7 @@ Gå til **Indstillinger → Planlægger** for at konfigurere automatiske scannin
7. Aktiver eventuelt: 7. Aktiver eventuelt:
- **Send rapport automatisk** — send Excel-rapporten pr. e-mail til dine konfigurerede modtagere efter hver scanning. - **Send rapport automatisk** — send Excel-rapporten pr. e-mail til dine konfigurerede modtagere efter hver scanning.
- **Håndhæv opbevaringspolitik** — slet automatisk elementer ældre end din opbevaringspolitik efter hver scanning. - **Håndhæv opbevaringspolitik** — slet automatisk elementer ældre end din opbevaringspolitik efter hver scanning.
- **Kun rapport** — spring scanningen over og send blot de seneste resultater fra databasen som e-mail. Nyttigt til regelmæssige opsummerings-e-mails uden at køre en ny scanning. Når aktiveret, kræves ingen profil, og M365-godkendelse er ikke nødvendig.
8. Klik på **Gem**. 8. Klik på **Gem**.
Planlæggerindikatoren i topbjælken viser dato og tidspunkt for den næste planlagte scanning ("Næste: …"). Planlæggerindikatoren i topbjælken viser dato og tidspunkt for den næste planlagte scanning ("Næste: …").
@ -476,7 +528,17 @@ Klik på **Gem** for at gemme, og klik derefter på **Test** for at sende en tes
> Hvis din konto har MFA (to-faktor-godkendelse) aktiveret, kan du ikke bruge din almindelige adgangskode. Du skal oprette en **app-adgangskode** i din kontos sikkerhedsindstillinger: > Hvis din konto har MFA (to-faktor-godkendelse) aktiveret, kan du ikke bruge din almindelige adgangskode. Du skal oprette en **app-adgangskode** i din kontos sikkerhedsindstillinger:
> - **Personlig Microsoft-konto**: account.microsoft.com/security → App-adgangskoder > - **Personlig Microsoft-konto**: account.microsoft.com/security → App-adgangskoder
> - **Gmail**: myaccount.google.com → Sikkerhed → 2-trinsbekræftelse → App-adgangskoder > - **Gmail / Google Workspace**: myaccount.google.com → Sikkerhed → 2-trinsbekræftelse → App-adgangskoder (for Google Workspace-konti skal din administrator først tillade app-adgangskoder eller opsætte et SMTP-relay)
### Send altid via SMTP (spring Microsoft Graph over)
Når scanneren er logget på Microsoft 365, sender den normalt e-mail gennem Microsoft 365 direkte, uden at bruge SMTP-indstillingerne ovenfor. Det er praktisk, men det kan ikke levere til visse adresser — især en adresse på et Google-hostet underdomæne af dit Microsoft 365-domæne, som Microsoft 365 opfatter som intern og kasserer i stilhed (ingen levering, ingen fejl).
Slå **Send altid via SMTP (spring Microsoft Graph over)** til for at tvinge al e-mail — test-e-mails, manuelle rapporter og automatisk e-mail efter scanning — gennem den SMTP-server, du har konfigureret ovenfor. Brug dette, når dine rapporter sendes til en postkasse, som Microsoft 365 ikke kan levere til (f.eks. en Google Workspace-adresse), med `smtp.gmail.com` / `smtp-relay.gmail.com` som SMTP-vært.
### Send rapport efter manuel scanning
Slå **Send rapport efter manuel scanning** til for automatisk at sende rapporten pr. e-mail til dine konfigurerede modtagere, hver gang en manuel scanning er færdig.
### Send en rapport manuelt ### Send en rapport manuelt
@ -516,6 +578,7 @@ Klik på **Nulstil database** for at slette alle scanningsdata, dispositioner og
| Indstilling | Beskrivelse | | Indstilling | Beskrivelse |
|-------------|-------------| |-------------|-------------|
| Tema | Mørkt eller lyst | | Tema | Mørkt eller lyst |
| Softwareopdatering | Søg efter og installér nye versioner af scanneren direkte fra browseren, eller slå automatisk daglig opdatering til. Vises kun på serverinstallationer, der kører fra et git-checkout (ikke i skrivebordsappen). Programmet genstarter selv efter installation; opdatering afvises, mens en scanning kører, og næste scanning efter en opdatering fortsætter normalt. |
### Fanen Sikkerhed ### Fanen Sikkerhed
@ -537,6 +600,27 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
**Min. CPR-antal pr. fil** — en fil flagges kun, hvis den indeholder mindst dette antal *distinkte* CPR-numre. Standardværdien er 1 (nuværende adfærd). Sæt til 2 for at undgå falske positive ved elevscanninger: en elevs samtykkeerklæring eller indmeldelsesformular indeholder typisk kun elevens eget CPR-nummer, mens en klasseliste eller karakteroversigt med flere elevers CPR-numre stadig vil blive rapporteret. **Min. CPR-antal pr. fil** — en fil flagges kun, hvis den indeholder mindst dette antal *distinkte* CPR-numre. Standardværdien er 1 (nuværende adfærd). Sæt til 2 for at undgå falske positive ved elevscanninger: en elevs samtykkeerklæring eller indmeldelsesformular indeholder typisk kun elevens eget CPR-nummer, mens en klasseliste eller karakteroversigt med flere elevers CPR-numre stadig vil blive rapporteret.
**Kun CPR-tilstand** — når aktiveret, springes elementer uden CPR-numre over (kun e-mailadresser, telefonnumre, ansigter eller GPS/EXIF-data). Brug dette, når du ønsker en rapport, der udelukkende fokuserer på CPR-eksponering.
**OCR-sprog** — vælger den sprogpakke, Tesseract bruger, når der læses tekst fra scannede PDF-filer og billeder. Standard: `Dansk + Engelsk`. Skift til en anden forudindstilling for dokumenter på tysk, svensk eller fransk.
### Fanen AI / NER
Gå til **Indstillinger → AI / NER** for at konfigurere Claude AI-drevet navnegenkendelse.
Som standard bruger scanneren spaCy (en lokal maskinlæringsmodel) til at genkende personnavne, adresser og organisationsnavne i dokumenttekst. Aktivering af Claude NER erstatter dette med kald til Claude Haiku API, som er betydeligt mere nøjagtig — særligt for danske dobbeltefternavne (f.eks. "Hansen-Nielsen"), fremmedsprogede navne og navne uden omgivende kontekst (f.eks. isolerede celler i et regneark).
**Sådan aktiverer du:**
1. Opret en Anthropic API-nøgle på [console.anthropic.com](https://console.anthropic.com).
2. Indsæt nøglen i feltet **Anthropic API-nøgle** og klik på **Gem**.
3. Slå **Aktiver Claude NER**-kontakten til og klik på **Gem** igen.
4. Klik på **Test nøgle** for at bekræfte, at nøglen er gyldig og API'et er tilgængeligt.
**Pris:** Claude Haiku faktureres pr. token efter Anthropics offentliggjorte priser. Et typisk dokument koster en brøkdel af en øre. Scanningsresultater caches pr. dokument, så genscanning af den samme fil aldrig medfører en ny opkrævning.
**Fallback:** Hvis `anthropic`-pakken ikke er installeret, eller API-nøglen mangler, falder scanneren automatisk tilbage til spaCy uden fejl — kontakten har blot ingen effekt.
**Opbevaringspolitik** — når aktiveret, markeres elementer ældre end det angivne antal år som forældet. Regnskabsårets afslutning bestemmer, hvordan skæringsdatoen beregnes: **Opbevaringspolitik** — når aktiveret, markeres elementer ældre end det angivne antal år som forældet. Regnskabsårets afslutning bestemmer, hvordan skæringsdatoen beregnes:
| Indstilling | Beregning af skæringsdato | | Indstilling | Beregning af skæringsdato |
@ -545,6 +629,12 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
| 31 dec (Bogføringsloven) | Seneste 31. december minus N år | | 31 dec (Bogføringsloven) | Seneste 31. december minus N år |
| 30 jun / 31 mar | Seneste forekomst af den dato minus N år | | 30 jun / 31 mar | Seneste forekomst af den dato minus N år |
### Fanen Revisionslog
Gå til **Indstillinger → Revisionslog** for at se en uforanderlig log over alle væsentlige administrative handlinger i scanneren. Hver post viser tidspunkt, handlingstype, detaljer og klientens IP-adresse. Registrerede hændelser omfatter: gem/slet profil, opret/tilbagekald viewer-token, PIN-ændringer, tilføj/opdater/slet filkilde, gem/slet planlagt job, start/stop scanning, gem SMTP-konfiguration, dispositionsændringer, slet element og redigér element.
Loggen er skrivebeskyttet og gemmes i scannerdatabasen sammen med scanningsresultaterne. Den er inkluderet i databaseeksporter og kan hjælpe dig med at dokumentere ansvarlighed over for en tilsynsmyndighed.
--- ---
## 15. Ofte stillede spørgsmål ## 15. Ofte stillede spørgsmål
@ -556,7 +646,7 @@ Nej. CPR-numre fundet under en scanning gemmes kun som et antal (f.eks. "3 CPR-n
E-mails flyttes til brugerens **Slettet post**-mappe i Exchange — de slettes ikke permanent og kan gendannes af brugeren eller en administrator. Filer flyttes til **papirkurven** i den pågældende tjeneste (OneDrive, SharePoint, filsystem). Permanent sletning kræver en efterfølgende handling af brugeren eller administrator. E-mails flyttes til brugerens **Slettet post**-mappe i Exchange — de slettes ikke permanent og kan gendannes af brugeren eller en administrator. Filer flyttes til **papirkurven** i den pågældende tjeneste (OneDrive, SharePoint, filsystem). Permanent sletning kræver en efterfølgende handling af brugeren eller administrator.
**Kan jeg scanne uden at forbinde til Microsoft 365?** **Kan jeg scanne uden at forbinde til Microsoft 365?**
Ja. Du kan scanne lokale og SMB-filshares uden nogen M365- eller Google-forbindelse. Åbn **Kilder**, gå til fanen **Filkilder**, og tilføj dine filstier. Ja. Du kan scanne lokale mapper, SMB/NAS-drev og SFTP-servere uden nogen M365- eller Google-forbindelse. Åbn **Kilder**, gå til fanen **Filkilder**, og tilføj dine filstier eller SFTP-serveroplysninger.
**Hvad er delta-scanning, og hvornår skal jeg bruge det?** **Hvad er delta-scanning, og hvornår skal jeg bruge det?**
Delta-scanning bruger Microsoft Graphs ændringstokens (for M365) og Google Drive Changes API (for Google Workspace) til kun at hente elementer ændret siden den seneste scanning. Det er ideelt til regelmæssige (f.eks. ugentlige) compliance-tjek efter, at du har gennemført en fuld basisscan. Aktiver det i afsnittet Indstillinger i venstre panel. Delta-scanning bruger Microsoft Graphs ændringstokens (for M365) og Google Drive Changes API (for Google Workspace) til kun at hente elementer ændret siden den seneste scanning. Det er ideelt til regelmæssige (f.eks. ugentlige) compliance-tjek efter, at du har gennemført en fuld basisscan. Aktiver det i afsnittet Indstillinger i venstre panel.
@ -582,6 +672,15 @@ Ja. Gå til **Indstillinger → Sikkerhed → Interface-PIN** og angiv en 48-
**Kan en gennemganger mærke dispositioner uden adgang til scanningskontrollerne?** **Kan en gennemganger mærke dispositioner uden adgang til scanningskontrollerne?**
Ja. Brug **🔗 Del**-knappen til at oprette et skrivebeskyttet viewer-link eller angiv en Viewer-PIN under Indstillinger → Sikkerhed. Gennemgangeren åbner linket i sin browser og kan gennemse resultater og mærke dispositioner uden at se loginoplysninger, kilder eller scanningsknapper. Se afsnit 10 for detaljer. Ja. Brug **🔗 Del**-knappen til at oprette et skrivebeskyttet viewer-link eller angiv en Viewer-PIN under Indstillinger → Sikkerhed. Gennemgangeren åbner linket i sin browser og kan gennemse resultater og mærke dispositioner uden at se loginoplysninger, kilder eller scanningsknapper. Se afsnit 10 for detaljer.
**Kan jeg begrænse et delelink til en bestemt tidsperiode?**
Ja. Brug felterne "Elementer fra" og "Elementer til" i delingspanelet, når du opretter et token-link. Modtageren vil kun se elementer, hvis ændringsdato falder inden for det angivne interval.
**Hvor kan jeg se, hvem der har ændret hvad i scanneren?**
Gå til **Indstillinger → Revisionslog**. Alle væsentlige administrative handlinger logges med tidsstempel, handlingstype, detaljer og IP-adresse.
**Vil aktivering af Claude NER øge omkostningerne væsentligt?**
For en typisk skole- eller kommunescanning er omkostningen ubetydelig — Claude Haiku faktureres i brøkdele af en øre pr. dokument, og resultater caches, så det samme dokument aldrig faktureres to gange. En fuld scanning af 10.000 dokumenter koster typisk under 7 kr. Den største gevinst er i navnetætte dokumenter (klasselister, sagsmapper), hvor spaCy tidligere gik glip af mange navne.
--- ---
*GDPR Scanner v1.6.20 — teknisk opsætning og konfiguration: se README.md* *GDPR Scanner v1.7.9 — teknisk opsætning og konfiguration: se README.md*


@ -1,6 +1,6 @@
# GDPR Scanner — User Manual # GDPR Scanner — User Manual
Version 1.6.20 Version 1.7.9
--- ---
@ -33,7 +33,7 @@ When items are found, you can review them, decide what to do with each one (keep
**What it scans:** **What it scans:**
- Microsoft 365: Exchange email, OneDrive, SharePoint, Teams - Microsoft 365: Exchange email, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drive - Google Workspace: Gmail, Google Drive
- Local and network file shares (including SMB/NAS drives) - Local and network file shares (including SMB/NAS drives and SFTP servers)
**What it finds:** **What it finds:**
- CPR numbers (Danish civil registration numbers) - CPR numbers (Danish civil registration numbers)
@ -50,16 +50,16 @@ When items are found, you can review them, decide what to do with each one (keep
When you open the scanner, the screen is divided into three areas: When you open the scanner, the screen is divided into three areas:
``` ```
┌─────────────────┬──────────────────────────────────────────┐ ┌─────────────────┬──────────────────────────────────────────
│ │ Top bar: Scan button, profiles, actions │ │ │ Top bar: Scan button, profiles, actions │
│ Left sidebar ├──────────────────────────────────────────┤ │ Left sidebar ├──────────────────────────────────────────
│ │ │ │ │ │
│ - Sources │ Results / scan progress │ │ - Sources │ Results / scan progress │
│ - Options │ │ │ - Options │ │
│ - Accounts │ │ │ - Accounts │ │
│ - Stats ├──────────────────────────────────────────┤ │ - Stats ├──────────────────────────────────────────
│ │ Activity log │ │ │ Activity log │
└─────────────────┴──────────────────────────────────────────┘ └─────────────────┴──────────────────────────────────────────
``` ```
**Left sidebar** — choose what to scan and how. **Left sidebar** — choose what to scan and how.
@ -104,17 +104,33 @@ The Google Workspace tab lets you connect a Google Workspace (formerly G Suite)
| Gmail | All emails in each user's inbox and labels | | Gmail | All emails in each user's inbox and labels |
| Google Drive | All files owned by or shared with each user | | Google Drive | All files owned by or shared with each user |
### 3.3 Local and Network File Shares ### 3.3 Local, Network, and SFTP File Sources
The **Filkilder** (File Sources) tab lists any local folders or network drives you have configured. The **Filkilder** (File Sources) tab lists any local folders, network drives, or SFTP servers you have configured.
**To add a new file source:** **To add a new file source:**
1. Enter a **Label** — a friendly name you will recognise (e.g. "Skolens Fællesmappe"). 1. Enter a **Label** — a friendly name you will recognise (e.g. "Skolens Fællesmappe").
2. Enter the **Path**: 2. Select the **source type** using the pill selector at the top of the form:
- Local folder: `~/Documents` or `/Volumes/Share`
- Network share: `//nas-server/shared` or `\\server\share` **Local**
3. If it is a network share, fill in the **SMB Host**, **Username**, and **Password** that appear automatically. The password is stored securely in your system keychain. - Enter the **Path** to the folder: `~/Documents` or `/Volumes/Share`.
4. Click **Tilføj** (Add). - Click **Tilføj** (Add).
**Network (SMB)**
- Enter the **Path** in UNC format: `//nas-server/shared` or `\\server\share`.
- Fill in the **SMB Host**, **Username**, and **Password** that appear. The password is stored securely in your system keychain.
- Click **Tilføj** (Add).
**SFTP**
- Enter the **Host** (hostname or IP address of the SSH/SFTP server).
- Enter the **Port** (default 22).
- Enter the **Username**.
- Enter the **Remote path** to scan (e.g. `/home/shared` or `/`).
- Choose the **Authentication type**:
- **Password** — enter the password. It is stored securely in your system keychain.
- **Private key** — click **Upload key file** and select your SSH private key (OpenSSH or PEM format). If the key is passphrase-protected, enter the passphrase. The key file is stored in the scanner's data directory with `600` permissions.
- Click **Tilføj** (Add).
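As an illustration of the `600` permission mentioned for uploaded private keys, the following Python sketch writes a key file that only its owner can read and write. The helper name `store_private_key` and the directory layout are assumptions made for this example, not the scanner's actual code, and the permission bits only behave this way on POSIX systems.

```python
import os
import stat
from pathlib import Path


def store_private_key(data_dir: Path, name: str, key_bytes: bytes) -> Path:
    """Write key_bytes to data_dir/name with owner-only (0o600) permissions.

    Hypothetical helper: the name and directory layout are illustrative.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / name
    # Create the file with mode 0o600 from the start so it is never
    # briefly readable by other users.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(key_bytes)
    # Re-apply the mode in case the file already existed with looser bits.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Opening the file with the restrictive mode, rather than writing it first and tightening it afterwards, avoids a short window in which the key could be read by other accounts on the server.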
You can add as many file sources as you need. Each one will appear as a selectable source in the main sidebar when you are ready to scan. You can add as many file sources as you need. Each one will appear as a selectable source in the main sidebar when you are ready to scan.
@ -154,6 +170,10 @@ Only scan items modified after a certain date. Quick presets — **1 år**, **2
**Max emails per user** — stop after scanning this many emails per person (default 2,000). Increase if you need complete coverage. **Max emails per user** — stop after scanning this many emails per person (default 2,000). Increase if you need complete coverage.
**CPR-only mode** — when enabled, only items containing at least one qualifying CPR number are flagged. Items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped. Useful when you want a focused CPR-only report without noise from other data types.
**OCR language** — choose the language pack(s) Tesseract uses when reading text from scanned PDFs and images. The default `Danish + English` covers the vast majority of documents. Switch to a different preset if your documents are predominantly in another language.
### 4.4 Start the Scan ### 4.4 Start the Scan
Click the blue **Scan** button in the top bar. Click the blue **Scan** button in the top bar.
@ -180,6 +200,8 @@ Click **▶ Genoptag** to continue from where the scan left off. Click **Start f
## 5. Understanding the Results ## 5. Understanding the Results
When you open the app, the grid shows **all open items** — every flagged item that still needs action (i.e. has no disposition), across all of your scans, not just the most recent one. As you tag items (kept, redacted, deleted, false positive, …) they drop out of this view, so what remains is your outstanding work. Each item appears once, showing its most recent state. To look at a single past scan instead, use the session picker (see *Browsing past scan sessions* below).
Each flagged item appears as a card. Here is what the badges and labels mean: Each flagged item appears as a card. Here is what the badges and labels mean:
### Source badges ### Source badges
@ -192,7 +214,8 @@ Each flagged item appears as a card. Here is what the badges and labels mean:
| Teams | Found in a Teams channel | | Teams | Found in a Teams channel |
| Gmail | Found in a Gmail mailbox | | Gmail | Found in a Gmail mailbox |
| Google Drive | Found in Google Drive | | Google Drive | Found in Google Drive |
| Local / Network | Found on a file share | | Local / Network | Found on a local or SMB file share |
| 🔒 SFTP | Found on an SFTP server |
### Risk level ### Risk level
@ -235,7 +258,7 @@ Once a scan has completed, you can review results from any earlier scan session
- Click the **Sessions** button in the history banner (which appears above the results grid after a scan completes) to open the session picker. - Click the **Sessions** button in the history banner (which appears above the results grid after a scan completes) to open the session picker.
- Each row shows the date and time, which sources were scanned, and how many items were flagged. A **Δ** badge marks delta scans; **Latest** marks the most recent session. - Each row shows the date and time, which sources were scanned, and how many items were flagged. A **Δ** badge marks delta scans; **Latest** marks the most recent session.
- Click any row to load that session's results into the grid. A history banner replaces the progress bar, showing the session details. - Click any row to load that session's results into the grid. A history banner replaces the progress bar, showing the session details.
- Click **Latest scan** in the banner to jump back to the most recent session. - Click **Open items** in the banner to leave the past session and return to the default view of all items still needing action.
- Starting a new scan automatically exits history mode and switches back to live results. - Starting a new scan automatically exits history mode and switches back to live results.
All filters, exports, and disposition tagging work normally while browsing past sessions. All filters, exports, and disposition tagging work normally while browsing past sessions.
@ -253,6 +276,7 @@ The preview shows:
- All CPR numbers found and their context - All CPR numbers found and their context
- Other personal data detected (phone, email address, IBAN, etc.) - Other personal data detected (phone, email address, IBAN, etc.)
- Sharing and external-access information - Sharing and external-access information
- **Related documents** — if other items in the same scan session share one or more CPR numbers with this item, a "Related documents" section lists them. Click any row to open that item's preview. This helps you track the same person's data across multiple files or emails.
### Setting a disposition ### Setting a disposition
@ -268,7 +292,31 @@ Every item has a **Disposition** dropdown in the preview panel. Choose one of:
| Privat brug — uden for scope | Personal item, not in scope for GDPR processing | | Privat brug — uden for scope | Personal item, not in scope for GDPR processing |
| Slettet | Already deleted (set automatically when you delete an item) | | Slettet | Already deleted (set automatically when you delete an item) |
After choosing, click **Gem**. A small **✓ Gemt** confirmation appears. After choosing, click **Save**. A small **✓ Saved** confirmation appears.
### Redacting a file in-place
A **✂** button appears on result cards where the scanner can overwrite the file directly. Clicking it replaces all CPR numbers with `██████-████` blocks and logs the action as a `"redacted"` disposition, which is useful when you want to sanitise a file rather than delete it entirely. The card is **kept in the grid until your next scan**: it is greyed out, shows a green **✏ Redacted** badge, and its action buttons are hidden so it cannot be processed again. This lets you see at a glance what you handled during the session; the grid is rebuilt the next time you scan.
The button is available for the following source types and formats:
| Source | Supported formats |
|---|---|
| Local files | DOCX, XLSX, CSV, TXT, PDF |
| Network share (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drive | DOCX, XLSX, PDF |
The button is **not** available for email items (Exchange/Gmail) or viewer mode. Google Docs and Sheets that were exported as DOCX/XLSX during scanning cannot be redacted in-place — export the file from Google manually first, then redact the downloaded copy.
> **PDF security note:** PDF redaction uses physical removal — the CPR number text is erased from the PDF data stream, not just painted over with a black box. A reader cannot recover the original text by selecting under the redaction or inspecting the file programmatically. Image-based (scanned) PDFs are also supported: the scanner locates the CPR number on the page image via OCR and physically overwrites that region.
> **OneDrive / SharePoint / Teams note:** Redaction writes the modified file back via the Microsoft Graph API and requires the `Files.ReadWrite.All` permission. The scanner now requests this permission automatically during sign-in. If you authenticated before this update, sign out and sign back in (Settings → Microsoft 365 → Sign out) so the scanner obtains a new token with write access. For app-only (service principal) setups, a Global Admin must grant the `Files.ReadWrite.All` application permission in Azure → App registrations → API permissions → Grant admin consent.
> **Google Drive note:** Drive redaction requires the `drive` scope on the service account's domain-wide delegation grant (not just `drive.readonly`). If redaction fails with a permission error, ask your Google Workspace admin to add the `https://www.googleapis.com/auth/drive` scope to the service account delegation in the Admin Console.
> **SFTP note:** SFTP redaction is only available for items found in the current scan session. If you are browsing historical results, re-run the scan first.
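For plain-text formats such as TXT and CSV, the replacement described in this section can be pictured roughly as follows. This is a minimal sketch under stated assumptions: `CPR_PATTERN` is a simplified pattern (six digits, an optional hyphen, four digits) and `redact_cpr` is a hypothetical name. The scanner's real detection is likely stricter, and DOCX, XLSX, and PDF files need format-aware handling that is not shown here.

```python
import re

# Simplified CPR pattern: DDMMYY, optional hyphen, four-digit sequence number.
# Assumption: real detection adds date validation and context checks.
CPR_PATTERN = re.compile(r"\b\d{6}-?\d{4}\b")
MASK = "██████-████"


def redact_cpr(text: str) -> tuple[str, int]:
    """Return the text with CPR-like numbers masked and the number of replacements."""
    return CPR_PATTERN.subn(MASK, text)
```

Returning the replacement count alongside the text makes it straightforward to record how many numbers were masked when the action is logged.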
### Bulk tagging multiple items at once ### Bulk tagging multiple items at once
@ -316,6 +364,8 @@ Click the **Delete** button in the filter bar to open the bulk delete modal.
4. A progress bar shows deletions as they happen. Emails go to **Deleted Items**; files go to the **recycle bin**. 4. A progress bar shows deletions as they happen. Emails go to **Deleted Items**; files go to the **recycle bin**.
Deleted items (whether from a single delete, a bulk delete, or a data-subject erasure) are **kept in the grid until your next scan** — greyed out with a red **🗑 Deleted** badge and their action buttons hidden — so you can see what was removed during the session. When a bulk delete partially fails, only the items the server actually deleted are marked; any that failed stay active so you can retry them. The grid is rebuilt the next time you scan.
A full audit log of every deletion (what was deleted, when, and why) is included in the Article 30 report. A full audit log of every deletion (what was deleted, when, and why) is included in the Article 30 report.
--- ---
@ -352,7 +402,7 @@ Click **Profiles** to open the profile management panel. Here you can:
Click **Excel** in the filter bar to download the current results as an Excel workbook. The workbook contains: Click **Excel** in the filter bar to download the current results as an Excel workbook. The workbook contains:
- A summary tab with scan date, item counts, and source breakdown. - A summary tab with scan date, item counts, and source breakdown.
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network). - A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network, SFTP).
- Every flagged item, including source, account, CPR count, risk level, sharing status, and disposition. - Every flagged item, including source, account, CPR count, risk level, sharing status, and disposition.
The **Excel** and **Art.30** buttons are always available — even after restarting the application — and will export the results from the most recent completed scan session without requiring a new scan. The **Excel** and **Art.30** buttons are always available — even after restarting the application — and will export the results from the most recent completed scan session without requiring a new scan.
@ -391,9 +441,10 @@ Click the **🔗** button in the top-right of the top bar to open the Share pane
- **All roles** — the recipient sees all flagged items. - **All roles** — the recipient sees all flagged items.
- **Ansatte** / **Elever** — the recipient sees only items belonging to that role group. The role filter is locked in their view. - **Ansatte** / **Elever** — the recipient sees only items belonging to that role group. The role filter is locked in their view.
- **User** — the recipient sees only the items belonging to a specific employee. Select the person from the search box; the scanner matches both their M365 and Google Workspace email addresses automatically. Use this when you want to give an individual employee access to their own scan results. - **User** — the recipient sees only the items belonging to a specific employee. Select the person from the search box; the scanner matches both their M365 and Google Workspace email addresses automatically. Use this when you want to give an individual employee access to their own scan results.
3. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never. 3. Optionally set a **Date range** — use the "Items from" and "Items until" date fields to limit the recipient to items modified within a specific period. This lets you, for example, create a link covering only last year's scan results. Leave both fields blank for no date restriction.
4. Click **Create**. A unique link is generated: `http://host:5100/view?token=…` 4. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never.
5. Click **Copy** to copy the link to your clipboard, then send it to the reviewer. 5. Click **Create**. The form clears and the new link appears at the top of the **Active links** list below, briefly highlighted.
6. Click **Copy** on that link's row to copy it to your clipboard, then send it to the reviewer.
The reviewer opens the link in any browser. They see the results grid (filtered to their permitted scope) and can tag dispositions but cannot start scans, change settings, view credentials, or delete items. The reviewer opens the link in any browser. They see the results grid (filtered to their permitted scope) and can tag dispositions but cannot start scans, change settings, view credentials, or delete items.
@ -445,6 +496,7 @@ Go to **Settings → Planlægger** to configure automatic scans.
7. Optionally enable: 7. Optionally enable:
- **Send rapport automatisk** — email the Excel report to your configured recipients after each scan. - **Send rapport automatisk** — email the Excel report to your configured recipients after each scan.
- **Håndhæv opbevaringspolitik** — automatically delete items older than your retention policy after each scan. - **Håndhæv opbevaringspolitik** — automatically delete items older than your retention policy after each scan.
- **Report only** — skip the scan entirely and just email the latest results already in the database. Useful for sending a regular summary email without running a new scan. When enabled, no profile is needed and M365 authentication is not required.
8. Click **Gem** (Save). 8. Click **Gem** (Save).
The scheduler indicator in the top bar shows the date and time of the next scheduled scan ("Next: …"). The scheduler indicator in the top bar shows the date and time of the next scheduled scan ("Next: …").
@ -476,7 +528,17 @@ Click **Gem** to save, then click **Test** to send a test email and verify the c
> If your account has MFA (two-factor authentication) enabled, you cannot use your regular password. You need to create an **App Password** in your account security settings: > If your account has MFA (two-factor authentication) enabled, you cannot use your regular password. You need to create an **App Password** in your account security settings:
> - **Microsoft personal account**: account.microsoft.com/security → App passwords > - **Microsoft personal account**: account.microsoft.com/security → App passwords
> - **Gmail**: myaccount.google.com → Security → 2-Step Verification → App passwords > - **Gmail / Google Workspace**: myaccount.google.com → Security → 2-Step Verification → App passwords (for Google Workspace accounts your administrator must first allow App Passwords, or set up an SMTP relay)
### Always send via SMTP (skip Microsoft Graph)
When the scanner is signed in to Microsoft 365, it normally sends email through Microsoft 365 directly, without using the SMTP settings above. This is convenient, but it cannot deliver to some addresses — most notably an address on a Google-hosted subdomain of your Microsoft 365 domain, which Microsoft 365 treats as internal and silently discards (no delivery, no error).
Turn on **Send altid via SMTP (spring Microsoft Graph over)** to force all email — test emails, manual reports, and the after-scan auto-email — through the SMTP server you configured above. Use this when your reports go to a mailbox Microsoft 365 won't deliver to (for example a Google Workspace address), with `smtp.gmail.com` / `smtp-relay.gmail.com` as the SMTP host.
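To show what sending through an SMTP server directly involves, here is a minimal Python sketch using the standard library. The function names `build_report_email` and `send_via_smtp` are illustrative, not the scanner's own, and the use of port 587 with STARTTLS is an assumption that suits `smtp.gmail.com`; a relay such as `smtp-relay.gmail.com` may be configured differently.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender: str, recipients: list[str], subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text report email (attachments omitted for brevity)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_smtp(msg: EmailMessage, host: str, port: int, username: str, password: str) -> None:
    """Send msg through the given SMTP server, upgrading to TLS before login."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls()                  # encrypt the connection
        smtp.login(username, password)   # e.g. an App Password when MFA is enabled
        smtp.send_message(msg)
```

A call such as `send_via_smtp(msg, "smtp.gmail.com", 587, username, app_password)` hands the message to the SMTP server instead of Microsoft 365, which is the routing this toggle forces.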
### Email report after manual scan
Turn on **Send rapport efter manuel scanning** to automatically email the report to your configured recipients every time a manual scan finishes.
### Sending a report manually ### Sending a report manually
@ -516,6 +578,7 @@ Click **Reset DB** to wipe all scan data, dispositions, and deletion log. This i
| Setting | Description | | Setting | Description |
|---------|-------------| |---------|-------------|
| Theme | Dark or light mode | | Theme | Dark or light mode |
| Software update | Check for and install new versions of the scanner directly from the browser, or enable automatic daily updates. Only shown on server installations running from a git checkout (not in the desktop app). The app restarts itself after installing; updating is refused while a scan is running, and the next scan after an update continues normally. |
### Security tab ### Security tab
@ -537,6 +600,27 @@ These options are in the left sidebar under **Indstillinger**:
**Min. CPR count per file** — only flag a file if it contains at least this many *distinct* CPR numbers. The default is 1 (current behaviour). Setting it to 2 avoids false positives in student scans: a student's own consent form or registration document typically contains only their own CPR number, while a class list or grade sheet containing multiple students' CPRs will still be reported. **Min. CPR count per file** — only flag a file if it contains at least this many *distinct* CPR numbers. The default is 1 (current behaviour). Setting it to 2 avoids false positives in student scans: a student's own consent form or registration document typically contains only their own CPR number, while a class list or grade sheet containing multiple students' CPRs will still be reported.
**CPR-only mode** — when enabled, items with no CPR numbers (only email addresses, phone numbers, faces, or GPS/EXIF data) are skipped entirely. Use this when you want a lean report focused exclusively on CPR exposure.
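The interaction between these two options can be sketched as a single decision function. This is an interpretation of the descriptions above, not the scanner's actual logic: `should_flag` is a hypothetical name, and the choice to still flag a file on non-CPR hits when it falls below the CPR threshold (unless CPR-only mode is on) is an assumption.

```python
def should_flag(
    cpr_numbers: list[str],
    other_hits: int,
    min_cpr_count: int = 1,
    cpr_only: bool = False,
) -> bool:
    """Decide whether a file is flagged.

    cpr_numbers: every CPR number found in the file (duplicates allowed).
    other_hits:  count of non-CPR findings (emails, phones, faces, EXIF/GPS).
    """
    # Count distinct numbers; ignore the hyphen so both spellings match.
    distinct = {n.replace("-", "") for n in cpr_numbers}
    if distinct and len(distinct) >= min_cpr_count:
        return True
    if cpr_only:
        # CPR-only mode: non-CPR findings alone never flag a file.
        return False
    return other_hits > 0
```

With `min_cpr_count=2`, a consent form that repeats one student's CPR number several times is not flagged on CPR grounds, while a class list with two or more different numbers is.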
**OCR language** — selects the Tesseract language pack(s) used when reading scanned PDFs and images. Default: `Danish + English`. Change to a different preset if your documents are in another language (German, Swedish, French presets are available).
### AI / NER tab
Go to **Settings → AI / NER** to configure Claude AI-powered Named Entity Recognition.
By default the scanner uses spaCy (a local machine-learning model) to detect person names, addresses, and organisation names in document text. Enabling Claude NER replaces this with calls to the Claude Haiku API, which is significantly more accurate — especially for Danish hyphenated surnames (e.g. "Hansen-Nielsen"), foreign-origin names, and names that appear without surrounding context (such as isolated cells in a spreadsheet).
**To enable:**
1. Obtain an Anthropic API key from [console.anthropic.com](https://console.anthropic.com).
2. Paste the key into the **Anthropic API key** field and click **Save**.
3. Turn on the **Enable Claude NER** toggle and click **Save** again.
4. Click **Test key** to confirm the key is valid and the API is reachable.
**Cost:** Claude Haiku is charged per token at Anthropic's published rates. A typical document costs a fraction of a cent. Scan results are cached per document, so re-scanning the same file never incurs a second charge.
**Fallback:** If the `anthropic` package is not installed or the API key is missing, the scanner automatically falls back to spaCy with no error — the toggle simply has no effect.
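The fallback rule can be summarised in a few lines of Python. This is a sketch of the decision as described, with a hypothetical function name `choose_ner_backend`; how the scanner actually stores the toggle and the key is not shown in this manual.

```python
import importlib.util
from typing import Optional


def choose_ner_backend(claude_enabled: bool, api_key: Optional[str]) -> str:
    """Return "claude" only when the toggle is on, a key is present,
    and the anthropic package can be imported; otherwise "spacy"."""
    if not claude_enabled:
        return "spacy"
    if not api_key:
        return "spacy"
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("anthropic") is None:
        return "spacy"
    return "claude"
```

Because every failing condition resolves to the local model, a missing key or package degrades accuracy but never interrupts a scan.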
**Retention policy** — when enabled, marks items older than the specified number of years as overdue. The fiscal year end setting determines how the cutoff date is calculated: **Retention policy** — when enabled, marks items older than the specified number of years as overdue. The fiscal year end setting determines how the cutoff date is calculated:
| Option | Cutoff date calculation | | Option | Cutoff date calculation |
@ -545,6 +629,12 @@ These options are in the left sidebar under **Indstillinger**:
| 31 dec (Bogføringsloven) | Last 31 December minus N years | | 31 dec (Bogføringsloven) | Last 31 December minus N years |
| 30 jun / 31 mar | Last occurrence of that date minus N years | | 30 jun / 31 mar | Last occurrence of that date minus N years |
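The fiscal-year-end rows of the table can be worked through with a short Python sketch. The function name `retention_cutoff` is hypothetical, and treating "last occurrence" as the most recent such date on or before today is an assumption about the boundary case.

```python
from datetime import date


def retention_cutoff(today: date, years: int, fy_end_month: int, fy_end_day: int) -> date:
    """Most recent fiscal year end on or before today, minus `years` years."""
    fy_end = date(today.year, fy_end_month, fy_end_day)
    if fy_end > today:
        # This year's fiscal year end has not happened yet; use last year's.
        fy_end = date(today.year - 1, fy_end_month, fy_end_day)
    return date(fy_end.year - years, fy_end_month, fy_end_day)
```

For a 5-year policy evaluated on 22 June 2026, the 31 December option gives a cutoff of 31 December 2020 (last 31 December is 2025, minus five years), while the 31 March option gives 31 March 2021.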
### Audit Log tab
Go to **Settings → Audit Log** to view an immutable log of all significant admin actions performed in the scanner. Each entry shows the time, action type, detail, and client IP address. Recorded events include: profile save/delete, viewer token create/revoke, PIN changes, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, dispositions, item delete, and item redact.
The log is read-only and is stored in the scanner database alongside scan results. It is included in database exports and can help you demonstrate accountability to a supervisory authority.
---
## 15. Frequently Asked Questions
@ -556,7 +646,7 @@ No. CPR numbers found during a scan are stored only as a count (e.g. "3 CPR numb
Emails are moved to the user's **Deleted Items** folder in Exchange — they are not permanently deleted and can be recovered by the user or an administrator. Files are moved to the **recycle bin** of the relevant service (OneDrive, SharePoint, file system). A permanent deletion requires a second action by the user or admin.
**Can I scan without connecting to Microsoft 365?**
Yes. You can scan local and SMB file shares without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths. Yes. You can scan local folders, SMB/NAS drives, and SFTP servers without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths or SFTP server details.
**What is delta scanning and when should I use it?**
Delta scanning uses Microsoft Graph change tokens (for M365) and the Google Drive Changes API (for Google Workspace) to fetch only items modified since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have done a full baseline scan. Enable it in the Options section of the sidebar.
@ -582,6 +672,15 @@ Yes. Go to **Settings → Security → Interface PIN** and set a 4–8 digit PIN
**Can a reviewer tag dispositions without access to the scan controls?**
Yes. Use the **🔗 Share** button to create a read-only viewer link or set a Viewer PIN in Settings → Security. The reviewer opens the link in their browser and can browse results and tag dispositions without seeing credentials, sources, or scan buttons. See section 10 for details.
**Can I limit a reviewer's link to a specific time period?**
Yes. When creating a token link, use the "Items from" and "Items until" date fields to restrict the link to items modified within that range. The reviewer will only see items whose modification date falls within the window you specified.
**Where can I see who changed what in the scanner?**
Go to **Settings → Audit Log**. Every significant admin action is recorded there with a timestamp, action type, detail, and IP address.
**Will enabling Claude NER increase costs significantly?**
For a typical school or municipality scan the cost is negligible. Claude Haiku charges fractions of a cent per document, and results are cached in memory during a run, so unchanged text is not billed twice within the same session. A full scan of 10 000 documents typically costs under $1. The biggest gain is on name-dense documents (class lists, case files) where spaCy previously missed many names.
---
*GDPR Scanner v1.6.20 — for technical setup and configuration see README.md* *GDPR Scanner v1.7.9 — for technical setup and configuration see README.md*

148
docs/setup/ZORAXY_SETUP.md Normal file
View File

@ -0,0 +1,148 @@
# HTTPS via Zoraxy Reverse Proxy
Step-by-step guide for putting GDPRScanner behind [Zoraxy](https://github.com/tobychui/zoraxy) with a Let's Encrypt certificate, on a LAN-only deployment.
Why bother on an internal network:
- **Encryption in transit** — the scanner streams CPR numbers, document previews, and share links. Serving that over plain HTTP to DPO reviewers is itself a compliance finding.
- **Secure context** — the browser Clipboard API (share-link Copy buttons) only exists on HTTPS or localhost. Over plain HTTP the app falls back to a legacy copy mechanism.
- **A real hostname**: `https://gdprscanner.example.dk` instead of `http://10.x.x.x:5100` in share links, bookmarks, and emails.
This guide assumes Zoraxy runs **on the same host** as the scanner. If it runs elsewhere, replace `127.0.0.1:5100` with the scanner host's LAN IP and firewall port 5100 to the Zoraxy host only.
---
## 1. DNS record
Create an A-record for the hostname pointing at the server's **LAN IP**:
```
gdprscanner.example.dk A 10.x.x.x
```
A public DNS record pointing at a private IP is fine — outsiders can resolve the name but cannot route to the address, which is exactly the "LAN-only" goal.
> **Consequence:** because the server is not reachable from the internet, Let's Encrypt's default HTTP-01 challenge cannot work. The certificate **must** be issued via the **DNS-01 challenge** (step 4). If you prefer not to publish the internal IP at all, use an internal/split-horizon DNS record instead — DNS-01 still works since it validates against the public DNS zone, not the server.
---
## 2. Install Zoraxy
```bash
mkdir -p /opt/zoraxy && cd /opt/zoraxy
wget -O zoraxy https://github.com/tobychui/zoraxy/releases/latest/download/zoraxy_linux_amd64
chmod +x zoraxy
```
`/etc/systemd/system/zoraxy.service`:
```ini
[Unit]
Description=Zoraxy reverse proxy
After=network.target
[Service]
WorkingDirectory=/opt/zoraxy
ExecStart=/opt/zoraxy/zoraxy
Restart=always
[Install]
WantedBy=multi-user.target
```
```bash
systemctl daemon-reload && systemctl enable --now zoraxy
```
Open the management UI at `http://<server-ip>:8000` and create the admin account.
> Menu names below may differ slightly between Zoraxy versions — the concepts to look for are: ACME certificate with DNS challenge, host-based proxy rule, TLS on the incoming port.
---
## 3. Incoming port and TLS
In Zoraxy's global settings:
- Set the incoming proxy port to **443** and enable **TLS**.
- Enable **force-redirect port 80 → 443** so plain-HTTP visits upgrade automatically.
---
## 4. Certificate via ACME (DNS-01)
In **TLS / SSL Certificates → ACME**:
1. Enter the hostname (`gdprscanner.example.dk`).
2. Enable the **DNS challenge** and select the DNS provider that hosts your zone (Cloudflare, Simply.com, etc.).
3. Paste the provider's **API token/credentials** — created in the DNS provider's control panel.
4. Request the certificate. Zoraxy renews it automatically.
If your DNS host has no API, Zoraxy can generate a **self-signed certificate** as a fallback — it works, but every client machine must trust it manually. Getting a DNS API token is the better one-time investment.
---
## 5. Proxy rule
**HTTP Proxy → New Proxy Rule**:
| Field | Value |
|---|---|
| Matching hostname | `gdprscanner.example.dk` |
| Target | `127.0.0.1:5100` |
| TLS to target | Off (the scanner speaks plain HTTP locally) |
---
## 6. Close the side doors
**Bind the scanner to loopback** so only Zoraxy can reach Flask. Wherever the scanner is started (systemd unit or `start_gdpr.sh`), add:
```bash
--host 127.0.0.1
```
After a restart, `http://<server-ip>:5100` stops responding by design. The in-app self-update restart preserves the argument.
Optional hardening:
- Add a Zoraxy **Access Rule** whitelisting your LAN CIDR (e.g. `10.0.0.0/8`) on the proxy rule.
- Firewall the Zoraxy **management port 8000** to admin machines only.
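
To make the CIDR whitelist concrete, the sketch below shows what "inside `10.0.0.0/8`" means for a client address. It uses only Python's standard `ipaddress` module and says nothing about how Zoraxy evaluates its Access Rules internally; the helper name `is_lan_client` is hypothetical.

```python
import ipaddress


def is_lan_client(client_ip: str, allowed_cidr: str = "10.0.0.0/8") -> bool:
    """Return True when client_ip falls inside the allowed network."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(allowed_cidr)


for ip in ("10.1.2.3", "192.168.1.5", "8.8.8.8"):
    print(ip, is_lan_client(ip))
```

Note that `10.0.0.0/8` does not cover other private ranges such as `192.168.0.0/16`; if the LAN uses those, the whitelist needs the matching CIDR.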
---
## 7. Firewall / perimeter checklist
The Zoraxy whitelist (step 6) is an **application-layer** control — a rejected request has still completed the TCP and TLS handshake against your box, and any proxy host you forget to tag is fully exposed. The firewall is the real perimeter. Work this checklist whenever you stand up or replace the edge firewall:
- [ ] **No inbound port-forward unless a service is intentionally public.** A LAN-only deployment needs *zero* inbound forwards — DNS-01 (step 4) is outbound-only, so certificates issue and renew with the firewall fully closed.
- [ ] **If any service is intentionally public** (e.g. a media server), forward **443 only to the Zoraxy host** — never to individual app hosts. Everything then enters through Zoraxy, where the per-host Access Rule decides public vs. private.
- [ ] **The per-host whitelist stays your public/private boundary even with the firewall in place** — it is not made redundant by the firewall. Public hosts use the `default` rule; every internal-only host gets **Local Access Only**.
- [ ] **New proxy hosts default to public.** Zoraxy applies the `default` rule to any host with no rule set, so a freshly-added internal service is reachable the moment it exists. Set its Access Rule to **Local Access Only** *at creation time*.
- [ ] **Management ports are LAN-only.** Zoraxy admin (`:8000`) and any app admin UI must never be forwarded; tag them **Local Access Only** as well.
- [ ] **Verify from off-network.** From a connection outside the LAN (e.g. a phone on mobile data), confirm private hostnames are blocked and only the intentionally-public ones respond:
```bash
curl -v https://gdprscanner.example.dk # should fail/refuse from outside
nmap -Pn -p 80,443,5100 <your-public-IP> # only intentionally-open ports listed
```
---
## 8. Verify the scanner-specific behaviour
1. `https://gdprscanner.example.dk` loads with a valid padlock; `http://` redirects.
2. **Run a scan and watch result cards stream in live** — that is the Server-Sent Events connection (`/api/scan/stream`) passing through the proxy. If progress stalls while the scan log advances, look at proxy buffering/timeout settings.
3. Create a **share link** — it must start with `https://gdprscanner.example.dk/view?token=…`. The app uses the page origin automatically on HTTPS (the LAN-IP rewrite only applies when browsing at localhost). The Copy buttons now use the native Clipboard API.
4. **Settings → General → Software update → Check for updates** still works (outbound git fetch is unaffected by the proxy).
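
When debugging step 2, it can help to know what a healthy Server-Sent Events response looks like on the wire: blank-line-separated blocks of `event:` and `data:` lines, delivered as they are produced. The parser below is a minimal sketch of that framing for inspecting a captured response body; the event payloads in the sample are made up, and the function name `parse_sse` is hypothetical.

```python
def parse_sse(stream_text: str) -> list:
    """Split a captured SSE body into {"event", "data"} dicts."""
    events = []
    for block in stream_text.replace("\r\n", "\n").split("\n\n"):
        name = "message"          # default event type when no "event:" line is sent
        data_lines = []
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue          # blank line or comment (often used as keep-alive)
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space after the colon is dropped
            if field == "event":
                name = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append({"event": name, "data": "\n".join(data_lines)})
    return events


# Hypothetical sample: one keep-alive comment, one named event, one default event.
sample = (
    ": keepalive\n\n"
    "event: scan_file_flagged\n"
    'data: {"id": "a1"}\n\n'
    "data: one\n"
    "data: two\n\n"
)
for ev in parse_sse(sample):
    print(ev)
```

If a proxy buffers the response, these blocks arrive in one burst at the end instead of one at a time, which matches the "progress stalls while the scan log advances" symptom.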
---
## Troubleshooting
| Symptom | Cause / fix |
|---|---|
| Certificate request fails | HTTP-01 attempted against an unreachable host — make sure the **DNS challenge** is selected and the API credentials are for the zone's actual DNS host |
| Cards don't stream during scans | Proxy buffering the SSE response — check Zoraxy timeout/buffering settings for the rule |
| Share links still show the LAN IP | Page was loaded via the old `http://<ip>:5100` URL — use the HTTPS hostname; links follow the page origin |
| `http://<ip>:5100` still reachable | The `--host 127.0.0.1` flag is missing from the scanner's launch command |

View File

@ -117,6 +117,12 @@ try:
except ImportError:
    SPACY_OK = False

try:
    import anthropic as _anthropic
    ANTHROPIC_OK = True
except ImportError:
    ANTHROPIC_OK = False

try:
    from docx import Document as DocxDocument
    DOCX_OK = True
@ -232,6 +238,91 @@ def load_nlp():
return None return None
# ── Claude NER ────────────────────────────────────────────────────────────────

def _get_claude_ner_config() -> "tuple[bool, str]":
    """Read Claude NER settings from config.json. Small file — OS-cached."""
    try:
        from app_config import _load_config, get_claude_api_key
        cfg = _load_config()
        return bool(cfg.get("claude_ner")), get_claude_api_key()
    except Exception:
        return False, ""


_CLAUDE_NER_CACHE: "dict[int, list[dict]]" = {}
_CLAUDE_NER_LOCK = None


def _claude_lock():
    global _CLAUDE_NER_LOCK
    if _CLAUDE_NER_LOCK is None:
        import threading as _th
        _CLAUDE_NER_LOCK = _th.Lock()
    return _CLAUDE_NER_LOCK


def _ner_claude(text: str, api_key: str) -> "list[dict]":
    """
    Extract named entities via Claude Haiku. Returns list of
    {"text": str, "type": "NAME"|"ADDRESS"|"ORG"}.
    In-memory cache keyed by hash(text); evicts oldest when > 2000 entries.
    """
    if not ANTHROPIC_OK or not api_key:
        return []
    cache_key = hash(text)
    lock = _claude_lock()
    with lock:
        if cache_key in _CLAUDE_NER_CACHE:
            return _CLAUDE_NER_CACHE[cache_key]
    try:
        import json as _json
        client = _anthropic.Anthropic(api_key=api_key)
        CHUNK = 8_000
        entities: "list[dict]" = []
        for i in range(0, min(len(text), CHUNK * 10), CHUNK):
            chunk = text[i : i + CHUNK]
            if not chunk.strip():
                continue
            msg = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{
                    "role": "user",
                    "content": (
                        "Extract personal data from the text. "
                        "Return ONLY valid JSON: "
                        "{\"entities\":[{\"text\":\"<exact substring>\","
                        "\"type\":\"NAME\"|\"ADDRESS\"|\"ORG\"}]}. "
                        "NAME=person names, ADDRESS=physical addresses, "
                        "ORG=organisation names. "
                        "Skip CPR numbers, emails, phones, dates. "
                        "Return {\"entities\":[]} if none.\n\nTEXT:\n" + chunk
                    ),
                }],
            )
            raw = msg.content[0].text.strip()
            if "```" in raw:
                raw = raw.split("```")[1]
                if raw.startswith("json\n"):
                    raw = raw[5:]
            entities.extend(_json.loads(raw).get("entities", []))
        result = [e for e in entities
                  if isinstance(e, dict) and e.get("text") and e.get("type")]
    except Exception:
        result = []
    with lock:
        if len(_CLAUDE_NER_CACHE) >= 2_000:
            try:
                del _CLAUDE_NER_CACHE[next(iter(_CLAUDE_NER_CACHE))]
            except Exception:
                pass
        _CLAUDE_NER_CACHE[cache_key] = result
    return result
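
The eviction step at the end of `_ner_claude` relies on Python dicts preserving insertion order, so `next(iter(cache))` is the oldest key. The standalone sketch below isolates that pattern; `cache_put` and the small `max_entries` value are hypothetical and chosen only to make the behaviour visible.

```python
def cache_put(cache: dict, key, value, max_entries: int = 2000) -> None:
    """Insert key/value, dropping the oldest entry once the cache is full."""
    if len(cache) >= max_entries:
        # Dicts iterate in insertion order, so the first key is the oldest one.
        del cache[next(iter(cache))]
    cache[key] = value


cache = {}
for name in ("a", "b", "c"):
    cache_put(cache, hash(name), [name], max_entries=2)
print(len(cache))
```

Two properties of the `hash(text)` key are worth keeping in mind: string hashes are randomised per process by default, so these keys are only meaningful for an in-memory cache, and two different texts can in principle share a hash value. A content digest such as `hashlib.sha256` would avoid both, at a small CPU cost.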
# ── OCR page cache ───────────────────────────────────────────────────────────
_OCR_CACHE_PATH = Path.home() / ".document_scanner_ocr_cache.db"
@ -743,8 +834,15 @@ def count_pii_types(text: str, use_ner: bool = True) -> dict:
if 1 <= int(reg) <= 9999 and len(acct) >= 6: if 1 <= int(reg) <= 9999 and len(acct) >= 6:
counts["BANK_ACCOUNT"] += 1 counts["BANK_ACCOUNT"] += 1
# NER-based counts — only run if model is loaded and text is non-trivial # NER-based counts — Claude (if enabled) else spaCy
if use_ner and len(text.strip()) > 20: if use_ner and len(text.strip()) > 20:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_t = ent.get("type")
if _t in counts:
counts[_t] += 1
else:
nlp = load_nlp() nlp = load_nlp()
if nlp: if nlp:
NER_LIMIT = 20_000 NER_LIMIT = 20_000
@ -902,21 +1000,26 @@ def find_pii_spans_in_text(text: str, use_ner: bool = True) -> list[tuple[int, i
if _is_name_match(m): if _is_name_match(m):
spans.append((m.start(), m.end(), "NAME")) spans.append((m.start(), m.end(), "NAME"))
# NER (names, addresses, orgs) # NER spans — Claude (if enabled) else spaCy
# Cap at 20 000 chars per call — spaCy NER is O(n) but dense tabular text
# (e.g. Excel-converted PDFs) can have thousands of tokens per page and stall.
#
# Context boosting: spaCy needs sentence context to recognise isolated names.
# For short text (< 80 chars, e.g. a single cell or line) we prepend a label
# so the model sees "Navn: Peter Hansen" instead of bare "Peter Hansen".
# Matches are shifted back by the prefix length before being recorded.
if use_ner: if use_ner:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_label = ent.get("type")
_ent_text = ent.get("text", "")
if not _ent_text or _label not in ("NAME", "ADDRESS", "ORG"):
continue
for _m in re.finditer(re.escape(_ent_text), text):
spans.append((_m.start(), _m.end(), _label))
else:
# spaCy NER — cap at 20 000 chars per call (dense tabular text can stall).
# Context boosting: prepend "Navn: " for short/isolated text so spaCy
# sees sentence context; shift match positions back by prefix length.
nlp = load_nlp() nlp = load_nlp()
if nlp: if nlp:
NER_LIMIT = 20_000 NER_LIMIT = 20_000
PREFIX = "Navn: " PREFIX = "Navn: "
PLEN = len(PREFIX) PLEN = len(PREFIX)
# Only inject prefix for short/isolated text
if len(text.strip()) < 80: if len(text.strip()) < 80:
ner_input = PREFIX + text ner_input = PREFIX + text
ner_offset = -PLEN ner_offset = -PLEN

View File

@ -551,6 +551,68 @@ def _smb_read_file(tree, smb_path: str) -> bytes:
fh.close(get_attributes=False) fh.close(get_attributes=False)
def write_smb_file(smb_path_uri: str, content: bytes,
                   username: str, password: str, domain: str = "") -> None:
    """Overwrite an SMB file at smb_path_uri (e.g. '//host/share/folder/file.docx').

    Raises RuntimeError if smbprotocol is not installed.
    Raises ValueError if the path cannot be parsed.
    All SMB errors propagate as-is.
    """
    if not SMB_OK:
        raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
    norm = smb_path_uri.replace("\\", "/").lstrip("/")
    parts = norm.split("/", 2)
    if len(parts) < 2:
        raise ValueError(f"Cannot parse SMB path '{smb_path_uri}' — expected //host/share[/path]")
    host = parts[0]
    share = parts[1]
    file_rel = parts[2].replace("/", "\\") if len(parts) > 2 else ""
    if not host or not share or not file_rel:
        raise ValueError(f"Cannot parse SMB path '{smb_path_uri}'")

    import uuid as _uuid
    conn = Connection(_uuid.uuid4(), host, 445)
    conn.connect(timeout=30)
    try:
        session = Session(conn, username=username, password=password,
                          require_encryption=False)
        if domain:
            session.username = f"{domain}\\{username}"
        session.connect()
        try:
            tree = TreeConnect(session, f"\\\\{host}\\{share}")
            tree.connect()
            try:
                fh = Open(tree, file_rel)
                fh.create(
                    ImpersonationLevel.Impersonation,
                    FilePipePrinterAccessMask.FILE_WRITE_DATA |
                    FilePipePrinterAccessMask.FILE_WRITE_ATTRIBUTES,
                    FileAttributes.FILE_ATTRIBUTE_NORMAL,
                    ShareAccess.FILE_SHARE_NONE,
                    CreateDisposition.FILE_SUPERSEDE,
                    CreateOptions.FILE_NON_DIRECTORY_FILE,
                )
                try:
                    chunk_size = 1024 * 1024
                    offset = 0
                    while offset < len(content):
                        chunk = content[offset:offset + chunk_size]
                        fh.write(chunk, offset)
                        offset += len(chunk)
                finally:
                    fh.close(get_attributes=False)
            finally:
                tree.disconnect()
        finally:
            session.disconnect()
    finally:
        conn.disconnect()
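
The path handling at the top of `write_smb_file` can be exercised without a network connection. The sketch below lifts that logic into a standalone helper; the name `split_smb_path` is hypothetical, and it collapses the two `ValueError` checks of the original into one.

```python
def split_smb_path(smb_path_uri: str):
    """Split '//host/share/dir/file' into (host, share, backslash-relative path)."""
    norm = smb_path_uri.replace("\\", "/").lstrip("/")
    parts = norm.split("/", 2)
    # A writable target needs all three components: host, share, and a file path.
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"Cannot parse SMB path '{smb_path_uri}'")
    host, share, rel = parts
    return host, share, rel.replace("/", "\\")


print(split_smb_path("//nas01/shared/folder/file.docx"))
```

Both forward-slash and Windows-style UNC input are accepted because backslashes are normalised before splitting; a bare `//host/share` is rejected since there is no file to overwrite.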
def _smb_ts(windows_ts: int) -> str:
    """Convert Windows FILETIME (100ns intervals since 1601-01-01) to YYYY-MM-DD."""
    if not windows_ts:

View File

@ -6,7 +6,7 @@ Stores scan results alongside the existing JSON cache. Neither replaces the
other: JSON is fast and portable, SQLite enables querying, trending, and the other: JSON is fast and portable, SQLite enables querying, trending, and the
data-subject index. data-subject index.
Database location: ~/.gdpr_scanner.db (configurable via DB_PATH) Database location: ~/.gdprscanner/scanner.db (configurable via DB_PATH)
Schema Schema
------ ------
@ -29,11 +29,14 @@ Usage (from gdpr_scanner.py)
import hashlib
import json
import logging
import sqlite3
import time
from pathlib import Path
from typing import Iterator

logger = logging.getLogger(__name__)

from pathlib import Path as _P
_DATA_DIR = _P.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True)
@ -180,6 +183,17 @@ CREATE INDEX IF NOT EXISTS idx_dellog_time ON deletion_log(deleted_at);
CREATE INDEX IF NOT EXISTS idx_dellog_item ON deletion_log(item_id);
CREATE INDEX IF NOT EXISTS idx_dellog_reason ON deletion_log(reason);

CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL NOT NULL,
    action TEXT NOT NULL DEFAULT '',
    actor TEXT NOT NULL DEFAULT '',
    detail TEXT NOT NULL DEFAULT '',
    ip TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_audit_ts ON audit_log(ts);
CREATE INDEX IF NOT EXISTS idx_audit_action ON audit_log(action);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_items_scan ON flagged_items(scan_id);
CREATE INDEX IF NOT EXISTS idx_items_source ON flagged_items(source_type);
@ -200,6 +214,9 @@ _MIGRATIONS: list[tuple[int, str]] = [
    (4, "ALTER TABLE flagged_items ADD COLUMN face_count INTEGER NOT NULL DEFAULT 0"),
    (5, "ALTER TABLE flagged_items ADD COLUMN exif_json TEXT NOT NULL DEFAULT '{}'"),
    (6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
    (8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
    (9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
    (10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
    (7, """CREATE TABLE IF NOT EXISTS schedule_runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        started_at REAL NOT NULL,
@ -211,6 +228,7 @@ _MIGRATIONS: list[tuple[int, str]] = [
        emailed INTEGER NOT NULL DEFAULT 0,
        error TEXT NOT NULL DEFAULT ''
    )"""),
    (11, "ALTER TABLE flagged_items ADD COLUMN account_name TEXT NOT NULL DEFAULT ''"),
]
@ -311,8 +329,9 @@ class ScanDB:
(id, scan_id, name, source, source_type, account_id, folder, (id, scan_id, name, source, source_type, account_id, folder,
url, drive_id, size_kb, modified, cpr_count, risk, url, drive_id, size_kb, modified, cpr_count, risk,
thumb_b64, thumb_mime, attachments, user_role, transfer_risk, thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
special_category, face_count, exif_json, full_path, scanned_at) special_category, face_count, exif_json, full_path,
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""", email_count, phone_count, body_excerpt, account_name, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
( (
card.get("id", ""), card.get("id", ""),
scan_id, scan_id,
@ -336,6 +355,10 @@ class ScanDB:
card.get("face_count", 0), card.get("face_count", 0),
json.dumps(card.get("exif", {})), json.dumps(card.get("exif", {})),
card.get("full_path", ""), card.get("full_path", ""),
card.get("email_count", 0),
card.get("phone_count", 0),
card.get("body_excerpt", ""),
card.get("account_name", ""),
now, now,
), ),
) )
@ -414,6 +437,33 @@ class ScanDB:
c.commit() c.commit()
    def finalize_orphan_scans(self) -> int:
        """Finalise scans left unfinished by a crash, kill, or mid-scan restart.

        After a fresh process start nothing is scanning, so any scan still
        carrying finished_at IS NULL is dead: the process that owned it is gone.
        Its already-saved flagged_items were stranded: both get_session_items
        and get_open_items require finished_at, so those items are invisible and
        effectively lost. Finalising the orphans on startup makes them show up
        and prevents permanent data loss from interrupted scans (the M365 and
        Google engines return early on abort and never reach finish_scan; only
        the file scan finalises in a finally block).

        Safe to call only when no scan is running (i.e. at startup). Returns the
        number of scans finalised.
        """
        rows = self._connect().execute(
            "SELECT id, total_scanned FROM scans WHERE finished_at IS NULL"
        ).fetchall()
        count = 0
        for sid, total in rows:
            try:
                self.finish_scan(sid, total or 0)
                count += 1
            except Exception as e:
                logger.warning("[db] finalize_orphan_scans: scan %s failed: %s", sid, e)
        return count
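
The recovery idea can be tried against an in-memory database. The sketch below is a simplified stand-in, not the `ScanDB` implementation: it uses a reduced, hypothetical `scans` schema and stamps `finished_at` with a single `UPDATE` instead of calling `finish_scan` per row.

```python
import sqlite3
import time


def finalize_orphans(conn: sqlite3.Connection) -> int:
    """Stamp every unfinished scan as finished; return how many were touched."""
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (time.time(),),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
)
conn.executemany(
    "INSERT INTO scans (started_at, finished_at) VALUES (?, ?)",
    [(100.0, 160.0), (200.0, None), (300.0, None)],  # one finished, two orphaned
)
first = finalize_orphans(conn)
second = finalize_orphans(conn)
print(first, second)
```

Running it twice shows the operation is idempotent, which is what makes it safe to call unconditionally at every startup.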
# ── Query helpers ───────────────────────────────────────────────────────── # ── Query helpers ─────────────────────────────────────────────────────────
def latest_scan_id(self) -> int | None: def latest_scan_id(self) -> int | None:
@ -518,6 +568,71 @@ class ScanDB:
result.append(d) result.append(d)
return result return result
    def get_open_items(self) -> list[dict]:
        """Return every flagged item across all scans that has no action taken.

        "Open" means the item has no disposition row (or a row whose status is
        still 'unreviewed'). Unlike get_session_items this is NOT limited to the
        latest scan window; it surfaces all outstanding items so nothing slips
        out of view once a newer scan starts a fresh session.

        flagged_items has a composite PK of (id, scan_id), so the same logical
        item appears once per scan that flagged it. We deduplicate by id, keeping
        the row from the most recent finished scan, so each open item shows once.
        """
        rows = self._connect().execute(
            """SELECT fi.*, COALESCE(d.status, 'unreviewed') AS disposition
               FROM flagged_items fi
               JOIN scans s ON fi.scan_id = s.id
               LEFT JOIN dispositions d ON d.item_id = fi.id
               WHERE s.finished_at IS NOT NULL
                 AND (d.item_id IS NULL OR d.status = 'unreviewed')
                 AND fi.scan_id = (
                     SELECT MAX(fi2.scan_id)
                     FROM flagged_items fi2
                     JOIN scans s2 ON fi2.scan_id = s2.id
                     WHERE fi2.id = fi.id AND s2.finished_at IS NOT NULL
                 )
               ORDER BY fi.cpr_count DESC""",
        ).fetchall()
        result = []
        for r in rows:
            d = dict(r)
            d["attachments"] = json.loads(d.get("attachments") or "[]")
            result.append(d)
        return result
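
The deduplication in `get_open_items` is easiest to follow on a tiny dataset. The sketch below runs the same query shape against reduced, hypothetical versions of the three tables (only the columns the query needs) and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scans (id INTEGER PRIMARY KEY, finished_at REAL);
CREATE TABLE flagged_items (id TEXT, scan_id INTEGER, cpr_count INTEGER,
                            PRIMARY KEY (id, scan_id));
CREATE TABLE dispositions (item_id TEXT PRIMARY KEY, status TEXT);

INSERT INTO scans VALUES (1, 10.0), (2, 20.0), (3, NULL);  -- scan 3 is unfinished
INSERT INTO flagged_items VALUES
    ('A', 1, 1), ('A', 2, 4), ('A', 3, 9),  -- flagged by three scans
    ('B', 1, 2),                            -- already handled (see dispositions)
    ('C', 2, 1),                            -- explicitly still unreviewed
    ('D', 3, 5);                            -- only seen by the unfinished scan
INSERT INTO dispositions VALUES ('B', 'delete'), ('C', 'unreviewed');
""")

open_items = conn.execute("""
    SELECT fi.id, fi.scan_id, fi.cpr_count
    FROM flagged_items fi
    JOIN scans s ON fi.scan_id = s.id
    LEFT JOIN dispositions d ON d.item_id = fi.id
    WHERE s.finished_at IS NOT NULL
      AND (d.item_id IS NULL OR d.status = 'unreviewed')
      AND fi.scan_id = (
          SELECT MAX(fi2.scan_id)
          FROM flagged_items fi2
          JOIN scans s2 ON fi2.scan_id = s2.id
          WHERE fi2.id = fi.id AND s2.finished_at IS NOT NULL
      )
    ORDER BY fi.cpr_count DESC
""").fetchall()
print(open_items)
```

Item A appears once, taken from scan 2 (the newest finished scan), B is excluded because it has a disposition, and D is excluded because its only scan never finished.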
    def get_related_items(self, item_id: str, ref_scan_id: int | None = None,
                          window_seconds: int = 300) -> list[dict]:
        """Return flagged items from the same session that share at least one CPR
        hash with *item_id*, ordered by number of shared CPRs descending."""
        if ref_scan_id:
            row = self._connect().execute(
                "SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
            ).fetchone()
        else:
            row = self._connect().execute(
                "SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
            ).fetchone()
        if not row:
            return []
        latest_start = row[0]
        rows = self._connect().execute(
            """SELECT fi.*, COUNT(DISTINCT ci2.cpr_hash) AS shared_cprs
               FROM cpr_index ci1
               JOIN cpr_index ci2 ON ci2.cpr_hash = ci1.cpr_hash
               JOIN flagged_items fi ON fi.id = ci2.item_id
               JOIN scans s ON fi.scan_id = s.id
               WHERE ci1.item_id = ?
                 AND fi.id != ?
                 AND s.started_at BETWEEN ? AND ?
                 AND s.finished_at IS NOT NULL
               GROUP BY fi.id
               ORDER BY shared_cprs DESC, fi.cpr_count DESC""",
            (item_id, item_id, latest_start - window_seconds, latest_start + window_seconds),
        ).fetchall()
        return [dict(r) for r in rows]

    def get_session_sources(self, window_seconds: int = 300) -> set:
        """Return the union of all source keys scanned in the current session.
@ -771,6 +886,34 @@ class ScanDB:
).fetchone()[0] or 0 ).fetchone()[0] or 0
return {"total": total, "by_reason": by_reason, "cpr_hits_deleted": cpr_deleted} return {"total": total, "by_reason": by_reason, "cpr_hits_deleted": cpr_deleted}
    # ── Compliance audit log ──────────────────────────────────────────────────

    def log_audit(self, action: str, detail: str = "",
                  actor: str = "", ip: str = "") -> None:
        """Write an immutable compliance audit record."""
        c = self._connect()
        c.execute(
            "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?,?,?,?,?)",
            (time.time(), action, actor, detail, ip),
        )
        c.commit()

    def get_audit_log(self, limit: int = 200,
                      action: str | None = None) -> list[dict]:
        """Return audit records, most recent first."""
        c = self._connect()
        if action:
            rows = c.execute(
                "SELECT * FROM audit_log WHERE action=? ORDER BY ts DESC LIMIT ?",
                (action, limit),
            ).fetchall()
        else:
            rows = c.execute(
                "SELECT * FROM audit_log ORDER BY ts DESC LIMIT ?",
                (limit,),
            ).fetchall()
        return [dict(r) for r in rows]
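
A standalone round trip through the audit table might look like the sketch below. It reuses the `audit_log` schema from this diff but is otherwise an illustration: the free functions, the optional `ts` parameter (added here only to make ordering deterministic), and the action names and details are all hypothetical.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL NOT NULL,
    action TEXT NOT NULL DEFAULT '',
    actor TEXT NOT NULL DEFAULT '',
    detail TEXT NOT NULL DEFAULT '',
    ip TEXT NOT NULL DEFAULT ''
);
"""


def log_audit(conn, action, detail="", actor="", ip="", ts=None):
    """Append one audit record; ts defaults to the current time."""
    conn.execute(
        "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?,?,?,?,?)",
        (time.time() if ts is None else ts, action, actor, detail, ip),
    )
    conn.commit()


def get_audit_log(conn, limit=200, action=None):
    """Return records newest first, optionally filtered by action."""
    if action:
        rows = conn.execute(
            "SELECT * FROM audit_log WHERE action=? ORDER BY ts DESC LIMIT ?",
            (action, limit),
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT * FROM audit_log ORDER BY ts DESC LIMIT ?", (limit,)
        ).fetchall()
    return [dict(r) for r in rows]


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets dict(row) produce column-name keys
conn.executescript(SCHEMA)
log_audit(conn, "scan_start", detail="profile=weekly", ip="10.0.0.5", ts=1.0)
log_audit(conn, "profile_save", detail="weekly", ip="10.0.0.5", ts=2.0)
log_audit(conn, "scan_start", detail="profile=full", ip="10.0.0.7", ts=3.0)
recent = get_audit_log(conn, limit=2)
starts = get_audit_log(conn, action="scan_start")
print([r["action"] for r in recent])
```

In the code shown in this diff, the log is append-only by convention: only insert and select paths are exposed, and nothing in the schema itself prevents a later `UPDATE` or `DELETE`.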
    def delete_item_record(self, item_id: str, scan_id: int | None = None) -> None:
        """Remove a flagged item from the DB (after it has been deleted in M365)."""
        c = self._connect()
@ -1019,6 +1162,15 @@ class ScanDB:
_db: ScanDB | None = None


def log_audit_event(action: str, detail: str = "",
                    actor: str = "", ip: str = "") -> None:
    """Write an audit record to the shared DB. Silently no-ops if DB unavailable."""
    try:
        get_db().log_audit(action, detail, actor=actor, ip=ip)
    except Exception:
        pass


def get_db(path: Path = DB_PATH) -> ScanDB:
    """Return the module-level ScanDB singleton, creating it if needed."""
    global _db

View File

@ -251,7 +251,7 @@ from app_config import (
from checkpoint import ( from checkpoint import (
_checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint, _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _save_delta_tokens, _load_delta_tokens, _save_delta_tokens,
_CHECKPOINT_PATH, _DELTA_PATH, _cp_path, _DELTA_PATH,
) )
from sse import broadcast, _sse_queues, _sse_buffer from sse import broadcast, _sse_queues, _sse_buffer
@ -317,6 +317,11 @@ app = Flask(__name__,
            template_folder=_os.path.join(_BASE_DIR, "templates"),
            static_folder=_os.path.join(_BASE_DIR, "static"))

# Static files must revalidate on every load (cheap 304s via ETag). Without
# this there is no Cache-Control header and browsers cache JS/CSS heuristically
# for days — after a self-update the backend is new but the UI stays stale.
app.config["SEND_FILE_MAX_AGE_DEFAULT"] = 0

# Session secret — derived from machine_id so it survives restarts without a separate file.
# machine_id is also the Fernet key (base64-encoded 32 bytes); we use its raw bytes as the secret.
try:
@ -1572,10 +1577,11 @@ from routes.scheduler import bp as scheduler_bp
from routes.google_auth import bp as google_auth_bp
from routes.google_scan import bp as google_scan_bp
from routes.viewer import bp as viewer_bp
from routes.updates import bp as updates_bp

for _bp in [auth_bp, users_bp, scan_bp, sources_bp, profiles_bp,
            email_bp, database_bp, export_bp, app_routes_bp, scheduler_bp,
google_auth_bp, google_scan_bp, viewer_bp]: google_auth_bp, google_scan_bp, viewer_bp, updates_bp]:
app.register_blueprint(_bp) app.register_blueprint(_bp)
# ── Entry point ─────────────────────────────────────────────────────────────── # ── Entry point ───────────────────────────────────────────────────────────────
@ -1592,10 +1598,10 @@ Headless (scheduled) usage:
environment variables: M365_CLIENT_ID, M365_TENANT_ID, M365_CLIENT_SECRET environment variables: M365_CLIENT_ID, M365_TENANT_ID, M365_CLIENT_SECRET
or a settings JSON: --settings /path/to/settings.json or a settings JSON: --settings /path/to/settings.json
Scan options are loaded from ~/.gdpr_scanner_settings.json (saved automatically Scan options are loaded from ~/.gdprscanner/settings.json (saved automatically
after any interactive scan), or overridden in the --settings file. after any interactive scan), or overridden in the --settings file.
SMTP config is loaded from ~/.gdpr_scanner_smtp.json (saved in the UI) or from SMTP config is loaded from ~/.gdprscanner/smtp.json (saved in the UI) or from
an 'smtp' key in the --settings file. an 'smtp' key in the --settings file.
Example cron (weekly, Mondays at 06:00): Example cron (weekly, Mondays at 06:00):
@ -1630,7 +1636,7 @@ Example --settings file with SMTP:
parser.add_argument("--output", default=".", parser.add_argument("--output", default=".",
help="Output directory for Excel export in headless mode (default: .)") help="Output directory for Excel export in headless mode (default: .)")
parser.add_argument("--settings", default=None, parser.add_argument("--settings", default=None,
help="Path to a JSON settings file (overrides ~/.gdpr_scanner_settings.json)") help="Path to a JSON settings file (overrides ~/.gdprscanner/settings.json)")
parser.add_argument("--email-to", default=None, parser.add_argument("--email-to", default=None,
help="Comma-separated recipient addresses — send Excel report by email (headless only)") help="Comma-separated recipient addresses — send Excel report by email (headless only)")
parser.add_argument("--retention-years", type=int, default=None, parser.add_argument("--retention-years", type=int, default=None,
@ -1638,7 +1644,7 @@ Example --settings file with SMTP:
parser.add_argument("--fiscal-year-end", default=None, parser.add_argument("--fiscal-year-end", default=None,
help="Fiscal year end as MM-DD for retention cutoff (e.g. 12-31 for Bogforingsloven). Omit for rolling window.") help="Fiscal year end as MM-DD for retention cutoff (e.g. 12-31 for Bogforingsloven). Omit for rolling window.")
parser.add_argument("--reset-db", action="store_true", parser.add_argument("--reset-db", action="store_true",
help="Reset the results database (~/.gdpr_scanner.db) — permanently deletes all scan history, " help="Reset the results database (~/.gdprscanner/scanner.db) — permanently deletes all scan history, "
"dispositions, and deletion log. Prompts for confirmation unless --yes is also passed.") "dispositions, and deletion log. Prompts for confirmation unless --yes is also passed.")
parser.add_argument("--yes", action="store_true", parser.add_argument("--yes", action="store_true",
help="Skip confirmation prompts (use with --reset-db for scripted resets)") help="Skip confirmation prompts (use with --reset-db for scripted resets)")
@ -1842,7 +1848,7 @@ Example --settings file with SMTP:
(_SETTINGS_PATH, "Headless scan settings"), (_SETTINGS_PATH, "Headless scan settings"),
(_ROLE_OVERRIDES_PATH, "Manual role overrides"), (_ROLE_OVERRIDES_PATH, "Manual role overrides"),
(_FILE_SOURCES_PATH, "File source definitions"), (_FILE_SOURCES_PATH, "File source definitions"),
(_CHECKPOINT_PATH, "Scan checkpoint (resume state)"), (_cp_path("m365"), "Scan checkpoint (resume state)"),
(_DELTA_PATH, "Delta scan tokens"), (_DELTA_PATH, "Delta scan tokens"),
(_LANG_OVERRIDE_FILE, "Language preference"), (_LANG_OVERRIDE_FILE, "Language preference"),
(Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"), (Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@ -1929,10 +1935,12 @@ Example --settings file with SMTP:
print(" ✖ m365_db not available — cannot reset") print(" ✖ m365_db not available — cannot reset")
_sys.exit(1) _sys.exit(1)
# Also clear the JSON checkpoint so the UI starts with no cached results # Also clear all checkpoints so the UI starts with no cached results
_clear_checkpoint() from pathlib import Path as _Path
if not _CHECKPOINT_PATH.exists(): for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
print(f" ✔ Checkpoint cleared") try: _cpf.unlink()
except Exception: pass
print(f" ✔ Checkpoints cleared")
# Clear delta tokens too — stale after a full DB reset # Clear delta tokens too — stale after a full DB reset
if _DELTA_PATH.exists(): if _DELTA_PATH.exists():
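
Review note: the hunk above replaces the single-checkpoint cleanup with a glob over `checkpoint_*.json`. A minimal standalone sketch of that pattern follows, assuming the checkpoints live directly under one config directory. The helper name `clear_checkpoints` and the returned count are illustrative additions, not part of the change.

```python
import tempfile
from pathlib import Path


def clear_checkpoints(config_dir: Path) -> int:
    """Delete every checkpoint_*.json in config_dir and return how many were removed.

    Hypothetical helper: the diff inlines this loop and does not count deletions.
    """
    removed = 0
    for cp_file in config_dir.glob("checkpoint_*.json"):
        try:
            cp_file.unlink()
            removed += 1
        except OSError:
            # Best effort, as in the diff: one undeletable file must not abort the reset.
            pass
    return removed


# Demo in a throwaway directory so no real settings are touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ("checkpoint_m365.json", "checkpoint_google.json", "settings.json"):
        (demo_dir / name).write_text("{}")
    removed_count = clear_checkpoints(demo_dir)
    remaining = sorted(p.name for p in demo_dir.iterdir())
    print(removed_count, remaining)
```

Returning the count would let the caller report how many checkpoints were actually removed instead of printing a success line unconditionally.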
@ -2141,7 +2149,7 @@ Example --settings file with SMTP:
email_to = getattr(args, "email_to", None) email_to = getattr(args, "email_to", None)
if email_to: if email_to:
recipients = [r.strip() for r in email_to.replace(";", ",").split(",") if r.strip()] recipients = [r.strip() for r in email_to.replace(";", ",").split(",") if r.strip()]
# SMTP config: --settings file takes priority, then saved ~/.gdpr_scanner_smtp.json # SMTP config: --settings file takes priority, then saved ~/.gdprscanner/smtp.json
smtp_cfg = _load_smtp_config() smtp_cfg = _load_smtp_config()
if cfg.get("smtp"): if cfg.get("smtp"):
smtp_cfg = {**smtp_cfg, **cfg["smtp"]} smtp_cfg = {**smtp_cfg, **cfg["smtp"]}
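
Review note: the recipient parsing and the SMTP precedence in this hunk can be exercised in isolation. The sketch below mirrors the two expressions from the diff; the function names `parse_recipients` and `merge_smtp_config` are hypothetical wrappers introduced only for illustration.

```python
def parse_recipients(raw):
    """Split a recipient string on commas or semicolons, dropping empty entries."""
    return [r.strip() for r in raw.replace(";", ",").split(",") if r.strip()]


def merge_smtp_config(saved, settings):
    """Overlay the 'smtp' key of a --settings file on top of the saved SMTP config.

    Keys from the settings file win; keys only present in the saved config survive.
    """
    override = settings.get("smtp")
    if override:
        return {**saved, **override}
    return dict(saved)


saved_cfg = {"host": "smtp.example.org", "port": 587, "username": "scanner"}
file_cfg = {"smtp": {"host": "relay.example.org"}}

print(parse_recipients("dpo@example.org; it@example.org, ,"))
print(merge_smtp_config(saved_cfg, file_cfg))
```

Because the merge is a shallow dict overlay, a settings file only has to list the keys it wants to change.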
@ -2258,14 +2266,33 @@ Example --settings file with SMTP:
# Find a free port — auto-increment from the requested port if in use. # Find a free port — auto-increment from the requested port if in use.
import socket as _socket import socket as _socket
def _find_free_port(start: int, host: str) -> int:
for p in range(start, start + 100): def _can_bind(p: int, host: str) -> bool:
with _socket.socket(_socket.AF_INET, _socket.SOCK_STREAM) as s: with _socket.socket(_socket.AF_INET, _socket.SOCK_STREAM) as s:
# Probe with SO_REUSEADDR, matching how Werkzeug binds.
# Without it, connections left in TIME_WAIT by a previous
# instance (e.g. the in-app update restart) make the port
# look occupied and the app silently moves to the next one.
s.setsockopt(_socket.SOL_SOCKET, _socket.SO_REUSEADDR, 1)
try: try:
s.bind((host, p)) s.bind((host, p))
return p return True
except OSError: except OSError:
continue return False
def _find_free_port(start: int, host: str) -> int:
# Give the requested port a grace period — after a self-restart
# the previous process may not have released it yet.
deadline = time.time() + 10
while True:
if _can_bind(start, host):
return start
if time.time() >= deadline:
break
time.sleep(0.5)
for p in range(start + 1, start + 100):
if _can_bind(p, host):
return p
raise RuntimeError(f"No free port found in range {start}–{start + 99}") raise RuntimeError(f"No free port found in range {start}–{start + 99}")
actual_port = _find_free_port(args.port, args.host) actual_port = _find_free_port(args.port, args.host)
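
Review note: a self-contained sketch of the probing strategy in this hunk, for trying it outside the app. It keeps the `SO_REUSEADDR` probe and the grace period on the requested port. As assumptions of this sketch only, the grace period, poll interval, and span are parameters, and `time.monotonic()` stands in for the `time.time()` used in the diff.

```python
import socket
import time


def can_bind(port, host):
    """Return True if a TCP socket with SO_REUSEADDR can bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Same option as the diff, so TIME_WAIT leftovers do not make the port look busy.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def find_free_port(start, host, grace_seconds=10.0, poll_interval=0.5, span=100):
    """Prefer `start`, retrying it for grace_seconds, then scan the ports after it."""
    deadline = time.monotonic() + grace_seconds
    while True:
        if can_bind(start, host):
            return start
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_interval)
    for port in range(start + 1, start + span):
        if can_bind(port, host):
            return port
    raise RuntimeError(f"No free port found in range {start}-{start + span - 1}")


# Demo: hold a port with a listening socket, then ask for that same port.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
holder.listen(1)
busy_port = holder.getsockname()[1]
chosen = find_free_port(busy_port, "127.0.0.1", grace_seconds=0.2, poll_interval=0.05)
holder.close()
print(busy_port, chosen)
```

A short grace period keeps the demo fast. The 10 seconds in the diff is aimed at a self-restart, where the previous process may need a moment to release the port.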
@ -2278,6 +2305,19 @@ Example --settings file with SMTP:
print(f"\n GDPRScanner\n ──────────────────────────────") print(f"\n GDPRScanner\n ──────────────────────────────")
print(f" Open: http://{args.host}:{args.port}") print(f" Open: http://{args.host}:{args.port}")
# Recover scans left unfinished by a crash / kill / mid-scan restart.
# Nothing is scanning at startup, so any scan with finished_at IS NULL is
# dead; finalising it makes its already-saved items visible again instead
# of stranding them (both get_session_items and get_open_items require a
# finished scan). Must run before the scheduler can start a new scan.
try:
if DB_OK:
_recovered = _get_db().finalize_orphan_scans()
if _recovered:
print(f" Recovered {_recovered} unfinished scan(s) from a prior restart")
except Exception as _orphan_err:
print(f" Orphan-scan recovery: failed ({_orphan_err})")
# Start in-process scheduler (#19) # Start in-process scheduler (#19)
try: try:
import scan_scheduler as _sched_mod import scan_scheduler as _sched_mod
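
Review note: the body of `finalize_orphan_scans()` is not part of this diff. The following is only a sketch of what such a recovery step could look like on SQLite, assuming a hypothetical `scans` table with a nullable `finished_at` column. The real schema and method may differ.

```python
import sqlite3
import time


def finalize_orphan_scans(conn, now=None):
    """Stamp finished_at on every scan that never finished; return the number of rows fixed.

    Assumed schema: scans(id, started_at, finished_at), where NULL means unfinished.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL", (now,)
    )
    conn.commit()
    return cur.rowcount


# Demo on an in-memory database: one finished scan, two left open by a crash.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
)
conn.executemany(
    "INSERT INTO scans (started_at, finished_at) VALUES (?, ?)",
    [(1.0, 2.0), (3.0, None), (5.0, None)],
)
recovered = finalize_orphan_scans(conn, now=10.0)
still_open = conn.execute(
    "SELECT COUNT(*) FROM scans WHERE finished_at IS NULL"
).fetchone()[0]
print(recovered, still_open)
```

Running this before the scheduler starts matters for the reason the diff comment gives: at startup no scan can be live, so every open row is safe to close.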
@ -2294,5 +2334,14 @@ Example --settings file with SMTP:
except Exception as _sched_err: except Exception as _sched_err:
print(f" Scheduler: failed to start ({_sched_err})") print(f" Scheduler: failed to start ({_sched_err})")
# Auto-update background thread (Settings → General → Software update)
try:
from routes.updates import start_auto_update_thread
from app_config import get_update_config as _get_upd_cfg
if start_auto_update_thread() and _get_upd_cfg().get("auto_update"):
print(" Auto-update: enabled (checked daily)")
except Exception as _upd_err:
print(f" Auto-update: failed to start ({_upd_err})")
print(f" Press Ctrl+C to stop\n") print(f" Press Ctrl+C to stop\n")
app.run(host=args.host, port=args.port, debug=False, threaded=True) app.run(host=args.host, port=args.port, debug=False, threaded=True)


@ -70,6 +70,9 @@ GMAIL_SCOPES = [
DRIVE_SCOPES = [ DRIVE_SCOPES = [
"https://www.googleapis.com/auth/drive.readonly", "https://www.googleapis.com/auth/drive.readonly",
] ]
DRIVE_WRITE_SCOPES = [
"https://www.googleapis.com/auth/drive",
]
ADMIN_SCOPES = [ ADMIN_SCOPES = [
"https://www.googleapis.com/auth/admin.directory.user.readonly", "https://www.googleapis.com/auth/admin.directory.user.readonly",
] ]
@ -284,6 +287,26 @@ class GoogleConnector:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb) return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)
# ── Drive write-back (redaction) ──────────────────────────────────────────
def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _get_drive_file_mime(service, file_id)
def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _download_drive_file_by_id(service, file_id)
def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
_update_drive_file_content(service, file_id, content, mime_type)
# ── Persistence helpers ─────────────────────────────────────────────────────── # ── Persistence helpers ───────────────────────────────────────────────────────
@ -507,6 +530,30 @@ def _download_drive_file(
return None return None
def _get_drive_file_mime(service, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
info = service.files().get(fileId=file_id, fields="mimeType").execute()
return info.get("mimeType", "")
def _download_drive_file_by_id(service, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
req = service.files().get_media(fileId=file_id)
buf = io.BytesIO()
dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
done = False
while not done:
_, done = dl.next_chunk()
return buf.getvalue()
def _update_drive_file_content(service, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace a Drive file's content in-place."""
from googleapiclient.http import MediaInMemoryUpload
media = MediaInMemoryUpload(content, mimetype=mime_type, resumable=False)
service.files().update(fileId=file_id, media_body=media).execute()
def _drive_iter( def _drive_iter(
service, service,
user_email: str, user_email: str,
@ -743,6 +790,26 @@ class PersonalGoogleConnector:
raise GoogleError(f"Drive auth failed: {e}") from e raise GoogleError(f"Drive auth failed: {e}") from e
return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb) return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)
# ── Drive write-back (redaction) ──────────────────────────────────────────
def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
return _get_drive_file_mime(service, file_id)
def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
return _download_drive_file_by_id(service, file_id)
def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
self._refresh_if_needed()
service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
_update_drive_file_content(service, file_id, content, mime_type)
@staticmethod @staticmethod
def get_device_code_flow(client_id: str, client_secret: str) -> dict: def get_device_code_flow(client_id: str, client_secret: str) -> dict:
""" """


@ -106,7 +106,7 @@
"history_lbl": "Historik", "history_lbl": "Historik",
"history_items": "fund", "history_items": "fund",
"history_btn_sessions": "Sessioner", "history_btn_sessions": "Sessioner",
"history_btn_latest": "Seneste scanning", "history_btn_latest": "Åbne fund",
"history_picker_empty": "Ingen tidligere scanninger", "history_picker_empty": "Ingen tidligere scanninger",
"history_delta_badge": "Delta", "history_delta_badge": "Delta",
"history_latest_badge": "Seneste", "history_latest_badge": "Seneste",
@ -348,8 +348,9 @@
"m365_resuming": "Genoptager — springer allerede skannede elementer over…", "m365_resuming": "Genoptager — springer allerede skannede elementer over…",
"m365_opt_delta": "Delta-scanning", "m365_opt_delta": "Delta-scanning",
"m365_opt_delta_hint": "Kun ændrede elementer (efter første fulde scanning)", "m365_opt_delta_hint": "Kun ændrede elementer (efter første fulde scanning)",
"m365_delta_tokens_saved": "Tokens gemt", "m365_delta_tokens_saved": "Tokens gemt for {n} kilde(r)",
"m365_delta_clear": "Ryd tokens", "m365_delta_clear": "Ryd tokens",
"m365_delta_tokens_hint": "Gemte ændringstokens gør, at delta-scanninger kun henter elementer ændret siden sidste scanning. Ryd tokens tvinger næste scanning til at være en fuld scanning.",
"m365_delta_cleared": "Delta-tokens ryddet — næste scanning bliver fuld scanning.", "m365_delta_cleared": "Delta-tokens ryddet — næste scanning bliver fuld scanning.",
"m365_delta_mode": "Delta-tilstand — henter kun ændrede elementer…", "m365_delta_mode": "Delta-tilstand — henter kun ændrede elementer…",
"m365_smtp_title": "✉ Send rapport", "m365_smtp_title": "✉ Send rapport",
@ -365,6 +366,7 @@
"m365_smtp_recipients_hint": "Adskil med komma eller semikolon", "m365_smtp_recipients_hint": "Adskil med komma eller semikolon",
"m365_smtp_save": "Gem", "m365_smtp_save": "Gem",
"m365_smtp_auto_email_manual": "Send rapport efter manuel scanning", "m365_smtp_auto_email_manual": "Send rapport efter manuel scanning",
"m365_smtp_prefer_smtp": "Send altid via SMTP (spring Microsoft Graph over)",
"m365_smtp_send": "Send nu", "m365_smtp_send": "Send nu",
"m365_smtp_saved": "Indstillinger gemt.", "m365_smtp_saved": "Indstillinger gemt.",
"m365_smtp_sending": "Sender…", "m365_smtp_sending": "Sender…",
@ -559,8 +561,8 @@
"m365_db_import_mode": "Tilstand:", "m365_db_import_mode": "Tilstand:",
"m365_db_import_merge": "Sammenflet (sikker)", "m365_db_import_merge": "Sammenflet (sikker)",
"m365_db_import_replace": "Erstat (fuld gendannelse)", "m365_db_import_replace": "Erstat (fuld gendannelse)",
"m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdpr_scanner.db først.", "m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdprscanner/scanner.db først.",
"m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdpr_scanner.db.\\n\\nFortsæt?", "m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdprscanner/scanner.db.\\n\\nFortsæt?",
"m365_db_import_no_file": "Vælg venligst en ZIP-fil først.", "m365_db_import_no_file": "Vælg venligst en ZIP-fil først.",
"m365_db_importing": "Importerer…", "m365_db_importing": "Importerer…",
"m365_db_imported": "Importeret", "m365_db_imported": "Importeret",
@ -570,7 +572,17 @@
"m365_opt_skip_gps": "Ignorer GPS i billeder", "m365_opt_skip_gps": "Ignorer GPS i billeder",
"m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.", "m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.",
"m365_opt_min_cpr": "Min. CPR-antal pr. fil", "m365_opt_min_cpr": "Min. CPR-antal pr. fil",
"m365_opt_scan_emails": "Søg efter e-mailadresser",
"m365_opt_scan_emails_hint": "Flagger filer med e-mailadresser. Slået fra som standard — e-mailadresser er meget almindelige og kan give mange resultater.",
"m365_opt_scan_phones": "Søg efter telefonnumre",
"m365_opt_scan_phones_hint": "Flagger filer med danske telefonnumre (8 cifre). Nyttigt til at finde kontaktlister og forældrekorrespondance.",
"m365_badge_emails": "e-mail",
"m365_badge_phones": "tlf.",
"m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.", "m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
"m365_opt_cpr_only": "Kun CPR-tilstand",
"m365_opt_cpr_only_hint": "Flagger kun filer med CPR-numre. Filer med kun e-mailadresser, telefonnumre, ansigter eller EXIF-metadata ignoreres.",
"m365_opt_ocr_lang": "OCR-sprog",
"m365_opt_ocr_lang_hint": "Tesseract-sprogpakke(r) der bruges ved scanning af scannede PDF'er og billeder. Sprogpakker skal være installeret på serveren (f.eks. tesseract-ocr-dan). Flere pakker: dan+eng.",
"m365_filter_photo_only": "📷 Billeder / biometrisk", "m365_filter_photo_only": "📷 Billeder / biometrisk",
"m365_filter_all_roles": "Alle roller", "m365_filter_all_roles": "Alle roller",
"m365_filter_staff": "Ansatte", "m365_filter_staff": "Ansatte",
@ -598,16 +610,47 @@
"m365_file_sources_empty": "Ingen filkilder konfigureret. Tilføj en lokal mappe eller netværksdeling nedenfor.", "m365_file_sources_empty": "Ingen filkilder konfigureret. Tilføj en lokal mappe eller netværksdeling nedenfor.",
"m365_file_sources_add": "Tilføj kilde", "m365_file_sources_add": "Tilføj kilde",
"m365_fsrc_label": "Betegnelse", "m365_fsrc_label": "Betegnelse",
"m365_fsrc_name": "Navn",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Sti", "m365_fsrc_path": "Sti",
"m365_fsrc_smb_detected": "SMB/CIFS-netværksdeling registreret", "m365_fsrc_smb_detected": "SMB/CIFS-netværksdeling registreret",
"m365_fsrc_smb_host": "SMB-vært", "m365_fsrc_smb_host": "SMB-vært",
"m365_fsrc_smb_user": "Brugernavn", "m365_fsrc_smb_user": "Brugernavn",
"m365_fsrc_smb_pw": "Adgangskode", "m365_fsrc_smb_pw": "Adgangskode",
"m365_fsrc_smb_pw_hint": "Adgangskoden gemmes i nøglekæden — aldrig i en fil.", "m365_fsrc_smb_pw_hint": "Adgangskoden gemmes i nøglekæden — aldrig i en fil.",
"m365_fsrc_pw_keychain_placeholder": "Gemt i OS-nøglering",
"m365_fsrc_add_btn": "Tilføj", "m365_fsrc_add_btn": "Tilføj",
"m365_fsrc_saved": "Kilde gemt", "m365_fsrc_saved": "Kilde gemt",
"m365_fsrc_saving": "Gemmer...", "m365_fsrc_saving": "Gemmer...",
"m365_fsrc_path_required": "Sti er påkrævet.", "m365_fsrc_path_required": "Sti er påkrævet.",
"m365_fsrc_type_local": "Lokal mappe",
"m365_fsrc_type_smb": "Netværksdrev (SMB)",
"m365_fsrc_type_sftp": "SFTP-server",
"m365_fsrc_sftp_host": "SFTP-host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Brugernavn",
"m365_fsrc_sftp_remote_path": "Fjernsti",
"m365_fsrc_sftp_auth_password": "Adgangskode",
"m365_fsrc_sftp_auth_key": "SSH-nøgle",
"m365_fsrc_sftp_pw": "Adgangskode",
"m365_fsrc_sftp_pw_hint": "Adgangskoden gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_key_upload": "Privat nøglefil",
"m365_fsrc_sftp_key_btn": "Upload nøgle",
"m365_fsrc_sftp_key_uploaded": "Nøgle uploadet",
"m365_fsrc_sftp_passphrase": "Adgangssætning (hvis nøglen er krypteret)",
"m365_fsrc_sftp_passphrase_hint": "Adgangssætningen gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_not_installed": "paramiko er ikke installeret — kør: pip install paramiko",
"m365_fsrc_name_placeholder": "f.eks. Lærerfiler, NAS-arkiv",
"m365_fsrc_path_placeholder": "~/Dokumenter eller //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.skole.dk",
"m365_fsrc_smb_user_placeholder": "DOMÆNE\\brugernavn",
"m365_fsrc_smb_user_edit_placeholder": "DOMÆNE\\brugernavn eller brugernavn",
"m365_fsrc_sftp_host_placeholder": "sftp.skole.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Lad stå tomt hvis nøglen ikke er krypteret",
"m365_fsrc_sftp_host_required": "SFTP-host er påkrævet.",
"m365_fsrc_sftp_user_required": "SFTP-brugernavn er påkrævet.",
"m365_fsrc_scan_btn": "Scan", "m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starter filscanning", "m365_fsrc_scan_start": "Starter filscanning",
"m365_src_group_files": "Filkilder", "m365_src_group_files": "Filkilder",
@ -634,6 +677,14 @@
"m365_settings_tab_general": "Generelt", "m365_settings_tab_general": "Generelt",
"m365_settings_tab_email": "E-mailrapport", "m365_settings_tab_email": "E-mailrapport",
"m365_settings_tab_database": "Database", "m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Revisionslog",
"m365_audit_title": "Compliance-revisionslog",
"m365_audit_col_time": "Tidspunkt",
"m365_audit_col_action": "Handling",
"m365_audit_col_detail": "Detalje",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Indlæser…",
"m365_audit_empty": "Ingen revisionsbegivenheder registreret endnu.",
"m365_settings_appearance": "Udseende", "m365_settings_appearance": "Udseende",
"m365_settings_language": "Sprog", "m365_settings_language": "Sprog",
"m365_settings_theme": "Tema", "m365_settings_theme": "Tema",
@ -704,6 +755,8 @@
"m365_sched_after_scan": "Efter scanning", "m365_sched_after_scan": "Efter scanning",
"m365_sched_auto_email": "Send rapport automatisk", "m365_sched_auto_email": "Send rapport automatisk",
"m365_sched_auto_retention": "Håndhæv opbevaringspolitik", "m365_sched_auto_retention": "Håndhæv opbevaringspolitik",
"m365_sched_report_only": "Kun rapport",
"m365_sched_report_only_hint": "Send de seneste scanningsresultater uden at køre en ny scanning. Kræver scanningsresultater i databasen.",
"m365_sched_status": "Status", "m365_sched_status": "Status",
"m365_sched_run_now": "▶ Kør nu", "m365_sched_run_now": "▶ Kør nu",
"m365_sched_add": "+ Tilføj planlagt scanning", "m365_sched_add": "+ Tilføj planlagt scanning",
@ -712,6 +765,9 @@
"m365_sched_editor_edit": "Rediger planlagt scanning", "m365_sched_editor_edit": "Rediger planlagt scanning",
"m365_sched_name_required": "Navn er påkrævet", "m365_sched_name_required": "Navn er påkrævet",
"m365_sched_no_runs": "Ingen planlagte kørsler endnu", "m365_sched_no_runs": "Ingen planlagte kørsler endnu",
"m365_sched_no_jobs": "Ingen planlagte scanninger endnu.",
"m365_sched_running": "Kører...",
"m365_sched_disabled": "Deaktiveret",
"m365_sched_freq_daily": "Dagligt", "m365_sched_freq_daily": "Dagligt",
"m365_sched_freq_weekly": "Ugentligt", "m365_sched_freq_weekly": "Ugentligt",
"m365_sched_freq_monthly": "Månedligt", "m365_sched_freq_monthly": "Månedligt",
@ -759,9 +815,7 @@
"role_staff": "Ansat", "role_staff": "Ansat",
"role_student": "Elev", "role_student": "Elev",
"role_other": "Anden", "role_other": "Anden",
"m365_settings_tab_security": "Sikkerhed", "m365_settings_tab_security": "Sikkerhed",
"share_modal_title": "Del resultater", "share_modal_title": "Del resultater",
"share_modal_desc": "Skrivebeskyttede links lader en DPO eller gennemganger se resultater og tilknytte dispositioner uden adgang til scanningskontroller eller legitimationsoplysninger.", "share_modal_desc": "Skrivebeskyttede links lader en DPO eller gennemganger se resultater og tilknytte dispositioner uden adgang til scanningskontroller eller legitimationsoplysninger.",
"share_new_link": "Nyt link", "share_new_link": "Nyt link",
@ -794,13 +848,14 @@
"share_scope_all": "Alle", "share_scope_all": "Alle",
"share_scope_type_role": "Rolle", "share_scope_type_role": "Rolle",
"share_scope_type_user": "Bruger", "share_scope_type_user": "Bruger",
"share_date_from": "Emner fra",
"share_date_to": "Emner til og med",
"share_scope_role_lbl": "Rolle", "share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Brugerens e-mail", "share_scope_user_lbl": "Brugerens e-mail",
"share_scope_user_placeholder": "alice@skole.dk", "share_scope_user_placeholder": "alice@skole.dk",
"share_scope_user_invalid": "Angiv venligst en gyldig e-mailadresse for brugeromfanget.", "share_scope_user_invalid": "Angiv venligst en gyldig e-mailadresse for brugeromfanget.",
"share_scope_staff": "Ansatte", "share_scope_staff": "Ansatte",
"share_scope_student": "Elever", "share_scope_student": "Elever",
"viewer_pin_group_title": "Seerens PIN", "viewer_pin_group_title": "Seerens PIN",
"viewer_pin_desc": "En numerisk PIN (4–8 cifre), der lader alle åbne <code style=\"font-size:10px\">/view</code> i en browser for skrivebeskyttet adgang til resultater uden et token-link.", "viewer_pin_desc": "En numerisk PIN (4–8 cifre), der lader alle åbne <code style=\"font-size:10px\">/view</code> i en browser for skrivebeskyttet adgang til resultater uden et token-link.",
"viewer_pin_clear": "Ryd PIN", "viewer_pin_clear": "Ryd PIN",
@ -811,12 +866,11 @@
"viewer_pin_saved": "PIN gemt", "viewer_pin_saved": "PIN gemt",
"viewer_pin_clear_confirm": "Fjern seerens PIN? /view vil igen kræve et token-link.", "viewer_pin_clear_confirm": "Fjern seerens PIN? /view vil igen kræve et token-link.",
"viewer_pin_cleared": "PIN ryddet", "viewer_pin_cleared": "PIN ryddet",
"interface_pin_group_title": "Interface-PIN", "interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "En numerisk PIN-kode (4\u20138 cifre), der skal indtastes, inden man får adgang til selve scanneren. Seere, der tilgår <code style=\"font-size:10px\">/view</code>, er ikke berørt.", "interface_pin_desc": "En numerisk PIN-kode (4–8 cifre), der skal indtastes, inden man får adgang til selve scanneren. Seere, der tilgår <code style=\"font-size:10px\">/view</code>, er ikke berørt.",
"interface_pin_clear": "Ryd PIN", "interface_pin_clear": "Ryd PIN",
"interface_pin_is_set": "Interface-PIN er angivet", "interface_pin_is_set": "Interface-PIN er angivet",
"interface_pin_not_set_msg": "Ingen PIN angivet \u2014 grænsefladen er åben for alle på netværket", "interface_pin_not_set_msg": "Ingen PIN angivet — grænsefladen er åben for alle på netværket",
"interface_pin_saved": "PIN gemt", "interface_pin_saved": "PIN gemt",
"interface_pin_clear_confirm": "Fjern interface-PIN? Scanneren vil herefter være tilgængelig for alle på netværket.", "interface_pin_clear_confirm": "Fjern interface-PIN? Scanneren vil herefter være tilgængelig for alle på netværket.",
"interface_pin_cleared": "PIN ryddet", "interface_pin_cleared": "PIN ryddet",
@ -824,5 +878,31 @@
"interface_pin_login_btn": "Fortsæt", "interface_pin_login_btn": "Fortsæt",
"interface_pin_err_incorrect": "Forkert PIN.", "interface_pin_err_incorrect": "Forkert PIN.",
"interface_pin_err_too_many": "For mange forsøg. Prøv igen om lidt.", "interface_pin_err_too_many": "For mange forsøg. Prøv igen om lidt.",
"interface_pin_err_network": "Netværksfejl. Prøv igen." "interface_pin_err_network": "Netværksfejl. Prøv igen.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-forbedret navnegenkendelse",
"m365_ai_desc": "Brug Claude AI i stedet for spaCy til navn-, adresse- og organisationsgenkendelse. Betydeligt mere nøjagtig på dansk tekst — særligt dobbeltefternavne og fremmedsprogede navne. Kræver en Anthropic API-nøgle; faktureres pr. token.",
"m365_ai_enable": "Aktiver Claude NER",
"m365_ai_api_key_label": "Anthropic API-nøgle",
"m365_ai_show_key": "Vis",
"m365_ai_hide_key": "Skjul",
"m365_ai_key_set": "API-nøgle gemt",
"m365_ai_key_not_set": "Ingen API-nøgle gemt",
"m365_ai_test": "Test nøgle",
"m365_ai_testing": "Tester…",
"m365_ai_test_ok": "API-nøgle er gyldig",
"m365_ai_test_fail": "Test mislykkedes",
"m365_ai_saved": "Gemt",
"m365_ai_model_note": "Model: claude-haiku-4-5 · faktureres efter Anthropics token-priser · resultater caches pr. dokument.",
"m365_settings_updates": "Softwareopdatering",
"m365_update_idle": "Tjek om der findes en nyere version.",
"m365_update_auto": "Installér opdateringer automatisk (tjekkes dagligt — programmet genstarter selv)",
"m365_update_check": "Søg efter opdateringer",
"m365_update_install": "Installér opdatering",
"m365_update_checking": "Tjekker…",
"m365_update_uptodate": "Du kører den nyeste version.",
"m365_update_available": "Opdatering tilgængelig",
"m365_update_installing": "Installerer opdatering — programmet genstarter…",
"m365_update_failed": "Opdateringstjek mislykkedes",
"m365_update_scan_running": "Kan ikke opdatere, mens en scanning kører."
} }


@ -167,8 +167,8 @@
"history_lbl": "Verlauf", "history_lbl": "Verlauf",
"history_items": "Treffer", "history_items": "Treffer",
"history_btn_sessions": "Sessionen", "history_btn_sessions": "Sessionen",
"history_btn_latest": "Letzter Scan", "history_btn_latest": "Offene Einträge",
"history_picker_empty": "Keine fr\u00fcheren Scans", "history_picker_empty": "Keine früheren Scans",
"history_delta_badge": "Delta", "history_delta_badge": "Delta",
"history_latest_badge": "Aktuell", "history_latest_badge": "Aktuell",
"lbl_blurred": "Unscharf gemacht", "lbl_blurred": "Unscharf gemacht",
@ -348,8 +348,9 @@
"m365_resuming": "Fortsetzen — bereits gescannte Elemente werden übersprungen…", "m365_resuming": "Fortsetzen — bereits gescannte Elemente werden übersprungen…",
"m365_opt_delta": "Delta-Scan", "m365_opt_delta": "Delta-Scan",
"m365_opt_delta_hint": "Nur geänderte Elemente (nach erstem Vollscan)", "m365_opt_delta_hint": "Nur geänderte Elemente (nach erstem Vollscan)",
"m365_delta_tokens_saved": "Tokens gespeichert", "m365_delta_tokens_saved": "Tokens für {n} Quelle(n) gespeichert",
"m365_delta_clear": "Tokens löschen", "m365_delta_clear": "Tokens löschen",
"m365_delta_tokens_hint": "Gespeicherte Änderungstokens lassen Delta-Scans nur Elemente abrufen, die seit dem letzten Scan geändert wurden. Tokens löschen erzwingt beim nächsten Scan einen Vollscan.",
"m365_delta_cleared": "Delta-Tokens gelöscht — nächster Scan wird ein Vollscan.", "m365_delta_cleared": "Delta-Tokens gelöscht — nächster Scan wird ein Vollscan.",
"m365_delta_mode": "Delta-Modus — nur geänderte Elemente werden abgerufen…", "m365_delta_mode": "Delta-Modus — nur geänderte Elemente werden abgerufen…",
"m365_smtp_title": "✉ Bericht senden", "m365_smtp_title": "✉ Bericht senden",
@ -365,6 +366,7 @@
"m365_smtp_recipients_hint": "Komma- oder semikolongetrennt", "m365_smtp_recipients_hint": "Komma- oder semikolongetrennt",
"m365_smtp_save": "Speichern", "m365_smtp_save": "Speichern",
"m365_smtp_auto_email_manual": "Bericht nach manueller Suche senden", "m365_smtp_auto_email_manual": "Bericht nach manueller Suche senden",
"m365_smtp_prefer_smtp": "Immer via SMTP senden (Microsoft Graph überspringen)",
"m365_smtp_send": "Jetzt senden", "m365_smtp_send": "Jetzt senden",
"m365_smtp_saved": "Einstellungen gespeichert.", "m365_smtp_saved": "Einstellungen gespeichert.",
"m365_smtp_sending": "Senden…", "m365_smtp_sending": "Senden…",
@ -559,8 +561,8 @@
"m365_db_import_mode": "Modus:", "m365_db_import_mode": "Modus:",
"m365_db_import_merge": "Zusammenführen (sicher)", "m365_db_import_merge": "Zusammenführen (sicher)",
"m365_db_import_replace": "Ersetzen (vollständige Wiederherstellung)", "m365_db_import_replace": "Ersetzen (vollständige Wiederherstellung)",
"m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdpr_scanner.db haben.", "m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdprscanner/scanner.db haben.",
"m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdpr_scanner.db haben.\\n\\nFortfahren?", "m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdprscanner/scanner.db haben.\\n\\nFortfahren?",
"m365_db_import_no_file": "Bitte wählen Sie zuerst eine ZIP-Datei aus.", "m365_db_import_no_file": "Bitte wählen Sie zuerst eine ZIP-Datei aus.",
"m365_db_importing": "Importiere…", "m365_db_importing": "Importiere…",
"m365_db_imported": "Importiert", "m365_db_imported": "Importiert",
@ -570,7 +572,17 @@
"m365_opt_skip_gps": "GPS in Bildern ignorieren", "m365_opt_skip_gps": "GPS in Bildern ignorieren",
"m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.", "m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.",
"m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei", "m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei",
"m365_opt_scan_emails": "E-Mail-Adressen scannen",
"m365_opt_scan_emails_hint": "Markiert Dateien mit E-Mail-Adressen. Standardmäßig deaktiviert — E-Mail-Adressen sind sehr häufig und können viele Treffer erzeugen.",
"m365_opt_scan_phones": "Telefonnummern scannen",
"m365_opt_scan_phones_hint": "Markiert Dateien mit dänischen Telefonnummern (8 Ziffern). Nützlich zum Auffinden von Kontaktlisten.",
"m365_badge_emails": "E-Mail",
"m365_badge_phones": "Tel.",
"m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.", "m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
"m365_opt_cpr_only": "Nur-CPR-Modus",
"m365_opt_cpr_only_hint": "Markiert nur Dateien mit CPR-Nummern. Dateien mit nur E-Mail-Adressen, Telefonnummern, Gesichtern oder EXIF-Metadaten werden ignoriert.",
"m365_opt_ocr_lang": "OCR-Sprache",
"m365_opt_ocr_lang_hint": "Tesseract-Sprachpaket(e) für das Scannen von gescannten PDFs und Bildern. Pakete müssen auf dem Server installiert sein (z.B. tesseract-ocr-dan). Mehrere Pakete: dan+eng.",
"m365_filter_photo_only": "📷 Fotos / biometrisch", "m365_filter_photo_only": "📷 Fotos / biometrisch",
"m365_filter_all_roles": "Alle Rollen", "m365_filter_all_roles": "Alle Rollen",
"m365_filter_staff": "Personal", "m365_filter_staff": "Personal",
@ -598,16 +610,47 @@
"m365_file_sources_empty": "Keine Dateiquellen konfiguriert. Fügen Sie unten einen lokalen Ordner oder eine Netzwerkfreigabe hinzu.", "m365_file_sources_empty": "Keine Dateiquellen konfiguriert. Fügen Sie unten einen lokalen Ordner oder eine Netzwerkfreigabe hinzu.",
"m365_file_sources_add": "Quelle hinzufügen", "m365_file_sources_add": "Quelle hinzufügen",
"m365_fsrc_label": "Bezeichnung", "m365_fsrc_label": "Bezeichnung",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Pfad", "m365_fsrc_path": "Pfad",
"m365_fsrc_smb_detected": "SMB/CIFS-Netzwerkfreigabe erkannt", "m365_fsrc_smb_detected": "SMB/CIFS-Netzwerkfreigabe erkannt",
"m365_fsrc_smb_host": "SMB-Host", "m365_fsrc_smb_host": "SMB-Host",
"m365_fsrc_smb_user": "Benutzername", "m365_fsrc_smb_user": "Benutzername",
"m365_fsrc_smb_pw": "Passwort", "m365_fsrc_smb_pw": "Passwort",
"m365_fsrc_smb_pw_hint": "Das Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.", "m365_fsrc_smb_pw_hint": "Das Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_pw_keychain_placeholder": "Im OS-Schlüsselbund gespeichert",
"m365_fsrc_add_btn": "Hinzufügen", "m365_fsrc_add_btn": "Hinzufügen",
"m365_fsrc_saved": "Quelle gespeichert", "m365_fsrc_saved": "Quelle gespeichert",
"m365_fsrc_saving": "Speichern...", "m365_fsrc_saving": "Speichern...",
"m365_fsrc_path_required": "Pfad ist erforderlich.", "m365_fsrc_path_required": "Pfad ist erforderlich.",
"m365_fsrc_type_local": "Lokaler Ordner",
"m365_fsrc_type_smb": "Netzwerkfreigabe (SMB)",
"m365_fsrc_type_sftp": "SFTP-Server",
"m365_fsrc_sftp_host": "SFTP-Host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Benutzername",
"m365_fsrc_sftp_remote_path": "Remote-Pfad",
"m365_fsrc_sftp_auth_password": "Passwort",
"m365_fsrc_sftp_auth_key": "SSH-Schlüssel",
"m365_fsrc_sftp_pw": "Passwort",
"m365_fsrc_sftp_pw_hint": "Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_key_upload": "Private Schlüsseldatei",
"m365_fsrc_sftp_key_btn": "Schlüssel hochladen",
"m365_fsrc_sftp_key_uploaded": "Schlüssel hochgeladen",
"m365_fsrc_sftp_passphrase": "Passphrase (wenn Schlüssel verschlüsselt ist)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_not_installed": "paramiko nicht installiert — ausführen: pip install paramiko",
"m365_fsrc_name_placeholder": "z.B. Lehrerdateien, NAS-Archiv",
"m365_fsrc_path_placeholder": "~/Dokumente oder //nas/freigaben",
"m365_fsrc_smb_host_placeholder": "nas.schule.de",
"m365_fsrc_smb_user_placeholder": "DOMÄNE\\Benutzername",
"m365_fsrc_smb_user_edit_placeholder": "DOMÄNE\\Benutzername oder Benutzername",
"m365_fsrc_sftp_host_placeholder": "sftp.schule.de",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leer lassen, wenn der Schlüssel nicht verschlüsselt ist",
"m365_fsrc_sftp_host_required": "SFTP-Host ist erforderlich.",
"m365_fsrc_sftp_user_required": "SFTP-Benutzername ist erforderlich.",
"m365_fsrc_scan_btn": "Scannen", "m365_fsrc_scan_btn": "Scannen",
"m365_fsrc_scan_start": "Datei-Scan wird gestartet", "m365_fsrc_scan_start": "Datei-Scan wird gestartet",
"m365_src_group_files": "Dateiquellen", "m365_src_group_files": "Dateiquellen",
@ -634,6 +677,14 @@
"m365_settings_tab_general": "Allgemein", "m365_settings_tab_general": "Allgemein",
"m365_settings_tab_email": "E-Mail-Bericht", "m365_settings_tab_email": "E-Mail-Bericht",
"m365_settings_tab_database": "Datenbank", "m365_settings_tab_database": "Datenbank",
"m365_settings_tab_auditlog": "Prüfprotokoll",
"m365_audit_title": "Compliance-Prüfprotokoll",
"m365_audit_col_time": "Zeitpunkt",
"m365_audit_col_action": "Aktion",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Wird geladen…",
"m365_audit_empty": "Noch keine Prüfereignisse aufgezeichnet.",
"m365_settings_appearance": "Erscheinungsbild", "m365_settings_appearance": "Erscheinungsbild",
"m365_settings_language": "Sprache", "m365_settings_language": "Sprache",
"m365_settings_theme": "Design", "m365_settings_theme": "Design",
@ -704,6 +755,8 @@
"m365_sched_after_scan": "Nach dem Scan", "m365_sched_after_scan": "Nach dem Scan",
"m365_sched_auto_email": "Bericht automatisch senden", "m365_sched_auto_email": "Bericht automatisch senden",
"m365_sched_auto_retention": "Aufbewahrungsrichtlinie durchsetzen", "m365_sched_auto_retention": "Aufbewahrungsrichtlinie durchsetzen",
"m365_sched_report_only": "Nur Bericht",
"m365_sched_report_only_hint": "Letzte Scanergebnisse senden, ohne einen neuen Scan durchzuführen. Erfordert Scanergebnisse in der Datenbank.",
"m365_sched_status": "Status", "m365_sched_status": "Status",
"m365_sched_run_now": "▶ Jetzt ausführen", "m365_sched_run_now": "▶ Jetzt ausführen",
"m365_sched_add": "+ Geplante Suche hinzufügen", "m365_sched_add": "+ Geplante Suche hinzufügen",
@ -712,6 +765,9 @@
"m365_sched_editor_edit": "Geplante Suche bearbeiten", "m365_sched_editor_edit": "Geplante Suche bearbeiten",
"m365_sched_name_required": "Name ist erforderlich", "m365_sched_name_required": "Name ist erforderlich",
"m365_sched_no_runs": "Noch keine geplanten Läufe", "m365_sched_no_runs": "Noch keine geplanten Läufe",
"m365_sched_no_jobs": "Noch keine geplanten Scans.",
"m365_sched_running": "Läuft...",
"m365_sched_disabled": "Deaktiviert",
"m365_sched_freq_daily": "Täglich", "m365_sched_freq_daily": "Täglich",
"m365_sched_freq_weekly": "Wöchentlich", "m365_sched_freq_weekly": "Wöchentlich",
"m365_sched_freq_monthly": "Monatlich", "m365_sched_freq_monthly": "Monatlich",
@ -759,9 +815,7 @@
"role_staff": "Personal", "role_staff": "Personal",
"role_student": "Schüler", "role_student": "Schüler",
"role_other": "Andere", "role_other": "Andere",
"m365_settings_tab_security": "Sicherheit", "m365_settings_tab_security": "Sicherheit",
"share_modal_title": "Ergebnisse teilen", "share_modal_title": "Ergebnisse teilen",
"share_modal_desc": "Schreibgeschützte Links ermöglichen einem Datenschutzbeauftragten oder Prüfer, Ergebnisse einzusehen und Verwendungszwecke zuzuweisen, ohne Zugriff auf Scansteuerung oder Anmeldedaten.", "share_modal_desc": "Schreibgeschützte Links ermöglichen einem Datenschutzbeauftragten oder Prüfer, Ergebnisse einzusehen und Verwendungszwecke zuzuweisen, ohne Zugriff auf Scansteuerung oder Anmeldedaten.",
"share_new_link": "Neuer Link", "share_new_link": "Neuer Link",
@ -794,15 +848,16 @@
"share_scope_all": "Alle", "share_scope_all": "Alle",
"share_scope_type_role": "Rolle", "share_scope_type_role": "Rolle",
"share_scope_type_user": "Benutzer", "share_scope_type_user": "Benutzer",
"share_date_from": "Elemente ab",
"share_date_to": "Elemente bis",
"share_scope_role_lbl": "Rolle", "share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Benutzer-E-Mail", "share_scope_user_lbl": "Benutzer-E-Mail",
"share_scope_user_placeholder": "alice@schule.de", "share_scope_user_placeholder": "alice@schule.de",
"share_scope_user_invalid": "Bitte gib eine gültige E-Mail-Adresse für den Benutzerbereich an.", "share_scope_user_invalid": "Bitte gib eine gültige E-Mail-Adresse für den Benutzerbereich an.",
"share_scope_staff": "Mitarbeitende", "share_scope_staff": "Mitarbeitende",
"share_scope_student": "Schüler", "share_scope_student": "Schüler",
"viewer_pin_group_title": "Betrachter-PIN", "viewer_pin_group_title": "Betrachter-PIN",
"viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen \u2013 ohne Token-Link.", "viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen ohne Token-Link.",
"viewer_pin_clear": "PIN löschen", "viewer_pin_clear": "PIN löschen",
"viewer_pin_is_set": "Betrachter-PIN ist festgelegt", "viewer_pin_is_set": "Betrachter-PIN ist festgelegt",
"viewer_pin_not_set_msg": "Keine PIN festgelegt — /view erfordert einen Token-Link", "viewer_pin_not_set_msg": "Keine PIN festgelegt — /view erfordert einen Token-Link",
@ -811,12 +866,11 @@
"viewer_pin_saved": "PIN gespeichert", "viewer_pin_saved": "PIN gespeichert",
"viewer_pin_clear_confirm": "Betrachter-PIN entfernen? /view erfordert dann wieder einen Token-Link.", "viewer_pin_clear_confirm": "Betrachter-PIN entfernen? /view erfordert dann wieder einen Token-Link.",
"viewer_pin_cleared": "PIN gelöscht", "viewer_pin_cleared": "PIN gelöscht",
"interface_pin_group_title": "Interface-PIN", "interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "Eine numerische PIN (4\u20138 Stellen), die eingegeben werden muss, bevor auf die Scanner-Oberfläche zugegriffen werden kann. Betrachter, die <code style=\"font-size:10px\">/view</code> aufrufen, sind nicht betroffen.", "interface_pin_desc": "Eine numerische PIN (48 Stellen), die eingegeben werden muss, bevor auf die Scanner-Oberfläche zugegriffen werden kann. Betrachter, die <code style=\"font-size:10px\">/view</code> aufrufen, sind nicht betroffen.",
"interface_pin_clear": "PIN löschen", "interface_pin_clear": "PIN löschen",
"interface_pin_is_set": "Interface-PIN ist gesetzt", "interface_pin_is_set": "Interface-PIN ist gesetzt",
"interface_pin_not_set_msg": "Keine PIN gesetzt \u2014 Oberfläche ist für alle im Netzwerk offen", "interface_pin_not_set_msg": "Keine PIN gesetzt Oberfläche ist für alle im Netzwerk offen",
"interface_pin_saved": "PIN gespeichert", "interface_pin_saved": "PIN gespeichert",
"interface_pin_clear_confirm": "Interface-PIN entfernen? Der Scanner ist dann für alle im Netzwerk zugänglich.", "interface_pin_clear_confirm": "Interface-PIN entfernen? Der Scanner ist dann für alle im Netzwerk zugänglich.",
"interface_pin_cleared": "PIN gelöscht", "interface_pin_cleared": "PIN gelöscht",
@ -824,5 +878,31 @@
"interface_pin_login_btn": "Weiter", "interface_pin_login_btn": "Weiter",
"interface_pin_err_incorrect": "Falsche PIN.", "interface_pin_err_incorrect": "Falsche PIN.",
"interface_pin_err_too_many": "Zu viele Versuche. Bitte später erneut versuchen.", "interface_pin_err_too_many": "Zu viele Versuche. Bitte später erneut versuchen.",
"interface_pin_err_network": "Netzwerkfehler. Bitte erneut versuchen." "interface_pin_err_network": "Netzwerkfehler. Bitte erneut versuchen.",
"m365_settings_tab_ai": "KI / NER",
"m365_ai_title": "KI-gestützte Entitätserkennung",
"m365_ai_desc": "Claude KI statt spaCy für Name-, Adress- und Organisationserkennung verwenden. Deutlich genauer bei dänischen Texten — insbesondere bei Doppelnamen und fremdsprachigen Namen. Benötigt einen Anthropic-API-Schlüssel; Abrechnung per Token.",
"m365_ai_enable": "Claude NER aktivieren",
"m365_ai_api_key_label": "Anthropic-API-Schlüssel",
"m365_ai_show_key": "Anzeigen",
"m365_ai_hide_key": "Ausblenden",
"m365_ai_key_set": "API-Schlüssel gespeichert",
"m365_ai_key_not_set": "Kein API-Schlüssel gespeichert",
"m365_ai_test": "Schlüssel testen",
"m365_ai_testing": "Wird getestet…",
"m365_ai_test_ok": "API-Schlüssel gültig",
"m365_ai_test_fail": "Test fehlgeschlagen",
"m365_ai_saved": "Gespeichert",
"m365_ai_model_note": "Modell: claude-haiku-4-5 · Abrechnung nach Anthropic-Token-Tarifen · Ergebnisse werden pro Dokument gecacht.",
"m365_settings_updates": "Softwareaktualisierung",
"m365_update_idle": "Prüfen, ob eine neuere Version verfügbar ist.",
"m365_update_auto": "Updates automatisch installieren (tägliche Prüfung — die App startet sich selbst neu)",
"m365_update_check": "Nach Updates suchen",
"m365_update_install": "Update installieren",
"m365_update_checking": "Wird geprüft…",
"m365_update_uptodate": "Sie verwenden die neueste Version.",
"m365_update_available": "Update verfügbar",
"m365_update_installing": "Update wird installiert — die App startet neu…",
"m365_update_failed": "Updateprüfung fehlgeschlagen",
"m365_update_scan_running": "Update nicht möglich, während ein Scan läuft."
} }

@ -106,7 +106,7 @@
"history_lbl": "History", "history_lbl": "History",
"history_items": "items", "history_items": "items",
"history_btn_sessions": "Sessions", "history_btn_sessions": "Sessions",
"history_btn_latest": "Latest scan", "history_btn_latest": "Open items",
"history_picker_empty": "No past scans", "history_picker_empty": "No past scans",
"history_delta_badge": "Delta", "history_delta_badge": "Delta",
"history_latest_badge": "Latest", "history_latest_badge": "Latest",
@ -348,8 +348,9 @@
"m365_resuming": "Resuming — skipping already-scanned items…", "m365_resuming": "Resuming — skipping already-scanned items…",
"m365_opt_delta": "Delta scan", "m365_opt_delta": "Delta scan",
"m365_opt_delta_hint": "Changed items only (after first full scan)", "m365_opt_delta_hint": "Changed items only (after first full scan)",
"m365_delta_tokens_saved": "Tokens saved", "m365_delta_tokens_saved": "Tokens saved for {n} source(s)",
"m365_delta_clear": "Clear tokens", "m365_delta_clear": "Clear tokens",
"m365_delta_tokens_hint": "Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.",
"m365_delta_cleared": "Delta tokens cleared — next scan will be a full scan.", "m365_delta_cleared": "Delta tokens cleared — next scan will be a full scan.",
"m365_delta_mode": "Delta mode — fetching changed items only…", "m365_delta_mode": "Delta mode — fetching changed items only…",
"m365_smtp_title": "✉ Email report", "m365_smtp_title": "✉ Email report",
@ -365,6 +366,7 @@
"m365_smtp_recipients_hint": "Comma or semicolon separated", "m365_smtp_recipients_hint": "Comma or semicolon separated",
"m365_smtp_save": "Save", "m365_smtp_save": "Save",
"m365_smtp_auto_email_manual": "Email report after manual scan", "m365_smtp_auto_email_manual": "Email report after manual scan",
"m365_smtp_prefer_smtp": "Always send via SMTP (skip Microsoft Graph)",
"m365_smtp_send": "Send now", "m365_smtp_send": "Send now",
"m365_smtp_saved": "Settings saved.", "m365_smtp_saved": "Settings saved.",
"m365_smtp_sending": "Sending…", "m365_smtp_sending": "Sending…",
@ -559,8 +561,8 @@
"m365_db_import_mode": "Mode:", "m365_db_import_mode": "Mode:",
"m365_db_import_merge": "Merge (safe)", "m365_db_import_merge": "Merge (safe)",
"m365_db_import_replace": "Replace (full restore)", "m365_db_import_replace": "Replace (full restore)",
"m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.", "m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.",
"m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\\n\\nProceed?", "m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\\n\\nProceed?",
"m365_db_import_no_file": "Please select a ZIP file first.", "m365_db_import_no_file": "Please select a ZIP file first.",
"m365_db_importing": "Importing…", "m365_db_importing": "Importing…",
"m365_db_imported": "Imported", "m365_db_imported": "Imported",
@ -570,7 +572,17 @@
"m365_opt_skip_gps": "Ignore GPS in images", "m365_opt_skip_gps": "Ignore GPS in images",
"m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.", "m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.",
"m365_opt_min_cpr": "Min. CPR count per file", "m365_opt_min_cpr": "Min. CPR count per file",
"m365_opt_scan_emails": "Scan for email addresses",
"m365_opt_scan_emails_hint": "Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.",
"m365_opt_scan_phones": "Scan for phone numbers",
"m365_opt_scan_phones_hint": "Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.",
"m365_badge_emails": "email",
"m365_badge_phones": "phone",
"m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.", "m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
"m365_opt_cpr_only": "CPR-only mode",
"m365_opt_cpr_only_hint": "Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, detected faces, or EXIF metadata are skipped.",
"m365_opt_ocr_lang": "OCR language",
"m365_opt_ocr_lang_hint": "Tesseract language pack(s) used when scanning scanned PDFs and images. Language packs must be installed on the server (e.g. tesseract-ocr-dan). Multiple packs: dan+eng.",
"m365_filter_photo_only": "📷 Photos / biometric", "m365_filter_photo_only": "📷 Photos / biometric",
"m365_filter_all_roles": "All roles", "m365_filter_all_roles": "All roles",
"m365_filter_staff": "Staff", "m365_filter_staff": "Staff",
@ -598,16 +610,47 @@
"m365_file_sources_empty": "No file sources configured. Add a local folder or network share below.", "m365_file_sources_empty": "No file sources configured. Add a local folder or network share below.",
"m365_file_sources_add": "Add source", "m365_file_sources_add": "Add source",
"m365_fsrc_label": "Label", "m365_fsrc_label": "Label",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Path", "m365_fsrc_path": "Path",
"m365_fsrc_smb_detected": "SMB/CIFS network share detected", "m365_fsrc_smb_detected": "SMB/CIFS network share detected",
"m365_fsrc_smb_host": "SMB host", "m365_fsrc_smb_host": "SMB host",
"m365_fsrc_smb_user": "Username", "m365_fsrc_smb_user": "Username",
"m365_fsrc_smb_pw": "Password", "m365_fsrc_smb_pw": "Password",
"m365_fsrc_smb_pw_hint": "Password is saved to the OS keychain — never stored in a file.", "m365_fsrc_smb_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_pw_keychain_placeholder": "Stored in OS keychain",
"m365_fsrc_add_btn": "Add", "m365_fsrc_add_btn": "Add",
"m365_fsrc_saved": "Source saved", "m365_fsrc_saved": "Source saved",
"m365_fsrc_saving": "Saving...", "m365_fsrc_saving": "Saving...",
"m365_fsrc_path_required": "Path is required.", "m365_fsrc_path_required": "Path is required.",
"m365_fsrc_type_local": "Local folder",
"m365_fsrc_type_smb": "Network share (SMB)",
"m365_fsrc_type_sftp": "SFTP server",
"m365_fsrc_sftp_host": "SFTP host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Username",
"m365_fsrc_sftp_remote_path": "Remote path",
"m365_fsrc_sftp_auth_password": "Password",
"m365_fsrc_sftp_auth_key": "SSH key",
"m365_fsrc_sftp_pw": "Password",
"m365_fsrc_sftp_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_key_upload": "Private key file",
"m365_fsrc_sftp_key_btn": "Upload key",
"m365_fsrc_sftp_key_uploaded": "Key uploaded",
"m365_fsrc_sftp_passphrase": "Passphrase (if key is encrypted)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_not_installed": "paramiko not installed — run: pip install paramiko",
"m365_fsrc_name_placeholder": "e.g. Teacher files, NAS archive",
"m365_fsrc_path_placeholder": "~/Documents or //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.school.dk",
"m365_fsrc_smb_user_placeholder": "DOMAIN\\username",
"m365_fsrc_smb_user_edit_placeholder": "DOMAIN\\username or username",
"m365_fsrc_sftp_host_placeholder": "sftp.school.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leave blank if key has no passphrase",
"m365_fsrc_sftp_host_required": "SFTP host is required.",
"m365_fsrc_sftp_user_required": "SFTP username is required.",
"m365_fsrc_scan_btn": "Scan", "m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starting file scan", "m365_fsrc_scan_start": "Starting file scan",
"m365_src_group_files": "File sources", "m365_src_group_files": "File sources",
@ -634,6 +677,14 @@
"m365_settings_tab_general": "General", "m365_settings_tab_general": "General",
"m365_settings_tab_email": "Email report", "m365_settings_tab_email": "Email report",
"m365_settings_tab_database": "Database", "m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Audit Log",
"m365_audit_title": "Compliance Audit Log",
"m365_audit_col_time": "Time",
"m365_audit_col_action": "Action",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Loading…",
"m365_audit_empty": "No audit events recorded yet.",
"m365_settings_appearance": "Appearance", "m365_settings_appearance": "Appearance",
"m365_settings_language": "Language", "m365_settings_language": "Language",
"m365_settings_theme": "Theme", "m365_settings_theme": "Theme",
@ -704,6 +755,8 @@
"m365_sched_after_scan": "After scan", "m365_sched_after_scan": "After scan",
"m365_sched_auto_email": "Email report automatically", "m365_sched_auto_email": "Email report automatically",
"m365_sched_auto_retention": "Enforce retention policy", "m365_sched_auto_retention": "Enforce retention policy",
"m365_sched_report_only": "Report only",
"m365_sched_report_only_hint": "Email the latest scan results without running a new scan. Requires scan results in the database.",
"m365_sched_status": "Status", "m365_sched_status": "Status",
"m365_sched_run_now": "▶ Run now", "m365_sched_run_now": "▶ Run now",
"m365_sched_add": "+ Add scheduled scan", "m365_sched_add": "+ Add scheduled scan",
@ -712,6 +765,9 @@
"m365_sched_editor_edit": "Edit scheduled scan", "m365_sched_editor_edit": "Edit scheduled scan",
"m365_sched_name_required": "Name is required", "m365_sched_name_required": "Name is required",
"m365_sched_no_runs": "No scheduled runs yet", "m365_sched_no_runs": "No scheduled runs yet",
"m365_sched_no_jobs": "No scheduled scans yet.",
"m365_sched_running": "Running...",
"m365_sched_disabled": "Disabled",
"m365_sched_freq_daily": "Daily", "m365_sched_freq_daily": "Daily",
"m365_sched_freq_weekly": "Weekly", "m365_sched_freq_weekly": "Weekly",
"m365_sched_freq_monthly": "Monthly", "m365_sched_freq_monthly": "Monthly",
@ -759,9 +815,7 @@
"role_staff": "Staff", "role_staff": "Staff",
"role_student": "Student", "role_student": "Student",
"role_other": "Other", "role_other": "Other",
"m365_settings_tab_security": "Security", "m365_settings_tab_security": "Security",
"share_modal_title": "Share results", "share_modal_title": "Share results",
"share_modal_desc": "Read-only links let a DPO or reviewer browse results and tag dispositions without access to scan controls or credentials.", "share_modal_desc": "Read-only links let a DPO or reviewer browse results and tag dispositions without access to scan controls or credentials.",
"share_new_link": "New link", "share_new_link": "New link",
@ -794,29 +848,29 @@
"share_scope_all": "All", "share_scope_all": "All",
"share_scope_type_role": "Role", "share_scope_type_role": "Role",
"share_scope_type_user": "User", "share_scope_type_user": "User",
"share_date_from": "Items from",
"share_date_to": "Items until",
"share_scope_role_lbl": "Role", "share_scope_role_lbl": "Role",
"share_scope_user_lbl": "User email", "share_scope_user_lbl": "User email",
"share_scope_user_placeholder": "alice@school.dk", "share_scope_user_placeholder": "alice@school.dk",
"share_scope_user_invalid": "Please enter a valid email address for the user scope.", "share_scope_user_invalid": "Please enter a valid email address for the user scope.",
"share_scope_staff": "Staff", "share_scope_staff": "Staff",
"share_scope_student": "Students", "share_scope_student": "Students",
"viewer_pin_group_title": "Viewer PIN", "viewer_pin_group_title": "Viewer PIN",
"viewer_pin_desc": "A numeric PIN (4\u20138 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.", "viewer_pin_desc": "A numeric PIN (48 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.",
"viewer_pin_clear": "Clear PIN", "viewer_pin_clear": "Clear PIN",
"viewer_pin_is_set": "Viewer PIN is set", "viewer_pin_is_set": "Viewer PIN is set",
"viewer_pin_not_set_msg": "No PIN set \u2014 /view requires a token link", "viewer_pin_not_set_msg": "No PIN set /view requires a token link",
"viewer_pin_format": "PIN must be 4\u20138 digits.", "viewer_pin_format": "PIN must be 48 digits.",
"viewer_pin_saving": "Saving\u2026", "viewer_pin_saving": "Saving",
"viewer_pin_saved": "PIN saved", "viewer_pin_saved": "PIN saved",
"viewer_pin_clear_confirm": "Remove the viewer PIN? /view will require a token link again.", "viewer_pin_clear_confirm": "Remove the viewer PIN? /view will require a token link again.",
"viewer_pin_cleared": "PIN cleared", "viewer_pin_cleared": "PIN cleared",
"interface_pin_group_title": "Interface PIN", "interface_pin_group_title": "Interface PIN",
"interface_pin_desc": "A numeric PIN (4\u20138 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style=\"font-size:10px\">/view</code> are not affected.", "interface_pin_desc": "A numeric PIN (48 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style=\"font-size:10px\">/view</code> are not affected.",
"interface_pin_clear": "Clear PIN", "interface_pin_clear": "Clear PIN",
"interface_pin_is_set": "Interface PIN is set", "interface_pin_is_set": "Interface PIN is set",
"interface_pin_not_set_msg": "No PIN set \u2014 interface is open to anyone on the network", "interface_pin_not_set_msg": "No PIN set interface is open to anyone on the network",
"interface_pin_saved": "PIN saved", "interface_pin_saved": "PIN saved",
"interface_pin_clear_confirm": "Remove the interface PIN? The scanner will be accessible to anyone on the network.", "interface_pin_clear_confirm": "Remove the interface PIN? The scanner will be accessible to anyone on the network.",
"interface_pin_cleared": "PIN cleared", "interface_pin_cleared": "PIN cleared",
@ -824,5 +878,31 @@
"interface_pin_login_btn": "Continue", "interface_pin_login_btn": "Continue",
"interface_pin_err_incorrect": "Incorrect PIN.", "interface_pin_err_incorrect": "Incorrect PIN.",
"interface_pin_err_too_many": "Too many attempts. Try again later.", "interface_pin_err_too_many": "Too many attempts. Try again later.",
"interface_pin_err_network": "Network error. Please try again." "interface_pin_err_network": "Network error. Please try again.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-Enhanced Named Entity Recognition",
"m365_ai_desc": "Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.",
"m365_ai_enable": "Enable Claude NER",
"m365_ai_api_key_label": "Anthropic API key",
"m365_ai_show_key": "Show",
"m365_ai_hide_key": "Hide",
"m365_ai_key_set": "API key saved",
"m365_ai_key_not_set": "No API key saved",
"m365_ai_test": "Test key",
"m365_ai_testing": "Testing…",
"m365_ai_test_ok": "API key valid",
"m365_ai_test_fail": "Test failed",
"m365_ai_saved": "Saved",
"m365_ai_model_note": "Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.",
"m365_settings_updates": "Software update",
"m365_update_idle": "Check whether a newer version is available.",
"m365_update_auto": "Install updates automatically (checked daily — the app restarts itself)",
"m365_update_check": "Check for updates",
"m365_update_install": "Install update",
"m365_update_checking": "Checking…",
"m365_update_uptodate": "You are running the latest version.",
"m365_update_available": "Update available",
"m365_update_installing": "Installing update — the app will restart…",
"m365_update_failed": "Update check failed",
"m365_update_scan_running": "Cannot update while a scan is running."
} }

@ -39,9 +39,11 @@ except ImportError:
GRAPH_BASE = "https://graph.microsoft.com/v1.0" GRAPH_BASE = "https://graph.microsoft.com/v1.0"
# Delegated scopes — used when signing in as a specific user (device code flow) # Delegated scopes — used when signing in as a specific user (device code flow)
# Files.ReadWrite.All is a superset of Files.Read.All; required for in-place
# OneDrive/SharePoint/Teams redaction (PUT /drives/{id}/items/{id}/content).
SCOPES = [ SCOPES = [
"Mail.Read", "Mail.Read",
"Files.Read.All", "Files.ReadWrite.All",
"Sites.Read.All", "Sites.Read.All",
"Team.ReadBasic.All", "Team.ReadBasic.All",
"ChannelMessage.Read.All", "ChannelMessage.Read.All",
@ -82,8 +84,9 @@ class M365PermissionError(M365Error):
f"to access this resource.\n" f"to access this resource.\n"
f" Path: {path}\n" f" Path: {path}\n"
f" Fix: the signed-in user must be a Global/Exchange Admin, OR an admin must " f" Fix: the signed-in user must be a Global/Exchange Admin, OR an admin must "
f"grant Application permissions (Mail.Read, Files.Read.All, Sites.Read.All) " f"grant Application permissions (Mail.Read, Files.ReadWrite.All, Sites.Read.All) "
f"in Azure → App registrations → API permissions → Grant admin consent." f"in Azure → App registrations → API permissions → Grant admin consent.\n"
f" Note: Files.ReadWrite.All (not Files.Read.All) is required for file redaction."
) )
@ -549,6 +552,8 @@ class M365Connector:
r.raise_for_status() r.raise_for_status()
return True # 204 No Content = success return True # 204 No Content = success
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}") raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
def delete_message(self, user_id: str, message_id: str) -> bool:
"""Move an email to Deleted Items (soft delete).""" """Move an email to Deleted Items (soft delete)."""
base = "/me" if (not user_id or user_id == "me") else f"/users/{user_id}" base = "/me" if (not user_id or user_id == "me") else f"/users/{user_id}"
try: try:
@ -885,6 +890,50 @@ class M365Connector:
url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content" url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
return self._get_bytes(url) return self._get_bytes(url)
def put_drive_item_content(self, drive_id: str, item_id: str, content: bytes,
user_id: str = "") -> None:
"""Replace file content via Graph. Tries drives/{drive_id} first; falls back
to users/{user_id}/drive when drive_id is absent, then /me/drive."""
if drive_id:
url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
elif user_id and user_id != "me":
url = f"{GRAPH_BASE}/users/{user_id}/drive/items/{item_id}/content"
else:
url = f"{GRAPH_BASE}/me/drive/items/{item_id}/content"
for attempt in range(self._MAX_RETRIES):
try:
r = _requests.put(url, headers={**self._headers(),
"Content-Type": "application/octet-stream"},
data=content, timeout=self._TIMEOUT_BYTES)
except self._RETRYABLE_ERRORS:
if attempt == self._MAX_RETRIES - 1:
raise
self._backoff_sleep(attempt)
continue
if r.status_code == 429:
self._backoff_sleep(attempt, float(r.headers.get("Retry-After", 5)))
continue
if r.status_code in (503, 504):
if attempt < self._MAX_RETRIES - 1:
self._backoff_sleep(attempt)
continue
if r.status_code == 401 and attempt == 0:
self._token = None
if self.try_silent_auth():
self.put_drive_item_content(drive_id, item_id, content, user_id)
return
if r.status_code == 403:
try:
msg = r.json().get("error", {}).get("message", "")
except Exception:
msg = r.text[:200]
raise M365PermissionError(url, msg)
r.raise_for_status()
return
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
# ── Teams ───────────────────────────────────────────────────────────────── # ── Teams ─────────────────────────────────────────────────────────────────
def list_all_teams(self) -> list: def list_all_teams(self) -> list:

@ -37,12 +37,16 @@ pystray>=0.19 # System tray icon
# ── File system scanning (optional) ────────────────────────────────────────── # ── File system scanning (optional) ──────────────────────────────────────────
smbprotocol>=1.13 # SMB2/3 network share scanning without mounting smbprotocol>=1.13 # SMB2/3 network share scanning without mounting
keyring>=25.0 # OS keychain credential storage for SMB passwords paramiko>=3.4 # SFTP scanning over SSH
keyring>=25.0 # OS keychain credential storage for SMB/SFTP passwords
python-dotenv>=1.0 # .env file fallback for headless SMB credentials python-dotenv>=1.0 # .env file fallback for headless SMB credentials
# ── Scheduler (#19) ────────────────────────────────────────────────────────── # ── Scheduler (#19) ──────────────────────────────────────────────────────────
APScheduler>=3.10 # In-process scheduled scans APScheduler>=3.10 # In-process scheduled scans
# ── AI NER (Claude) ──────────────────────────────────────────────────────────
anthropic>=0.40.0 # Claude API client for AI-enhanced NER
# ── Google Workspace scanning (#10) ────────────────────────────────────────── # ── Google Workspace scanning (#10) ──────────────────────────────────────────
google-auth>=2.0 # Service account + domain-wide delegation google-auth>=2.0 # Service account + domain-wide delegation
google-auth-httplib2 # HTTP transport for google-auth google-auth-httplib2 # HTTP transport for google-auth

@ -19,6 +19,99 @@ All three scan engines must include `"source": "m365"` / `"google"` / `"file"` i
## `_scan_bytes` injection ## `_scan_bytes` injection
`scan_engine.py` declares stub versions of `_scan_bytes` / `_scan_bytes_timeout` at module level. `gdpr_scanner.py` replaces them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` pulls them from `gdpr_scanner` via `__getattr__`. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier. `scan_engine.py` declares stub versions of `_scan_bytes` / `_scan_bytes_timeout` at module level. `gdpr_scanner.py` replaces them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` pulls them from `gdpr_scanner` via `__getattr__`. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
## M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler and broadcast as a red `scan_error`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped silently. **Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs.
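A minimal sketch of the hierarchy and handler mapping in the table above, assuming simplified single-argument constructors. `raise_for_graph_status`, `scan_user_onedrive`, and the `broadcast` callback are hypothetical stand-ins for illustration, not the connector's real `_get()` or scan code.

```python
class M365Error(Exception):
    """Base class for all connector errors."""


class M365PermissionError(M365Error):
    """403 Forbidden."""


class M365DeltaTokenExpired(M365Error):
    """410 Gone on a delta endpoint."""


class M365DriveNotFound(M365Error):
    """404 Not Found."""


def raise_for_graph_status(status_code: int, path: str, is_delta: bool = False) -> None:
    """Hypothetical helper: map a Graph status code to a specific exception."""
    if status_code == 403:
        raise M365PermissionError(path)
    if status_code == 410 and is_delta:
        raise M365DeltaTokenExpired(path)
    if status_code == 404:
        # Raise instead of returning a fallback value, so path bugs stay visible.
        raise M365DriveNotFound(path)


def scan_user_onedrive(status_code: int, path: str, broadcast) -> list:
    """Hypothetical caller: a missing drive is a skipped phase, not an error."""
    try:
        raise_for_graph_status(status_code, path)
    except M365DriveNotFound:
        broadcast("scan_phase", f"{path}: not provisioned, skipped")
        return []
    except M365PermissionError as exc:
        broadcast("scan_error", f"permission denied: {exc}")
        return []
    return ["item"]
```

Because the 404 case has its own exception type, the caller can choose the quiet `scan_phase` message without catching unrelated failures.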
## Export — routes/export.py
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items.
- **Excel Summary sheet** — shows all scanned sources (even with 0 items). Per-source tabs only created for sources with items.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function; GPS sheet, External transfers sheet, and Art.30 tables all see only the filtered subset. Filenames get `_elever` / `_ansatte` suffix.
- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, logs a `"redacted"` disposition. Source types: `local` (DOCX/XLSX/CSV/TXT/PDF, written via temp+move), `onedrive`/`sharepoint`/`teams` (Graph download → redact → PUT, requires `Files.ReadWrite.All`), `gdrive` (Drive API, requires `drive` scope), `sftp` (paramiko read/write, item must still be in `state.flagged_items`), `smb` (smbprotocol `FILE_SUPERSEDE`). **Keep `_redactExts`/`_cloudRedactExts` in `results.js` and `_REDACT_EXTS`/`_GDRIVE_MIME_MAP`/`_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
- **PDF redaction** — `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` (physical removal). Falls back to reportlab overlay if PyMuPDF absent. Text pages use `find_cpr_char_bboxes`; scanned pages use OCR at 200 DPI + `find_cpr_image_bboxes`.
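For plain-text sources, the block-character replacement described for `POST /api/redact_item` can be sketched roughly as follows. This is an illustration only: the regular expression and the `redact_cpr` name are assumptions, and the real detector in `cpr_detector` may validate candidates more strictly than a bare digit pattern.

```python
import re

# Assumed pattern: six digits, an optional hyphen, four digits.
_CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def redact_cpr(text: str) -> str:
    """Replace every digit of a CPR-shaped match with a block character."""
    def _mask(match: re.Match) -> str:
        # Keep the hyphen (if any) so the redacted shape stays recognisable.
        return re.sub(r"\d", "█", match.group(0))

    return _CPR_RE.sub(_mask, text)
```

Masking digit by digit keeps the text length unchanged, which helps when the surrounding layout (CSV columns, fixed-width text) must survive the rewrite.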
## Preview — routes/database.py
`GET /api/preview/<item_id>?source_type=…&account_id=…` dispatches by `source_type`:
- **`local` / `smb`** — re-reads from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline.
- **`email`** — fetches M365 message body via Graph (requires `state.connector`).
- **`gmail`** — shows info card with "Open in Gmail" link (X-Frame-Options blocks embedding).
- **`gdrive`** — returns `https://drive.google.com/file/d/{id}/preview` iframe.
- **All other values** (M365 files) — calls Graph `/preview` POST; tries `drive_id`-based path first, then user-drive, then `/me/drive`.
**`_source_type` must be set in `google_scan.py`** — Gmail items need `meta["_source_type"] = "gmail"` and Drive items `"gdrive"` before `_broadcast_card`. Without it, cards fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
**`state.connector` guard** — only the `email` and M365 `else` branches require M365 auth. The `local`/`smb`/`gmail`/`gdrive` branches must not gate on `state.connector` — they work in Google-only deployments.
## Compliance audit log — gdpr_db.py + routes/
- **`audit_log` table** — created by `_DDL` (`CREATE TABLE IF NOT EXISTS`), auto-appears on next server start. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- **`log_audit_event(action, detail, actor, ip)`** — module-level helper; silently no-ops on any exception. Import: `from gdpr_db import log_audit_event as _audit`.
- **`GET /api/audit_log?limit=200&action=<filter>`** — in `routes/app_routes.py`. No auth gate.
- **Recorded events** — `profile_save/delete`, `token_create/revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear`, `source_add/update/delete`, `scheduler_job_save/delete`, `scan_start/stop`, `smtp_save`, `disposition`, `disposition_bulk`, `admin_pin_set/change`, `item_delete`, `item_redact`, `app_update`.
- **`actor` always empty** — no per-user login; field reserved for future use.
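A minimal sketch of the "silently no-ops on any exception" behaviour, using an in-memory SQLite connection. The explicit `conn` parameter and the column types are assumptions made so the example is self-contained; the real helper is called as `log_audit_event(action, detail, actor, ip)` and locates the database itself.

```python
import sqlite3
import time

# Column names follow the schema listed above; the types are assumed.
_DDL = """
CREATE TABLE IF NOT EXISTS audit_log (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    ts     REAL NOT NULL,
    action TEXT NOT NULL,
    actor  TEXT DEFAULT '',
    detail TEXT DEFAULT '',
    ip     TEXT DEFAULT ''
)
"""


def log_audit_event(conn, action, detail="", actor="", ip=""):
    """Write one audit row; never let a logging failure break the request."""
    try:
        conn.execute(_DDL)
        conn.execute(
            "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?, ?, ?, ?, ?)",
            (time.time(), action, actor, detail, ip),
        )
        conn.commit()
    except Exception:
        # Deliberate: auditing is best-effort and must not raise.
        pass
```

The trade-off is that a broken audit table goes unnoticed, so a deployment that depends on the log for compliance may want a separate health check.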
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — Graph `sendMail` returns HTTP 202 with no body; `r.json()` on empty raises `JSONDecodeError`. Do not revert to unconditional `r.json()`.
- **Graph preferred over SMTP** — `smtp_test` and `send_report` try `_send_email_graph()` first; fall back to SMTP only if Graph raises. If Graph fails and no SMTP host saved, the Graph exception surfaces directly.
- **Auto-email after manual scan** — `_maybe_send_auto_email()` in `routes/scan.py` called from the `_run()` thread after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")`; no-ops if false, no flagged items, or no recipients.
- **Gmail vs Google Workspace** — auth error handlers check if SMTP username ends in `@gmail.com`/`@googlemail.com`; custom domains are treated as Google Workspace and error message points to the Workspace admin console.
- **Canonical SMTP config keys are `username` and `use_tls`** — all backend readers (`smtp_test`, `_send_report_email`, `_send_email_graph`) use these. The Settings → E-mailrapport tab (`scheduler.js`) historically saved `user`/`starttls`, which left `username` empty so `server.login()` was skipped and the server rejected the send. Frontend now sends the canonical keys, and `_load_smtp_config()` normalises legacy `user` → `username` / `starttls` → `use_tls` for already-saved configs. The send-report modal (`scan.js`) already used the canonical keys. Keep both UIs and the backend on `username`/`use_tls`.
- **Graph 202 ≠ delivered** — `_send_email_graph` returns on Graph's HTTP 202 (queued), and `smtp_test`/`send_report` treat that as success and never fall back to SMTP. A recipient on a domain Exchange Online considers an accepted/internal domain (e.g. a Google-hosted subdomain of the O365 domain) is silently dropped after the 202. There is no in-app fix for that routing; reaching such recipients requires SMTP (e.g. Google Workspace `smtp.gmail.com`/`smtp-relay.gmail.com`) or fixing Exchange Accepted Domains.
- **`prefer_smtp` config flag** — when truthy, `smtp_test`, `send_report`, and `_maybe_send_auto_email` (routes/scan.py) skip the Graph path entirely and send via SMTP. This is the in-app escape hatch for the Graph-202 routing trap above. The gate is `... and not smtp_cfg.get("prefer_smtp")` on each Graph branch — keep all three in sync. UI: `#st-smtpPreferSmtp` toggle (key `m365_smtp_prefer_smtp`), saved/loaded by `scheduler.js`.
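The legacy-key normalisation and the `prefer_smtp` gate described above could look roughly like this. Both function names are hypothetical; the real logic lives in `_load_smtp_config()` and inline in each route's Graph branch.

```python
def normalise_smtp_config(cfg: dict) -> dict:
    """Map legacy `user` / `starttls` keys onto canonical `username` / `use_tls`."""
    out = dict(cfg)
    legacy_user = out.pop("user", None)
    if not out.get("username") and legacy_user:
        out["username"] = legacy_user
    legacy_tls = out.pop("starttls", None)
    if "use_tls" not in out and legacy_tls is not None:
        out["use_tls"] = bool(legacy_tls)
    return out


def should_use_graph(graph_authenticated: bool, smtp_cfg: dict) -> bool:
    """Graph is tried first unless the user opted to always send via SMTP."""
    return bool(graph_authenticated and not smtp_cfg.get("prefer_smtp"))
```

Routing every sender through one predicate like `should_use_graph` is one way to keep the three call sites in sync, though the source keeps the condition inline.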
## Scheduler — scan_scheduler.py + routes/scheduler.py
- **Job config keys** — `id`, `name`, `enabled`, `frequency` (daily/weekly/monthly), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`.
- **`_execute_scan(job_id)`** — acquires per-job lock (`_running_jobs` set), records DB run via `db.begin_schedule_run()`, runs M365 → file → Google pipeline, then emails and applies retention. DB run finalised in `finally`.
- **Report-only path** — when `report_only=True`, short-circuits before M365 auth check, populates `_m.flagged_items` from `db.get_session_items()` if empty, calls `_send_email_report()`. Does NOT acquire scan lock; fails with `RuntimeError("No scan results available")` if DB is also empty.
- **`_m.flagged_items` and `state.flagged_items` are the same object** — assigned at startup; in-place updates (`flagged_items[:] = ...`) propagate to both.
- **`scheduler_started` / `scheduler_done` SSE events** — separate from `scan_done` (M365). `scheduler_done` carries `flagged`, `scanned`, `emailed`, `job_name`.
- **Profile options merge into file sources** — scheduler unpacks `{**fs, **_fs_extra}` before calling `run_file_scan(fs)`. Do not pass `fs` directly — the file scan reads `source.get(...)` and silently falls back to defaults without the merge.
## Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered NER replacing spaCy. Activated via `config.json` keys `claude_ner` (bool) and `claude_api_key` (str, **Fernet-encrypted at rest** with an `enc:` prefix — same scheme as the SMTP password).
- **`ANTHROPIC_OK`** — module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths.
- **`_ner_claude(text, api_key)`** — calls `claude-haiku-4-5-20251001` in 8 000-char chunks. Thread-safe cache keyed by `hash(text)`, evicts oldest when > 2 000 entries.
- **Always read the key via `app_config.get_claude_api_key()`** — it decrypts and transparently handles legacy plaintext. Never read `config.json["claude_api_key"]` directly; `save_claude_config()` writes it encrypted.
- **`GET/POST /api/settings/claude`** — GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes key). POST accepts `{"enabled": bool, "api_key": "..."}` — omitting `api_key` leaves stored key unchanged.
- **`POST /api/settings/claude/test`** — minimal 8-token API call; returns `{"ok": true}` or `{"ok": false, "error": "..."}`.
- **Do not import `anthropic` at module level outside `document_scanner.py`** — `routes/app_routes.py` imports it locally inside the function body so the server starts without the package.
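A sketch of the chunking and bounded cache that `_ner_claude` is described as using. The `extract` callable is a stand-in for the actual Claude API call, and the `OrderedDict`-based eviction is an assumption about how "evicts oldest" might be implemented.

```python
import threading
from collections import OrderedDict

_CHUNK_SIZE = 8000   # characters per request, per the note above
_MAX_CACHE = 2000    # entries kept before the oldest are evicted

_cache = OrderedDict()
_cache_lock = threading.Lock()


def chunk_text(text: str, size: int = _CHUNK_SIZE) -> list:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def cached_ner(text: str, extract) -> list:
    """Run `extract` over each chunk once per distinct text and cache the result."""
    key = hash(text)
    with _cache_lock:
        if key in _cache:
            return _cache[key]
    entities = []
    for chunk in chunk_text(text):
        entities.extend(extract(chunk))
    with _cache_lock:
        _cache[key] = entities
        while len(_cache) > _MAX_CACHE:
            _cache.popitem(last=False)  # drop the oldest entry first
    return entities
```

Note that `hash(text)` is salted per process for strings, so a cache keyed this way cannot be persisted across restarts, and fixed-size chunks may split an entity at a boundary.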
## Software update — routes/updates.py
- **Git-checkout only** — `_supported()` requires a `.git` dir and not `sys.frozen`. The frozen desktop build gets `{"supported": false}` and the UI hides the Settings group.
- **`POST /api/update/apply`** — stash-if-dirty → `merge --ff-only origin/<branch>` → pip install only if `requirements.txt` changed → audit `app_update` → `_schedule_restart()` re-execs the process via `os.execv` (same PID; works under systemd and `start_gdpr.sh`). Refuses with `code: "scan_running"` (409) while `state._scan_lock` or `state._google_scan_lock` is held.
- **`apply_update()` never restarts itself** — callers decide. Tests patch `_schedule_restart`; the auto-update thread calls `_restart_self()` directly.
- **Auto-update thread** — `start_auto_update_thread()` called from `gdpr_scanner.py` `__main__`. Hourly tick, applies at most once per 24 h when `config.json["auto_update"]` is true; skips (and retries next tick) while a scan runs.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the same logic; keep stash/ff-only/requirements behaviour in sync.
## Viewer mode — routes/viewer.py
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form → 403. Never skip this order.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}`, `{"user": [...], "display_name": "..."}`, or `{}` in `viewer_tokens.json`. Enforced server-side in `GET /api/db/flagged`. **Column name is `user_role`** — do not use `role`.
- **`session["viewer_scope"]`** — set at `/view` token validation. `GET /api/db/flagged` reads `session.get("viewer_scope", {})` — defaults to `{}` (unrestricted) for PIN-authenticated sessions.
- **`viewer_tokens.json` format** — `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Old bare-list format handled transparently. Do not write as bare list.
- **Rate-limit state** (`_pin_attempts` dict) — in-memory only, resets on server restart. Intentional.
- **User-scoped tokens** — `scope.user` always a list; legacy single-string coerced on read. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400.
- **Date-range scoping** — `valid_from`/`valid_to` (YYYY-MM-DD) in scope dict; filtered via lexicographic string comparison in `GET /api/db/flagged`. Server validates format and enforces `valid_from ≤ valid_to`.
- **`app.secret_key`** — derived from `machine_id` bytes so sessions survive restarts. Set once at startup; do not override.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs intentionally keep `127.0.0.1`.
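The date-range scoping rule from the list above can be sketched as below. `validate_range` and `in_scope` are hypothetical names, and the `date_only` switch is an addition for illustration: plain string comparison only behaves like a date comparison when both sides are bare `YYYY-MM-DD` values, so a full timestamp on the `valid_to` day sorts after the bound.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_range(valid_from: str = "", valid_to: str = "") -> None:
    """Reject malformed dates and inverted ranges; empty strings mean 'no bound'."""
    for value in (valid_from, valid_to):
        if value:
            if not _DATE_RE.match(value):
                raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
            date.fromisoformat(value)  # raises ValueError for e.g. month 13
    if valid_from and valid_to and valid_from > valid_to:
        raise ValueError("valid_from must not be after valid_to")


def in_scope(modified: str, valid_from: str = "", valid_to: str = "",
             date_only: bool = False) -> bool:
    """Lexicographic range check on an ISO-formatted `modified` value."""
    value = (modified or "")[:10] if date_only else (modified or "")
    if valid_from and value < valid_from:
        return False
    if valid_to and value > valid_to:
        return False
    return True
```

Whether the stored `modified` column holds a bare date or a full timestamp is not stated here, so the `date_only=True` variant is only relevant if it turns out to be the latter.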
## Gotchas ## Gotchas
- **`_load_settings()` return** — does NOT include `file_sources`. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to. - **`_load_settings()` return** — does NOT include `file_sources`. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to.

@ -72,6 +72,50 @@ def get_lang_json():
return jsonify(state.LANG) return jsonify(state.LANG)
@bp.route("/api/audit_log")
def audit_log_list():
"""Return recent compliance audit log entries."""
try:
from gdpr_db import get_db as _get_db
limit = min(int(request.args.get("limit", 200)), 1000)
action = request.args.get("action") or None
return jsonify(_get_db().get_audit_log(limit=limit, action=action))
except Exception as e:
return jsonify({"error": str(e)}), 500
@bp.route("/api/settings/claude", methods=["GET", "POST"])
def claude_settings():
from app_config import get_claude_config, save_claude_config
if request.method == "GET":
return jsonify(get_claude_config())
data = request.get_json(silent=True) or {}
api_key = data.get("api_key") # None = keep existing key
if api_key == "":
api_key = None # empty string = don't change
save_claude_config(bool(data.get("enabled", False)), api_key)
return jsonify({"ok": True})
@bp.route("/api/settings/claude/test", methods=["POST"])
def claude_test():
from app_config import get_claude_api_key
api_key = get_claude_api_key()
if not api_key:
return jsonify({"ok": False, "error": "No API key saved"}), 400
try:
import anthropic
client = anthropic.Anthropic(api_key=api_key)
client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=8,
messages=[{"role": "user", "content": "Hi"}],
)
return jsonify({"ok": True})
except Exception as e:
return jsonify({"ok": False, "error": str(e)}), 400
@bp.route("/manual") @bp.route("/manual")
def manual(): def manual():
"""Serve the user manual as a styled, printable HTML page. """Serve the user manual as a styled, printable HTML page.

@ -11,11 +11,12 @@ from checkpoint import _clear_checkpoint, _DELTA_PATH
from cpr_detector import _extract_exif, _html_esc, _placeholder_svg from cpr_detector import _extract_exif, _html_esc, _placeholder_svg
try: try:
from gdpr_db import get_db as _get_db from gdpr_db import get_db as _get_db, log_audit_event as _audit
DB_OK = True DB_OK = True
except ImportError: except ImportError:
DB_OK = False DB_OK = False
def _get_db(*a, **kw): return None # type: ignore[misc] def _get_db(*a, **kw): return None # type: ignore[misc]
def _audit(*a, **kw): pass # type: ignore[misc]
try: try:
import document_scanner as _ds # noqa: F401 import document_scanner as _ds # noqa: F401
@ -140,6 +141,9 @@ def db_set_disposition():
notes = data.get("notes", ""), notes = data.get("notes", ""),
reviewed_by = data.get("reviewed_by", ""), reviewed_by = data.get("reviewed_by", ""),
) )
_audit("disposition",
f"item_id={item_id!r} status={data.get('status','')!r}",
ip=request.remote_addr or "")
return jsonify({"status": "saved"}) return jsonify({"status": "saved"})
@ -160,6 +164,9 @@ def db_set_disposition_bulk():
legal_basis=data.get("legal_basis", ""), legal_basis=data.get("legal_basis", ""),
notes=data.get("notes", ""), notes=data.get("notes", ""),
reviewed_by=data.get("reviewed_by", "")) reviewed_by=data.get("reviewed_by", ""))
_audit("disposition_bulk",
f"count={len(item_ids)} status={status!r}",
ip=request.remote_addr or "")
return jsonify({"saved": len(item_ids)}) return jsonify({"saved": len(item_ids)})
@ -173,7 +180,11 @@ def db_get_disposition(item_id):
@bp.route("/api/db/flagged") @bp.route("/api/db/flagged")
def db_flagged_items(): def db_flagged_items():
"""Return flagged items from the most recent completed scan session. """Return flagged items for the results grid.
With ?ref=N, returns the items from that specific past scan session (history
mode). Without ref, returns every item still awaiting action across all
scans (the default landing view), not just the latest session window.
Used by the read-only viewer to load results without an active SSE connection. Used by the read-only viewer to load results without an active SSE connection.
Respects viewer_scope.role stored in the session for scoped tokens. Respects viewer_scope.role stored in the session for scoped tokens.
""" """
@ -181,6 +192,8 @@ def db_flagged_items():
from flask import session as _session from flask import session as _session
scope = _session.get("viewer_scope", {}) scope = _session.get("viewer_scope", {})
role_filt = scope.get("role", "") if isinstance(scope, dict) else "" role_filt = scope.get("role", "") if isinstance(scope, dict) else ""
date_from = scope.get("valid_from", "") if isinstance(scope, dict) else ""
date_to = scope.get("valid_to", "") if isinstance(scope, dict) else ""
# user may be a list of emails (current) or a legacy single string # user may be a list of emails (current) or a legacy single string
raw_user = scope.get("user", "") if isinstance(scope, dict) else "" raw_user = scope.get("user", "") if isinstance(scope, dict) else ""
if isinstance(raw_user, list): if isinstance(raw_user, list):
@ -188,7 +201,13 @@ def db_flagged_items():
else: else:
user_filt = {raw_user.lower()} if raw_user else set() user_filt = {raw_user.lower()} if raw_user else set()
ref_scan_id = request.args.get("ref", type=int) ref_scan_id = request.args.get("ref", type=int)
if ref_scan_id:
# History mode — a specific past session was requested.
items = _get_db().get_session_items(ref_scan_id=ref_scan_id) items = _get_db().get_session_items(ref_scan_id=ref_scan_id)
else:
# Default landing / viewer — show every item still awaiting action,
# across all scans, not just the latest session window.
items = _get_db().get_open_items()
# Normalise JSON-encoded columns the same way scan_engine does for SSE cards # Normalise JSON-encoded columns the same way scan_engine does for SSE cards
import json as _json import json as _json
out = [] out = []
@ -197,6 +216,26 @@ def db_flagged_items():
continue continue
if user_filt and (row.get("account_id", "") or "").lower() not in user_filt: if user_filt and (row.get("account_id", "") or "").lower() not in user_filt:
continue continue
if date_from and (row.get("modified") or "") < date_from:
continue
if date_to and (row.get("modified") or "") > date_to:
continue
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None)
out.append(row)
return jsonify(out)
@bp.route("/api/db/related/<item_id>")
def db_related_items(item_id):
"""Return flagged items from the same session sharing at least one CPR hash."""
if not DB_OK:
return jsonify([])
ref = request.args.get("ref", type=int)
import json as _json
out = []
for row in _get_db().get_related_items(item_id, ref_scan_id=ref):
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", []) row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {}) row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None) row.pop("exif_json", None)
@ -259,10 +298,13 @@ def admin_pin_set():
new_pin = data.get("new_pin", "").strip() new_pin = data.get("new_pin", "").strip()
if not new_pin: if not new_pin:
return jsonify({"error": "new_pin required"}), 400 return jsonify({"error": "new_pin required"}), 400
if _admin_pin_is_set(): had_pin = _admin_pin_is_set()
if had_pin:
if not _verify_admin_pin(data.get("current_pin", "")): if not _verify_admin_pin(data.get("current_pin", "")):
return jsonify({"error": "incorrect_pin"}), 403 return jsonify({"error": "incorrect_pin"}), 403
_set_admin_pin(new_pin) _set_admin_pin(new_pin)
_audit("admin_pin_change" if had_pin else "admin_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
@ -328,6 +370,29 @@ def db_import():
return jsonify({"error": str(e)}), 500 return jsonify({"error": str(e)}), 500
def _excerpt_page(excerpt: str, item_meta: dict) -> str:
"""Minimal HTML page showing a stored body excerpt as a preview fallback."""
import html as _html
subject = _html.escape(item_meta.get("name", ""))
modified = item_meta.get("modified", "")
account = _html.escape(item_meta.get("account_name", ""))
body = "<pre style='white-space:pre-wrap;font-family:sans-serif;margin:0'>" + _html.escape(excerpt) + "</pre>"
note = "<p style='font-size:11px;color:#888;margin-top:12px'>Stored excerpt — connect to reload the full message.</p>"
return (
"<!DOCTYPE html><html><head><meta charset='utf-8'>"
"<style>body{font-family:-apple-system,sans-serif;font-size:13px;"
"padding:12px 16px;background:#fff;color:#111;word-break:break-word}"
".hdr{border-bottom:1px solid #eee;margin-bottom:12px;padding-bottom:10px}"
".hdr-row{color:#555;font-size:12px;margin-bottom:3px}"
".hdr-row b{color:#111}</style></head><body>"
f"<div class='hdr'>"
+ (f"<div class='hdr-row'><b>From:</b> {account}</div>" if account else "")
+ (f"<div class='hdr-row'><b>Date:</b> {_html.escape(modified)}</div>" if modified else "")
+ (f"<div class='hdr-row'><b>Subject:</b> {subject}</div>" if subject else "")
+ f"</div>{body}{note}</body></html>"
)
@bp.route("/api/preview/<item_id>") @bp.route("/api/preview/<item_id>")
def get_preview(item_id): def get_preview(item_id):
"""Return a preview URL or HTML for a flagged item.""" """Return a preview URL or HTML for a flagged item."""
@ -520,14 +585,17 @@ def get_preview(item_id):
except Exception as e: except Exception as e:
return jsonify({"error": str(e)}) return jsonify({"error": str(e)})
if not state.connector:
return jsonify({"error": "not authenticated"}), 401
item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), {}) item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), {})
drive_id = item_meta.get("drive_id", "") drive_id = item_meta.get("drive_id", "")
try: try:
if source_type == "email": if source_type == "email":
excerpt = item_meta.get("body_excerpt", "")
if not state.connector:
if excerpt:
import html as _html
return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
return jsonify({"error": "not authenticated"}), 401
uid = account_id uid = account_id
try: try:
msg = state.connector._get( msg = state.connector._get(
@ -535,6 +603,8 @@ def get_preview(item_id):
{"$select": "subject,from,receivedDateTime,body"} {"$select": "subject,from,receivedDateTime,body"}
) )
except Exception as e: except Exception as e:
if excerpt:
return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
return jsonify({"error": f"Could not load email: {e}"}) return jsonify({"error": f"Could not load email: {e}"})
sender = msg.get("from", {}).get("emailAddress", {}) sender = msg.get("from", {}).get("emailAddress", {})
@ -592,8 +662,51 @@ def get_preview(item_id):
</body></html>""" </body></html>"""
return jsonify({"type": "html", "html": page}) return jsonify({"type": "html", "html": page})
elif source_type in ("gmail", "gdrive"):
item_url = item_meta.get("url", "")
name = item_meta.get("name", "")
if source_type == "gdrive" and item_url:
# Extract Drive file ID and use the embeddable /preview URL
import re as _re
m = _re.search(r"/file/d/([^/]+)", item_url)
if m:
fid = m.group(1)
return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"})
# Fallback: generic Drive embed
return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")})
# Gmail — not embeddable; show link card + stored body excerpt if available
icon = "✉️" if source_type == "gmail" else "☁️"
label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
excerpt = item_meta.get("body_excerpt", "")
link_html = (
f'<a href="{_html_esc(item_url)}" target="_blank" '
f'style="display:inline-block;margin-top:12px;padding:8px 16px;'
f'background:#3b7dd8;color:#fff;border-radius:6px;text-decoration:none;font-size:12px">'
f'{label}</a>'
) if item_url else ""
if excerpt and source_type == "gmail":
html_out = _excerpt_page(excerpt, item_meta)
if item_url:
# Inject the "Open in Gmail" link before </body>
html_out = html_out.replace(
"</body>",
f'<div style="margin-top:12px">{link_html}</div></body>'
)
else:
html_out = (
f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
f'<div style="font-size:40px">{icon}</div>'
f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
f'{link_html}'
f'</div>'
)
return jsonify({"type": "html", "html": html_out})
else: else:
# OneDrive / SharePoint / Teams — use Graph's embed preview API # OneDrive / SharePoint / Teams — use Graph's embed preview API
if not state.connector:
return jsonify({"error": "not authenticated"}), 401
preview_url = None preview_url = None
errors = [] errors = []
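
The Google Drive branch added to `get_preview` above turns a Drive share link into an embeddable `/preview` URL. A minimal standalone sketch of that step follows; the function name `drive_preview_url` is hypothetical and does not appear in the diff.

```python
import re


def drive_preview_url(item_url: str) -> str:
    """Return an embeddable /preview URL for a Google Drive link.

    Hypothetical helper: it mirrors the logic in the diff but is not part of it.
    """
    # Links shaped like .../file/d/<file_id>/view carry the file ID in the path.
    match = re.search(r"/file/d/([^/]+)", item_url)
    if match:
        return f"https://drive.google.com/file/d/{match.group(1)}/preview"
    # Fallback used by the diff: swap "/view" for "/preview" in the raw URL.
    return item_url.replace("/view", "/preview")


print(drive_preview_url("https://drive.google.com/file/d/abc123/view?usp=sharing"))
```

Note that the fallback replaces every `/view` substring, so it only behaves well for URLs where `/view` appears once.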


@ -5,6 +5,10 @@ from __future__ import annotations
from flask import Blueprint, jsonify, request from flask import Blueprint, jsonify, request
from routes import state from routes import state
from app_config import _load_smtp_config, _save_smtp_config from app_config import _load_smtp_config, _save_smtp_config
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
from routes.export import _build_excel_bytes from routes.export import _build_excel_bytes
bp = Blueprint("email", __name__) bp = Blueprint("email", __name__)
@ -119,6 +123,7 @@ def smtp_config_save():
if not data.get("password") and existing.get("password"): if not data.get("password") and existing.get("password"):
data["password"] = existing["password"] data["password"] = existing["password"]
_save_smtp_config(data) _save_smtp_config(data)
_audit("smtp_save", f"host={data.get('host','')!r}", ip=request.remote_addr or "")
return jsonify({"status": "saved"}) return jsonify({"status": "saved"})
@ -143,8 +148,12 @@ def smtp_test():
"</body></html>" "</body></html>"
) )
- # Try Graph API first
- if state.connector and state.connector.is_authenticated():
+ # Try Graph API first — unless the user opted to always use SMTP. Graph
+ # returns 202 (queued) even for recipients Exchange later silently drops
+ # (e.g. a Google-hosted subdomain of the O365 domain), so SMTP is the only
+ # reliable path for those; prefer_smtp forces it.
+ prefer_smtp = bool(saved.get("prefer_smtp"))
+ if state.connector and state.connector.is_authenticated() and not prefer_smtp:
try: try:
_send_email_graph(subject, body_html, recipients) _send_email_graph(subject, body_html, recipients)
return jsonify({"ok": True, "method": "graph", "recipients": recipients}) return jsonify({"ok": True, "method": "graph", "recipients": recipients})
@ -280,8 +289,8 @@ def send_report():
"</body></html>" "</body></html>"
) )
- # Try Graph API first
- if state.connector and state.connector.is_authenticated():
+ # Try Graph API first — unless prefer_smtp is set (see smtp_test for why).
+ if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
try: try:
_send_email_graph(subject, body_html, recipients, _send_email_graph(subject, body_html, recipients,
attachment_bytes=xl_bytes, attachment_name=fname) attachment_bytes=xl_bytes, attachment_name=fname)
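
The `prefer_smtp` change in `smtp_test` and `send_report` comes down to one decision: use Graph only when the connector is authenticated and the user has not forced SMTP. A rough sketch of that decision in isolation is below. `choose_transport` is a hypothetical name, and the sketch assumes (the hunks do not show it) that SMTP is what runs when the Graph path is skipped.

```python
def choose_transport(graph_authenticated: bool, smtp_cfg: dict) -> str:
    """Pick the first transport to try for an outgoing report.

    Hypothetical helper; the diff inlines this condition in each route.
    """
    # A missing or falsy "prefer_smtp" key keeps the old Graph-first behaviour.
    prefer_smtp = bool(smtp_cfg.get("prefer_smtp"))
    if graph_authenticated and not prefer_smtp:
        return "graph"
    return "smtp"


print(choose_transport(True, {"prefer_smtp": True}))
```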


@ -9,11 +9,12 @@ from routes import state
from app_config import _GUID_RE, _resolve_display_name from app_config import _GUID_RE, _resolve_display_name
try: try:
- from gdpr_db import get_db as _get_db
+ from gdpr_db import get_db as _get_db, log_audit_event as _audit
DB_OK = True DB_OK = True
except ImportError: except ImportError:
DB_OK = False DB_OK = False
def _get_db(*a, **kw): return None # type: ignore[misc] def _get_db(*a, **kw): return None # type: ignore[misc]
def _audit(*a, **kw): pass # type: ignore[misc]
try: try:
from m365_connector import M365PermissionError from m365_connector import M365PermissionError
@ -44,6 +45,7 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
"gdrive": ("💾 Google Drive", "D5F5E3"), "gdrive": ("💾 Google Drive", "D5F5E3"),
"local": ("📁 Local", "E6F7E6"), "local": ("📁 Local", "E6F7E6"),
"smb": ("🌐 Network", "E0F0FA"), "smb": ("🌐 Network", "E0F0FA"),
"sftp": ("🔒 SFTP", "EDE9F7"),
} }
COLS = [ COLS = [
("Name / Subject", 45), ("Name / Subject", 45),
@ -403,6 +405,7 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
"gdrive": "Google Drive", "gdrive": "Google Drive",
"local": "Local files", "local": "Local files",
"smb": "Network / SMB", "smb": "Network / SMB",
"sftp": "SFTP",
} }
# ── Colour palette ──────────────────────────────────────────────────────── # ── Colour palette ────────────────────────────────────────────────────────
@ -597,7 +600,7 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
r = p.add_run(txt); r.bold = True r = p.add_run(txt); r.bold = True
r.font.size = Pt(10); r.font.color.rgb = WHITE r.font.size = Pt(10); r.font.color.rgb = WHITE
- for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb"):
+ for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb", "sftp"):
if src_key not in scanned_sources: if src_key not in scanned_sources:
continue continue
src_items = by_source.get(src_key, []) src_items = by_source.get(src_key, [])
@ -1156,6 +1159,7 @@ def export_article30():
return jsonify({"error": str(e)}), 500 return jsonify({"error": str(e)}), 500
@bp.route("/api/delete_item", methods=["POST"])
def delete_item(): def delete_item():
"""Delete a single flagged item. Returns {ok, error}.""" """Delete a single flagged item. Returns {ok, error}."""
if not state.connector: if not state.connector:
@ -1188,6 +1192,9 @@ def delete_item():
reason="manual") reason="manual")
_db.delete_item_record(item_id) _db.delete_item_record(item_id)
except Exception: pass except Exception: pass
_audit("item_delete",
f"id={item_id!r} name={item_meta.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
return jsonify({"ok": False, "error": "Delete returned unexpected result"}) return jsonify({"ok": False, "error": "Delete returned unexpected result"})
except M365PermissionError: except M365PermissionError:
@ -1198,6 +1205,502 @@ def delete_item():
return jsonify({"ok": False, "error": str(e)}) return jsonify({"ok": False, "error": str(e)})
_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt", ".pdf"}
_M365_CLOUD_TYPES = {"onedrive", "sharepoint", "teams"}
_GDRIVE_MIME_MAP = {
".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
".pdf": "application/pdf",
}
_ALL_REDACTABLE_TYPES = {"local", "smb", "sftp", "gdrive"} | _M365_CLOUD_TYPES
@bp.route("/api/redact_item", methods=["POST"])
def redact_item():
"""Redact CPR numbers in-place in a local, SMB, SFTP, M365, or Google Drive file."""
from pathlib import Path as _Path
import tempfile as _tempfile
import shutil as _shutil
data = request.get_json() or {}
item_id = data.get("id", "")
if not item_id:
return jsonify({"ok": False, "error": "id required"}), 400
# Resolve item meta: in-memory first (active scan), then DB (history)
item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None)
if item_meta is None:
_db = _get_db() if DB_OK else None
if _db:
row = _db._connect().execute(
"SELECT * FROM flagged_items WHERE id=? LIMIT 1", (item_id,)
).fetchone()
item_meta = dict(row) if row else {}
else:
item_meta = {}
source_type = item_meta.get("source_type", "")
is_m365_cloud = source_type in _M365_CLOUD_TYPES
if source_type not in _ALL_REDACTABLE_TYPES:
return jsonify({"ok": False, "error": "Redaction is only supported for local, SMB, SFTP, M365, and Google Drive files"}), 400
# --- local path branch ---
if source_type == "local":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400
path = _Path(full_path).expanduser()
if not path.exists():
return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404
ext = path.suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
tmp_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp:
tmp_path = _Path(tmp.name)
if ext == ".docx":
results = scan_docx(path)
redacted = redact_docx(path, tmp_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(path)
redacted = redact_xlsx(path, tmp_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(path, tmp_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(path)
redacted = redact_pdf_secure(path, tmp_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
else: # .txt
text = path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
tmp_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_shutil.move(str(tmp_path), str(path))
tmp_path = None
except Exception as exc:
if tmp_path and tmp_path.exists():
try:
tmp_path.unlink()
except Exception:
pass
logger.exception("[redact] local file error")
return jsonify({"ok": False, "error": str(exc)}), 500
# --- M365 cloud branch (OneDrive / SharePoint / Teams) ---
elif is_m365_cloud:
conn = state.connector
if conn is None:
return jsonify({"ok": False, "error": "M365 not connected — cannot redact cloud files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _REDACT_EXTS - {".csv", ".txt"}:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} cloud files. Supported: DOCX, XLSX, PDF"}), 400
drive_id = item_meta.get("drive_id") or item_meta.get("_drive_id", "")
account_id = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = None
try:
# Download
if drive_id:
raw = conn.download_sharepoint_item(drive_id, item_id)
elif account_id and account_id != "me":
raw = conn.download_drive_item_for(account_id, item_id)
else:
raw = conn.download_drive_item(item_id)
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
# Upload redacted bytes back
redacted_bytes = out_path.read_bytes()
conn.put_drive_item_content(drive_id, item_id, redacted_bytes, user_id=account_id)
del redacted_bytes
except Exception as exc:
logger.exception("[redact] cloud file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for p in ("tmp_path", "out_path"):
_p = locals().get(p)
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- Google Drive branch ---
elif source_type == "gdrive":
gconn = state.google_connector
if gconn is None:
return jsonify({"ok": False, "error": "Google not connected — cannot redact Drive files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _GDRIVE_MIME_MAP:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} Drive files. Supported: DOCX, XLSX, PDF"}), 400
# item_id is "gdrive:{file_id}"
gfile_id = item_id[len("gdrive:"):] if item_id.startswith("gdrive:") else item_id
user_email = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = out_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
from google_connector import GoogleError as _GoogleError
# Refuse Google-native formats (Docs/Sheets exported as DOCX)
try:
mime = gconn.get_drive_file_mime(user_email, gfile_id)
except Exception as exc:
return jsonify({"ok": False, "error": f"Could not read Drive file info: {exc}"}), 500
if mime.startswith("application/vnd.google-apps."):
return jsonify({"ok": False, "error": (
"Cannot redact a Google Docs/Sheets/Slides file in-place. "
"Export it as DOCX/XLSX/PDF first, then redact the exported copy."
)}), 400
raw = gconn.download_drive_file_by_id(user_email, gfile_id)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
redacted_bytes = out_path.read_bytes()
gconn.update_drive_file(user_email, gfile_id, redacted_bytes, _GDRIVE_MIME_MAP[ext])
del redacted_bytes
except Exception as exc:
logger.exception("[redact] gdrive file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SFTP branch ---
elif source_type == "sftp":
full_path = item_meta.get("full_path", "")
source_uri = item_meta.get("account_name", "") # sftp://user@host/root_path
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SFTP redaction"}), 400
if not source_uri:
return jsonify({"ok": False, "error": "SFTP source info not in memory — rescan and redact in the same session"}), 400
ext = _Path(full_path).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse sftp://user@host/root to find matching source config
try:
from urllib.parse import urlparse as _urlparse
_u = _urlparse(source_uri)
_sftp_host = _u.hostname or ""
_sftp_user = _u.username or ""
except Exception:
_sftp_host = _sftp_user = ""
from app_config import _load_file_sources, _resolve_sftp_credentials
_sftp_source = next(
(s for s in _load_file_sources()
if s.get("source_type") == "sftp"
and s.get("sftp_host", "") == _sftp_host
and s.get("sftp_user", "") == _sftp_user),
None,
)
if _sftp_source is None:
return jsonify({"ok": False, "error": f"SFTP source config not found for {_sftp_host} — rescan to enable redaction"}), 400
_sftp_source = _resolve_sftp_credentials(_sftp_source)
tmp_path = out_path = None
try:
from sftp_connector import SFTPScanner as _SFTPScanner
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
_sftp = _SFTPScanner(
host=_sftp_source.get("sftp_host", ""),
root_path=_sftp_source.get("path", "/"),
username=_sftp_source.get("sftp_user", ""),
port=int(_sftp_source.get("sftp_port", 22)),
auth_type=_sftp_source.get("sftp_auth", "password"),
password=_sftp_source.get("sftp_password") or None,
key_path=_sftp_source.get("sftp_key_path") or None,
passphrase=_sftp_source.get("sftp_passphrase") or None,
)
raw = _sftp.read_file(full_path)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_sftp.write_file(full_path, out_path.read_bytes())
except Exception as exc:
logger.exception("[redact] sftp file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SMB branch ---
elif source_type == "smb":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SMB redaction"}), 400
ext = _Path(full_path.replace("\\", "/").split("/")[-1]).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse //host/share/... to find matching source config
_norm = full_path.replace("\\", "/").lstrip("/")
_parts = _norm.split("/", 2)
_smb_host_fp = _parts[0] if len(_parts) > 0 else ""
from app_config import _load_file_sources
from file_scanner import get_smb_password as _get_smb_pw
_smb_source = next(
(s for s in _load_file_sources()
if s.get("source_type", "smb") in ("smb", "")
and (s.get("smb_host", "") == _smb_host_fp
or s.get("path", "").replace("\\", "/").lstrip("/").split("/")[0] == _smb_host_fp)),
None,
)
if _smb_source is None:
return jsonify({"ok": False, "error": f"SMB source config not found for {_smb_host_fp}"}), 400
_smb_user = _smb_source.get("smb_user", "")
_smb_domain = _smb_source.get("smb_domain", "")
_smb_kc = _smb_source.get("keychain_key") or None
_smb_pw = _smb_source.get("smb_password") or _get_smb_pw(_smb_host_fp, _smb_user, _smb_kc) or ""
tmp_path = out_path = None
try:
from file_scanner import write_smb_file as _write_smb
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
# Download current content
from file_scanner import _smb_read_file as _smb_read, SMB_OK as _SMB_OK
if not _SMB_OK:
raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
import uuid as _uuid
from smbprotocol.connection import Connection as _SmbConn
from smbprotocol.session import Session as _SmbSession
from smbprotocol.tree import TreeConnect as _SmbTree
_norm2 = full_path.replace("\\", "/").lstrip("/")
_fp = _norm2.split("/", 2)
_fhost = _fp[0]; _fshare = _fp[1] if len(_fp) > 1 else ""
_frel = (_fp[2].replace("/", "\\")) if len(_fp) > 2 else ""
_smb_conn = _SmbConn(_uuid.uuid4(), _fhost, 445)
_smb_conn.connect(timeout=30)
try:
_smb_sess = _SmbSession(_smb_conn,
username=f"{_smb_domain}\\{_smb_user}" if _smb_domain else _smb_user,
password=_smb_pw, require_encryption=False)
_smb_sess.connect()
try:
_smb_tree = _SmbTree(_smb_sess, f"\\\\{_fhost}\\{_fshare}")
_smb_tree.connect()
try:
raw = _smb_read(_smb_tree, _frel)
finally:
_smb_tree.disconnect()
finally:
_smb_sess.disconnect()
finally:
_smb_conn.disconnect()
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_write_smb(full_path, out_path.read_bytes(), _smb_user, _smb_pw, _smb_domain)
except Exception as exc:
logger.exception("[redact] smb file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- shared: remove from grid + DB ---
state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id]
_db = _get_db() if DB_OK else None
if _db:
try:
_db.log_deletion(item_meta, reason="redacted")
_db.delete_item_record(item_id)
except Exception:
pass
_audit("item_redact",
f"id={item_id!r} name={item_meta.get('name','')!r} spans={redacted}",
ip=request.remote_addr or "")
logger.info("[redact] %s: %d CPR span(s) redacted", item_meta.get('name', item_id), redacted)
return jsonify({"ok": True, "redacted": redacted})
@bp.route("/api/delete_bulk", methods=["POST"]) @bp.route("/api/delete_bulk", methods=["POST"])
def delete_bulk(): def delete_bulk():
"""Delete multiple items matching criteria. Streams progress as SSE.""" """Delete multiple items matching criteria. Streams progress as SSE."""
@ -1257,6 +1760,7 @@ def delete_bulk():
return jsonify({ return jsonify({
"ok": True, "ok": True,
"deleted": len(deleted_ids), "deleted": len(deleted_ids),
"deleted_ids": deleted_ids, # so the grid can mark exactly these
"failed": len(failed_items), "failed": len(failed_items),
"errors": failed_items[:10], # cap error list "errors": failed_items[:10], # cap error list
}) })
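
The `.txt` branches of `redact_item` above remove CPR matches by blanking character ranges in a list copy of the text. The sketch below isolates that step. `remove_spans` is a hypothetical name, and the spans are assumed to be `(start, end, label)` tuples, as the diff's use of `find_pii_spans_in_text` suggests; the sample number is made up.

```python
def remove_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Delete each [start, end) range from text and return the result.

    Hypothetical helper mirroring the .txt branch; spans must not overlap.
    """
    chars = list(text)
    # Each slot in the range becomes an empty string, so the list keeps its
    # length and the remaining indices stay valid while iterating.
    for start, end, _label in sorted(spans, reverse=True):
        chars[start:end] = [""] * (end - start)
    return "".join(chars)


sample = "CPR 010190-1234 end"
begin = sample.index("010190-1234")
print(remove_spans(sample, [(begin, begin + 11, "CPR")]))
```

This deletes the matched characters outright rather than masking them, so surrounding whitespace is left as it was.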


@ -141,8 +141,13 @@ def _run_google_scan(options: dict):
scan_body = bool(scan_opts.get("scan_body", True)) scan_body = bool(scan_opts.get("scan_body", True))
scan_att = bool(scan_opts.get("scan_attachments", True)) scan_att = bool(scan_opts.get("scan_attachments", True))
delta_enabled = bool(scan_opts.get("delta", False)) delta_enabled = bool(scan_opts.get("delta", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
- from checkpoint import _load_delta_tokens, _save_delta_tokens
+ from checkpoint import (_load_delta_tokens, _save_delta_tokens,
+ _save_checkpoint, _load_checkpoint, _clear_checkpoint)
_drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {} _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
_new_drive_tokens: dict = {} _new_drive_tokens: dict = {}
@ -193,14 +198,45 @@ def _run_google_scan(options: dict):
except Exception as e: except Exception as e:
logger.error("[google_scan] begin_scan failed: %s", e) logger.error("[google_scan] begin_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted Google scan ────────────
import hashlib as _hl, json as _js
_gck_prefix = "google"
_gck_key = _hl.sha256(_js.dumps({
"emails": sorted(user_emails),
"sources": sorted(sources),
"older_than_days": scan_opts.get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
_gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
_g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
_google_flagged: list = [] # items found by this Google scan (for checkpoint)
_gck_resumed = len(_g_scanned_ids)
if _gck:
from scan_engine import _with_disposition as _wd_ck
_google_flagged = list(_gck.get("flagged", []))
flagged_items.extend(_google_flagged)
broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
for _card in _google_flagged:
broadcast("scan_file_flagged", _wd_ck(_card, _db))
_GCHECKPOINT_SAVE_EVERY = 25
_g_items_since_save = 0
total_flagged = 0 total_flagged = 0
total_scanned = 0 total_scanned = 0
t_start = _time.monotonic() t_start = _time.monotonic()
def _check_abort(): def _check_abort():
- from gdpr_scanner import _scan_abort as _sa
- if _sa.is_set():
- broadcast("scan_cancelled", {"completed": total_scanned})
+ if _scan_abort.is_set():
+ # Emit google_scan_done (not scan_cancelled) so that the frontend
+ # google_scan_done handler can decide whether to close the SSE based
+ # on whether other scan types (M365, file) are still running.
+ # scan_cancelled would unconditionally close the SSE connection,
+ # dropping events from a concurrently running new scan.
+ broadcast("google_scan_done", {
+ "flagged_count": total_flagged,
+ "total_scanned": total_scanned,
+ "elapsed_seconds": round(_time.monotonic() - t_start, 1),
+ "cancelled": True,
+ })
return True return True
return False return False
@ -212,6 +248,8 @@ def _run_google_scan(options: dict):
"source": item_meta.get("_source", ""), "source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""), "source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs), "cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("_url", ""), "url": item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1), "size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10], "modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -228,8 +266,10 @@ def _run_google_scan(options: dict):
"special_category": [], "special_category": [],
"face_count": 0, "face_count": 0,
"exif": {}, "exif": {},
"body_excerpt": item_meta.get("_body_excerpt", ""),
} }
flagged_items.append(card) flagged_items.append(card)
_google_flagged.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db)) broadcast("scan_file_flagged", _with_disposition(card, _db))
total_flagged += 1 total_flagged += 1
if _db and _db_scan_id: if _db and _db_scan_id:
@ -261,6 +301,10 @@ def _run_google_scan(options: dict):
): ):
if _check_abort(): if _check_abort():
return return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1 total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")}) broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", { broadcast("scan_progress", {
@ -272,14 +316,33 @@ def _run_google_scan(options: dict):
}) })
try: try:
meta["_account"] = _display_name meta["_account"] = _display_name
- result = _scan_bytes(data, meta.get("name", "msg.txt"))
+ meta["_source_type"] = "gmail"
+ # Extract a plain-text excerpt before scanning (body is discarded after)
+ try:
+ import re as _re
+ _raw = data[:3000].decode("utf-8", errors="replace")
+ _plain = _re.sub(r"<[^>]+>", " ", _raw)
+ meta["_body_excerpt"] = " ".join(_plain.split())[:500]
+ except Exception:
+ meta["_body_excerpt"] = ""
+ result = _scan_bytes(data, meta.get("name", "msg.txt"), lang=ocr_lang)
except Exception as e: except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)}) broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue continue
cprs = result.get("cprs", []) cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts") pii_counts = result.get("pii_counts")
- if cprs or (pii_counts and any(pii_counts.values())):
+ _em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
+ _ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
+ if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
+ meta["_email_count"] = len(_em)
+ meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts) _broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e: except GoogleError as e:
broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)}) broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
except Exception as e: except Exception as e:
@ -302,23 +365,31 @@ def _run_google_scan(options: dict):
except Exception as delta_err: except Exception as delta_err:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"}) broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err) logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
- drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb))
+ # Record start token BEFORE iterating so the next delta starts from here
try: try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception: except Exception:
pass pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
else: else:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"}) broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
- drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb))
+ # Record start token BEFORE iterating so the next delta starts from here
if delta_enabled: if delta_enabled:
try: try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception: except Exception:
pass pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
for meta, data in drive_items: for meta, data in drive_items:
if _check_abort(): if _check_abort():
return return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1 total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")}) broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", { broadcast("scan_progress", {
@ -330,14 +401,25 @@ def _run_google_scan(options: dict):
}) })
try: try:
meta["_account"] = _display_name meta["_account"] = _display_name
- result = _scan_bytes(data, meta.get("name", "file"))
+ meta["_source_type"] = "gdrive"
+ result = _scan_bytes(data, meta.get("name", "file"), lang=ocr_lang)
except Exception as e: except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)}) broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue continue
cprs = result.get("cprs", []) cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts") pii_counts = result.get("pii_counts")
if cprs or (pii_counts and any(pii_counts.values())): _em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts) _broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e: except GoogleError as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)}) broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
except Exception as e: except Exception as e:
@ -350,6 +432,9 @@ def _run_google_scan(options: dict):
except Exception as e: except Exception as e:
logger.warning("[gdrive delta] token save failed: %s", e) logger.warning("[gdrive delta] token save failed: %s", e)
if not _scan_abort.is_set():
_clear_checkpoint(prefix=_gck_prefix)
elapsed = _time.monotonic() - t_start elapsed = _time.monotonic() - t_start
broadcast("google_scan_done", { broadcast("google_scan_done", {
"flagged_count": total_flagged, "flagged_count": total_flagged,

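Review note on the Drive loop above: replacing `list(conn.iter_drive_files(...))` with the bare generator is what lets `_check_abort()` take effect between items. The sketch below illustrates the difference under stated assumptions. `iter_files`, `scan`, and the `abort` event are hypothetical stand-ins, not the project's connector or abort helper.

```python
import threading
from typing import Iterator, Tuple

abort = threading.Event()   # stand-in for the scanner's abort flag (assumption)
fetched = []                # records which items the iterator actually produced


def iter_files(n: int) -> Iterator[Tuple[dict, bytes]]:
    """Hypothetical connector iterator: yields (meta, data) one item at a time."""
    for i in range(n):
        fetched.append(i)   # in a real connector this would be a download
        yield {"id": str(i)}, b""


def scan(items) -> int:
    """Consume items, checking the abort flag before processing each one."""
    scanned = 0
    for _meta, _data in items:
        if abort.is_set():
            return scanned
        scanned += 1
        if scanned == 3:
            abort.set()     # simulate the user pressing Stop after three items
    return scanned


# Lazy: the generator is only advanced as far as the loop gets.
scanned_lazy = scan(iter_files(1000))
fetched_lazy = len(fetched)

# Eager: list() drains the iterator before the first abort check can run.
fetched.clear()
abort.clear()
scanned_eager = scan(list(iter_files(1000)))
fetched_eager = len(fetched)
```

In the lazy case only one item beyond the abort point is produced. In the eager case every item is fetched up front even though the same three are scanned.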

@ -4,6 +4,10 @@ Scan profiles
from __future__ import annotations from __future__ import annotations
from flask import Blueprint, jsonify, request from flask import Blueprint, jsonify, request
from app_config import _profiles_load, _profile_save, _profile_delete, _profile_get from app_config import _profiles_load, _profile_save, _profile_delete, _profile_get
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("profiles", __name__) bp = Blueprint("profiles", __name__)
@ -21,6 +25,8 @@ def profiles_save():
if not profile.get("name"): if not profile.get("name"):
return jsonify({"error": "name required"}), 400 return jsonify({"error": "name required"}), 400
saved = _profile_save(profile) saved = _profile_save(profile)
_audit("profile_save", f"name={profile.get('name')!r}",
ip=request.remote_addr or "")
return jsonify({"status": "saved", "profile": saved}) return jsonify({"status": "saved", "profile": saved})
@ -32,6 +38,8 @@ def profiles_delete():
if not key: if not key:
return jsonify({"error": "name or id required"}), 400 return jsonify({"error": "name or id required"}), 400
ok = _profile_delete(key) ok = _profile_delete(key)
if ok:
_audit("profile_delete", f"key={key!r}", ip=request.remote_addr or "")
return jsonify({"status": "deleted" if ok else "not_found"}) return jsonify({"status": "deleted" if ok else "not_found"})
@ -43,5 +51,3 @@ def profiles_get():
if not p: if not p:
return jsonify({"error": "not found"}), 404 return jsonify({"error": "not found"}), 404
return jsonify({"profile": p}) return jsonify({"profile": p})


@ -13,12 +13,17 @@ from app_config import (
) )
from checkpoint import ( from checkpoint import (
_checkpoint_key, _load_checkpoint, _clear_checkpoint, _checkpoint_key, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _DELTA_PATH, _load_delta_tokens, _DELTA_PATH, _cp_path,
) )
bp = Blueprint("scan", __name__) bp = Blueprint("scan", __name__)
_log = logging.getLogger(__name__) _log = logging.getLogger(__name__)
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
def _maybe_send_auto_email(): def _maybe_send_auto_email():
"""Send the scan report email after a manual scan if auto_email_manual is enabled.""" """Send the scan report email after a manual scan if auto_email_manual is enabled."""
@ -49,7 +54,7 @@ def _maybe_send_auto_email():
"</body></html>" "</body></html>"
) )
if state.connector and state.connector.is_authenticated(): if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
try: try:
_send_email_graph(subject, body_html, recipients, _send_email_graph(subject, body_html, recipients,
attachment_bytes=xl_bytes, attachment_name=fname) attachment_bytes=xl_bytes, attachment_name=fname)
@ -71,8 +76,12 @@ def scan_status():
acquired = state._scan_lock.acquire(blocking=False) acquired = state._scan_lock.acquire(blocking=False)
if acquired: if acquired:
state._scan_lock.release() state._scan_lock.release()
g_acquired = state._google_scan_lock.acquire(blocking=False)
if g_acquired:
state._google_scan_lock.release()
return jsonify({ return jsonify({
"running": not acquired, "running": not acquired, # M365 + file scan lock
"google_running": not g_acquired, # Google scan lock (separate)
"scan_id": _sse_mod._current_scan_id or None, "scan_id": _sse_mod._current_scan_id or None,
}) })
@ -108,12 +117,17 @@ def scan_start():
finally: finally:
state._scan_lock.release() state._scan_lock.release()
threading.Thread(target=_run, daemon=True).start() threading.Thread(target=_run, daemon=True).start()
_audit("scan_start",
f"sources={options.get('sources',[])} profile_id={profile_id!r}",
ip=request.remote_addr or "")
return jsonify({"status": "started"}) return jsonify({"status": "started"})
@bp.route("/api/scan/stop", methods=["POST"]) @bp.route("/api/scan/stop", methods=["POST"])
def scan_stop(): def scan_stop():
state._scan_abort.set() state._scan_abort.set()
state._google_scan_abort.set()
_audit("scan_stop", "", ip=request.remote_addr or "")
return jsonify({"status": "stopping"}) return jsonify({"status": "stopping"})
@ -121,28 +135,80 @@ def scan_stop():
def scan_checkpoint_info(): def scan_checkpoint_info():
"""Return info about any saved checkpoint for the given scan options. """Return info about any saved checkpoint for the given scan options.
If check_only=true, just reports whether a scan is currently running.""" If check_only=true, just reports whether a scan is currently running."""
import hashlib, json as _json
options = request.get_json() or {} options = request.get_json() or {}
if options.get("check_only"): if options.get("check_only"):
acquired = state._scan_lock.acquire(blocking=False) acquired = state._scan_lock.acquire(blocking=False)
if acquired: if acquired:
state._scan_lock.release() state._scan_lock.release()
return jsonify({"running": not acquired}) return jsonify({"running": not acquired})
engines = {}
# M365
if options.get("sources"):
key = _checkpoint_key(options) key = _checkpoint_key(options)
cp = _load_checkpoint(key) cp = _load_checkpoint(key, prefix="m365")
if not cp: if cp:
return jsonify({"exists": False}) engines["m365"] = {
return jsonify({
"exists": True, "exists": True,
"scanned_count": len(cp.get("scanned_ids", [])), "scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])), "flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"), "started_at": cp.get("meta", {}).get("started_at"),
}
# Google
google_emails = options.get("googleUserEmails", [])
google_sources = options.get("googleSources", [])
if google_emails and google_sources:
gkey = hashlib.sha256(_json.dumps({
"emails": sorted(google_emails),
"sources": sorted(google_sources),
"older_than_days": options.get("options", {}).get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
cp = _load_checkpoint(gkey, prefix="google")
if cp:
engines["google"] = {
"exists": True,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
}
# File sources (one checkpoint per source ID)
for src_id in options.get("fileSources", []):
fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
if cp:
fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
fe["scanned_count"] += len(cp.get("scanned_ids", []))
fe["flagged_count"] += len(cp.get("flagged", []))
if not fe["started_at"]:
fe["started_at"] = cp.get("meta", {}).get("started_at")
if not engines:
return jsonify({"exists": False})
started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
return jsonify({
"exists": True,
"scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
"flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
"started_at": min(started_ats) if started_ats else None,
"engines": engines,
}) })
@bp.route("/api/scan/clear_checkpoint", methods=["POST"]) @bp.route("/api/scan/clear_checkpoint", methods=["POST"])
def scan_clear_checkpoint(): def scan_clear_checkpoint():
"""Discard any saved checkpoint so the next scan starts fresh.""" """Discard all saved checkpoints so the next scan starts fresh."""
_clear_checkpoint() from pathlib import Path
data_dir = Path.home() / ".gdprscanner"
for f in data_dir.glob("checkpoint_*.json"):
try:
f.unlink()
except Exception:
pass
return jsonify({"status": "cleared"}) return jsonify({"status": "cleared"})
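Review note on the Google branch of `scan_checkpoint_info` above: the key is a truncated SHA-256 over a canonical JSON payload, so it only finds a checkpoint if the scan engine derives its key from exactly the same fields. A minimal sketch of that derivation, assuming the same three fields as the route; the function name `google_checkpoint_key` and the example source names are illustrative, not taken from the codebase.

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0) -> str:
    """Derive a short, order-insensitive key from the scan selection.

    Sorting the lists and using sort_keys=True makes the JSON canonical,
    so the same selection always hashes to the same 16 hex characters.
    """
    payload = {
        "emails": sorted(emails),
        "sources": sorted(sources),
        "older_than_days": older_than_days,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]


# Same selection in a different order gives the same key.
k1 = google_checkpoint_key(["b@x.dk", "a@x.dk"], ["gmail", "gdrive"])
k2 = google_checkpoint_key(["a@x.dk", "b@x.dk"], ["gdrive", "gmail"])
# A different selection or age filter gives a different key.
k3 = google_checkpoint_key(["a@x.dk"], ["gdrive"], older_than_days=30)
```

Keeping this derivation in one shared helper, rather than repeating it in the route and in the scan engine, would prevent the two copies from drifting apart.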


@ -4,6 +4,10 @@ Scheduler API routes — multi-job CRUD, status, history, run-now.
from __future__ import annotations from __future__ import annotations
from flask import Blueprint, jsonify, request from flask import Blueprint, jsonify, request
import sys, os, threading import sys, os, threading
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("scheduler", __name__) bp = Blueprint("scheduler", __name__)
@ -52,6 +56,9 @@ def scheduler_jobs_save():
_sched().reload() _sched().reload()
except Exception: except Exception:
pass pass
_audit("scheduler_job_save",
f"id={job_id!r} name={jobs[i].get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": jobs[i]}) return jsonify({"ok": True, "job": jobs[i]})
# New job # New job
job = sm._new_job(data) job = sm._new_job(data)
@ -61,6 +68,9 @@ def scheduler_jobs_save():
_sched().reload() _sched().reload()
except Exception: except Exception:
pass pass
_audit("scheduler_job_save",
f"id={job.get('id','')!r} name={job.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": job}) return jsonify({"ok": True, "job": job})
except Exception as e: except Exception as e:
import traceback import traceback
@ -81,6 +91,7 @@ def scheduler_jobs_delete():
_sched().reload() _sched().reload()
except Exception: except Exception:
pass pass
_audit("scheduler_job_delete", f"id={job_id!r}", ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
except Exception as e: except Exception as e:
import traceback import traceback


@ -3,9 +3,15 @@ File sources and file scan
""" """
from __future__ import annotations from __future__ import annotations
import threading import threading
import uuid as _uuid
from pathlib import Path
from flask import Blueprint, jsonify, request from flask import Blueprint, jsonify, request
from routes import state from routes import state
from app_config import _load_file_sources, _save_file_sources from app_config import _load_file_sources, _save_file_sources, _SFTP_KEYS_DIR
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
try: try:
from file_scanner import store_smb_password, SMB_OK as _SMB_OK from file_scanner import store_smb_password, SMB_OK as _SMB_OK
@ -15,6 +21,12 @@ except ImportError:
_SMB_OK = False _SMB_OK = False
def store_smb_password(*a, **kw): return False # type: ignore[misc] def store_smb_password(*a, **kw): return False # type: ignore[misc]
try:
from sftp_connector import store_sftp_password, SFTP_OK as _SFTP_OK
except ImportError:
_SFTP_OK = False
def store_sftp_password(*a, **kw): return False # type: ignore[misc]
bp = Blueprint("sources", __name__) bp = Blueprint("sources", __name__)
@ -25,6 +37,7 @@ def file_sources_list():
return jsonify({ return jsonify({
"sources": sources, "sources": sources,
"smb_available": _SMB_OK, "smb_available": _SMB_OK,
"sftp_available": _SFTP_OK,
"scanner_ok": _FILE_SCANNER_OK, "scanner_ok": _FILE_SCANNER_OK,
}) })
@ -32,61 +45,156 @@ def file_sources_list():
@bp.route("/api/file_sources/save", methods=["POST"]) @bp.route("/api/file_sources/save", methods=["POST"])
def file_sources_save(): def file_sources_save():
"""Add or update a file source. Assigns a UUID if id is missing.""" """Add or update a file source. Assigns a UUID if id is missing."""
import uuid as _uuid
data = request.get_json() or {} data = request.get_json() or {}
path = data.get("path", "").strip() source_type = data.get("source_type", "")
if not path:
# Validate required fields per source type
if source_type == "sftp":
if not data.get("sftp_host", "").strip():
return jsonify({"error": "sftp_host required"}), 400
if not data.get("sftp_user", "").strip():
return jsonify({"error": "sftp_user required"}), 400
if not data.get("path", "").strip():
data["path"] = "/"
else:
if not data.get("path", "").strip():
return jsonify({"error": "path required"}), 400 return jsonify({"error": "path required"}), 400
sources = _load_file_sources() sources = _load_file_sources()
uid = data.get("id") or "" uid = data.get("id") or ""
for i, s in enumerate(sources): for i, s in enumerate(sources):
if s.get("id") == uid: if s.get("id") == uid:
sources[i] = {**s, **data} sources[i] = {**s, **data}
_save_file_sources(sources) _save_file_sources(sources)
_audit("source_update",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": sources[i]}) return jsonify({"ok": True, "source": sources[i]})
data["id"] = data.get("id") or str(_uuid.uuid4()) data["id"] = data.get("id") or str(_uuid.uuid4())
sources.append(data) sources.append(data)
_save_file_sources(sources) _save_file_sources(sources)
_audit("source_add",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": data}) return jsonify({"ok": True, "source": data})
@bp.route("/api/file_sources/delete", methods=["POST"]) @bp.route("/api/file_sources/delete", methods=["POST"])
def file_sources_delete(): def file_sources_delete():
"""Remove a file source by id.""" """Remove a file source by id. Also deletes any associated SFTP key file."""
uid = (request.get_json() or {}).get("id", "") uid = (request.get_json() or {}).get("id", "")
if not uid: if not uid:
return jsonify({"error": "id required"}), 400 return jsonify({"error": "id required"}), 400
sources = [s for s in _load_file_sources() if s.get("id") != uid] sources = _load_file_sources()
deleted = next((s for s in sources if s.get("id") == uid), None)
sources = [s for s in sources if s.get("id") != uid]
_save_file_sources(sources) _save_file_sources(sources)
if deleted:
_audit("source_delete",
f"name={deleted.get('name','')!r} type={deleted.get('source_type','local')!r}",
ip=request.remote_addr or "")
# Clean up key file if this was an SFTP key-auth source
if deleted and deleted.get("sftp_key_path"):
key_file = Path(deleted["sftp_key_path"])
if key_file.parent == _SFTP_KEYS_DIR and key_file.exists():
try:
key_file.unlink()
except OSError:
pass
return jsonify({"ok": True}) return jsonify({"ok": True})
@bp.route("/api/file_sources/store_creds", methods=["POST"]) @bp.route("/api/file_sources/store_creds", methods=["POST"])
def file_sources_store_creds(): def file_sources_store_creds():
"""Store SMB password in the OS keychain.""" """Store SMB or SFTP password/passphrase in the OS keychain."""
data = request.get_json() or {}
source_type = data.get("source_type", "smb")
password = data.get("password", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
host = data.get("sftp_host", "")
user = data.get("sftp_user", "")
if not user or not password:
return jsonify({"error": "sftp_user and password required"}), 400
key = data.get("keychain_key") or f"sftp:{user}@{host}"
ok = store_sftp_password(host, user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
else:
if not _FILE_SCANNER_OK: if not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503 return jsonify({"error": "file_scanner not available"}), 503
data = request.get_json() or {}
smb_host = data.get("smb_host", "") smb_host = data.get("smb_host", "")
smb_user = data.get("smb_user", "") smb_user = data.get("smb_user", "")
password = data.get("password", "")
key = data.get("keychain_key") or smb_user
if not smb_user or not password: if not smb_user or not password:
return jsonify({"error": "smb_user and password required"}), 400 return jsonify({"error": "smb_user and password required"}), 400
key = data.get("keychain_key") or smb_user
ok = store_smb_password(smb_host, smb_user, password, key) ok = store_smb_password(smb_host, smb_user, password, key)
if ok: if ok:
return jsonify({"ok": True, "keychain_key": key}) return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500 return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
@bp.route("/api/file_sources/upload_key", methods=["POST"])
def file_sources_upload_key():
"""Accept an SSH private key file upload and store it in the SFTP keys directory.
Validates the file is a recognised private key format before saving.
Returns {"key_id": uuid, "key_path": absolute_path}.
"""
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
if "key_file" not in request.files:
return jsonify({"error": "key_file required"}), 400
file = request.files["key_file"]
raw = file.read(65536) # 64 KB is more than enough for any private key
# Validate before saving — try loading the key material with paramiko
import io
import paramiko
loaded = False
for cls in (paramiko.RSAKey, paramiko.Ed25519Key, paramiko.ECDSAKey, paramiko.DSSKey):
try:
cls.from_private_key(io.BytesIO(raw))
loaded = True
break
except (paramiko.ssh_exception.SSHException, Exception):
continue
if not loaded:
# Might be passphrase-protected — still accept it; validation will happen at connect time
if b"-----BEGIN" not in raw and b"OPENSSH PRIVATE KEY" not in raw:
return jsonify({"error": "File does not appear to be a private key"}), 400
key_id = str(_uuid.uuid4())
key_path = _SFTP_KEYS_DIR / key_id
key_path.write_bytes(raw)
key_path.chmod(0o600)
return jsonify({"ok": True, "key_id": key_id, "key_path": str(key_path)})
@bp.route("/api/file_scan/start", methods=["POST"]) @bp.route("/api/file_scan/start", methods=["POST"])
def file_scan_start(): def file_scan_start():
"""Start a file system scan for a single file source.""" """Start a file system scan for a single file source (local, SMB, or SFTP)."""
if not _FILE_SCANNER_OK: source = request.get_json() or {}
source_type = source.get("source_type", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
elif not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503 return jsonify({"error": "file_scanner not available"}), 503
if not state._scan_lock.acquire(blocking=False): if not state._scan_lock.acquire(blocking=False):
return jsonify({"error": "scan already running"}), 409 return jsonify({"error": "scan already running"}), 409
source = request.get_json() or {}
state._scan_abort.clear() state._scan_abort.clear()
def _run(): def _run():

routes/updates.py Normal file

@ -0,0 +1,216 @@
"""
Software update routes: check origin for new commits, apply the update,
and an optional auto-update background thread.
Only available when running from a git checkout; the frozen desktop
build (PyInstaller) reports supported=False and the UI hides the group.
Applying an update fast-forwards to origin/<branch>, reinstalls
dependencies if requirements.txt changed, then re-execs the process so
the new code is loaded. Local edits are stashed (kept), never discarded.
"""
from __future__ import annotations
import os
import subprocess
import sys
import threading
import time
from pathlib import Path
from flask import Blueprint, jsonify, request
from routes import state
from app_config import get_update_config, save_update_config
bp = Blueprint("updates", __name__)
_REPO_DIR = Path(__file__).parent.parent
_GIT_TIMEOUT = 30
_AUTO_CHECK_INTERVAL = 24 * 3600 # auto-update checks once per day
_last_auto_check = [0.0]
def _supported() -> bool:
return (not getattr(sys, "frozen", False)) and (_REPO_DIR / ".git").exists()
def _git(*args: str, timeout: int = _GIT_TIMEOUT) -> subprocess.CompletedProcess:
return subprocess.run(
["git", *args], cwd=_REPO_DIR,
capture_output=True, text=True, timeout=timeout,
)
def _scan_running() -> bool:
return state._scan_lock.locked() or state._google_scan_lock.locked()
def check_for_update() -> dict:
"""Fetch origin and compare HEAD against the tracked branch."""
if not _supported():
return {"supported": False}
try:
branch = _git("rev-parse", "--abbrev-ref", "HEAD").stdout.strip() or "main"
fetch = _git("fetch", "origin", branch, timeout=60)
if fetch.returncode != 0:
return {"supported": True, "error": fetch.stderr.strip()[:300] or "git fetch failed"}
local = _git("rev-parse", "HEAD").stdout.strip()
remote = _git("rev-parse", f"origin/{branch}").stdout.strip()
except (subprocess.TimeoutExpired, OSError) as e:
return {"supported": True, "error": str(e)[:300]}
info = {
"supported": True, "branch": branch,
"current": local[:7], "latest": remote[:7],
"up_to_date": local == remote, "commits": [],
}
if local != remote:
lg = _git("log", "--oneline", f"HEAD..origin/{branch}")
info["commits"] = lg.stdout.strip().splitlines()[:20]
return info
def apply_update() -> dict:
"""Fast-forward to origin/<branch>; returns {"ok", "updated", ...}.
Does NOT restart the process; callers decide (the route schedules a
re-exec, the auto-update thread restarts directly).
"""
chk = check_for_update()
if not chk.get("supported"):
return {"ok": False, "code": "unsupported",
"error": "Updates require running from a git checkout."}
if chk.get("error"):
return {"ok": False, "code": "check_failed", "error": chk["error"]}
if chk.get("up_to_date"):
return {"ok": True, "updated": False, "current": chk["current"]}
if _scan_running():
return {"ok": False, "code": "scan_running",
"error": "Cannot update while a scan is running."}
branch = chk["branch"]
try:
if _git("diff-index", "--quiet", "HEAD", "--").returncode != 0:
_git("stash", "push", "-m",
"auto-stash before update " + time.strftime("%Y-%m-%d %H:%M:%S"))
reqs_changed = _git(
"diff", "--quiet", f"HEAD..origin/{branch}", "--", "requirements.txt"
).returncode != 0
merge = _git("merge", "--ff-only", f"origin/{branch}")
if merge.returncode != 0:
return {"ok": False, "code": "merge_failed",
"error": (merge.stderr.strip() or "git merge failed")[:300]}
if reqs_changed:
subprocess.run(
[sys.executable, "-m", "pip", "install", "-q", "-r",
str(_REPO_DIR / "requirements.txt")],
cwd=_REPO_DIR, capture_output=True, timeout=600,
)
except (subprocess.TimeoutExpired, OSError) as e:
return {"ok": False, "code": "apply_failed", "error": str(e)[:300]}
try:
from gdpr_db import log_audit_event as _audit
_audit("app_update", f"{chk['current']} -> {chk['latest']}",
ip=(request.remote_addr if request else ""))
except Exception:
pass
return {"ok": True, "updated": True,
"from": chk["current"], "to": chk["latest"]}
def _mark_fds_cloexec() -> None:
"""Mark every fd above stderr close-on-exec.
Werkzeug calls ``srv.socket.set_inheritable(True)`` unconditionally
(for its debug reloader), so without this the listening socket leaks
into the exec'd process: it sits on the port as a zombie listener no
one accepts from, the port probe sees the port as busy, and the new
server hops to port+1 while clients hang against the dead socket.
"""
try:
fds = [int(f) for f in os.listdir("/proc/self/fd")] # Linux
except (OSError, ValueError):
fds = list(range(3, 4096))
for fd in fds:
if fd > 2:
try:
os.set_inheritable(fd, False)
except OSError:
pass
def _restart_self() -> None:
"""Re-exec the current process so the updated code is loaded.
Keeps the same PID, so it works both under systemd and when launched
manually via start_gdpr.sh.
"""
_mark_fds_cloexec()
try:
os.execv(sys.executable, [sys.executable] + sys.argv)
except OSError:
# Last resort: exit and rely on a supervisor (systemd Restart=) to
# bring the app back up.
os._exit(0)
def _schedule_restart(delay: float = 1.5) -> None:
def _later():
time.sleep(delay)
_restart_self()
threading.Thread(target=_later, daemon=True, name="update-restart").start()
# ── Routes ────────────────────────────────────────────────────────────────────
@bp.route("/api/update/check")
def update_check():
return jsonify(check_for_update())
@bp.route("/api/update/apply", methods=["POST"])
def update_apply():
res = apply_update()
if res.get("updated"):
res["restarting"] = True
_schedule_restart()
return jsonify(res), (200 if res.get("ok") else 409)
@bp.route("/api/update/settings", methods=["GET", "POST"])
def update_settings():
if request.method == "GET":
return jsonify({"supported": _supported(), **get_update_config()})
data = request.get_json(silent=True) or {}
save_update_config(bool(data.get("auto_update", False)))
return jsonify({"ok": True})
# ── Auto-update background thread ─────────────────────────────────────────────
def _auto_update_loop() -> None:
while True:
time.sleep(3600)
try:
if not get_update_config().get("auto_update"):
continue
if time.time() - _last_auto_check[0] < _AUTO_CHECK_INTERVAL:
continue
_last_auto_check[0] = time.time()
if _scan_running():
_last_auto_check[0] = 0.0 # retry on the next hourly tick
continue
res = apply_update()
if res.get("updated"):
print(f" Auto-update: {res['from']} -> {res['to']} — restarting")
_restart_self()
except Exception:
pass
def start_auto_update_thread() -> bool:
"""Called once at startup from gdpr_scanner.py. No-op for frozen builds."""
if not _supported():
return False
threading.Thread(target=_auto_update_loop, daemon=True, name="auto-update").start()
return True
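Review note on `_mark_fds_cloexec` above: the docstring's argument rests on the inheritable flag of the listening socket. The sketch below shows that flag being set and then cleared with the same `os.set_inheritable` call the new module uses. It assumes a POSIX system; `mark_fds_cloexec` here takes an explicit fd list for illustration and is not the module's function.

```python
import os
import socket


def mark_fds_cloexec(fds) -> None:
    """Clear the inheritable flag on every fd above stderr.

    Closed or invalid descriptors raise OSError, which is ignored.
    """
    for fd in fds:
        if fd > 2:
            try:
                os.set_inheritable(fd, False)
            except OSError:
                pass


srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.set_inheritable(True)          # mimics the behaviour the docstring attributes to Werkzeug
before = srv.get_inheritable()     # socket would survive os.execv()

mark_fds_cloexec([srv.fileno()])
after = srv.get_inheritable()      # socket is now closed automatically on exec

srv.close()
```

With the flag cleared, the kernel closes the descriptor during `os.execv`, so the re-exec'd process can bind the original port again.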


@ -19,6 +19,10 @@ from app_config import (
verify_interface_pin, verify_interface_pin,
clear_interface_pin, clear_interface_pin,
) )
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("viewer", __name__) bp = Blueprint("viewer", __name__)
@ -97,13 +101,30 @@ def create_token():
return jsonify({"error": "scope.role must be '', 'student', or 'staff'"}), 400 return jsonify({"error": "scope.role must be '', 'student', or 'staff'"}), 400
if user_emails and not all("@" in e for e in user_emails): if user_emails and not all("@" in e for e in user_emails):
return jsonify({"error": "scope.user entries must be valid email addresses"}), 400 return jsonify({"error": "scope.user entries must be valid email addresses"}), 400
valid_from = str(raw_scope.get("valid_from", "")).strip()
valid_to = str(raw_scope.get("valid_to", "")).strip()
from datetime import datetime as _dt
for _d, _lbl in ((valid_from, "valid_from"), (valid_to, "valid_to")):
if _d:
try:
_dt.strptime(_d, "%Y-%m-%d")
except ValueError:
return jsonify({"error": f"scope.{_lbl} must be YYYY-MM-DD"}), 400
if valid_from and valid_to and valid_from > valid_to:
return jsonify({"error": "scope.valid_from must be ≤ scope.valid_to"}), 400
if user_emails: if user_emails:
scope = {"user": user_emails, "display_name": display_name or user_emails[0]} scope = {"user": user_emails, "display_name": display_name or user_emails[0]}
elif role: elif role:
scope = {"role": role} scope = {"role": role}
else: else:
scope = {} scope = {}
if valid_from:
scope["valid_from"] = valid_from
if valid_to:
scope["valid_to"] = valid_to
entry = create_viewer_token(label=label, expires_days=expires_days, scope=scope) entry = create_viewer_token(label=label, expires_days=expires_days, scope=scope)
_audit("token_create", f"label={label!r} scope={scope}",
ip=request.remote_addr or "")
return jsonify(entry), 201 return jsonify(entry), 201
@ -114,6 +135,7 @@ def delete_token(token: str):
removed = revoke_viewer_token(token) removed = revoke_viewer_token(token)
if not removed: if not removed:
return jsonify({"error": "token not found"}), 404 return jsonify({"error": "token not found"}), 404
_audit("token_revoke", f"token={token[:8]}...", ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
@ -147,10 +169,13 @@ def pin_set():
return jsonify({"error": "pin required"}), 400 return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8): if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400 return jsonify({"error": "PIN must be 4–8 digits"}), 400
if get_viewer_pin_hash(): had_pin = bool(get_viewer_pin_hash())
if had_pin:
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()): if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403 return jsonify({"error": "current PIN is incorrect"}), 403
set_viewer_pin(new_pin) set_viewer_pin(new_pin)
_audit("viewer_pin_change" if had_pin else "viewer_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
@ -162,6 +187,7 @@ def pin_clear():
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()): if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403 return jsonify({"error": "current PIN is incorrect"}), 403
clear_viewer_pin() clear_viewer_pin()
_audit("viewer_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
@ -185,10 +211,13 @@ def interface_pin_set():
return jsonify({"error": "pin required"}), 400 return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8): if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400 return jsonify({"error": "PIN must be 4–8 digits"}), 400
if get_interface_pin_hash(): had_ipin = bool(get_interface_pin_hash())
if had_ipin:
if not verify_interface_pin(str(body.get("current_pin", "")).strip()): if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403 return jsonify({"error": "current PIN is incorrect"}), 403
set_interface_pin(new_pin) set_interface_pin(new_pin)
_audit("interface_pin_change" if had_ipin else "interface_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
@ -200,6 +229,7 @@ def interface_pin_clear():
if not verify_interface_pin(str(body.get("current_pin", "")).strip()): if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403 return jsonify({"error": "current PIN is incorrect"}), 403
clear_interface_pin() clear_interface_pin()
_audit("interface_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True}) return jsonify({"ok": True})
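Review note on the `valid_from` / `valid_to` check added to `create_token` above: the diff validates each value with `strptime` and then compares the raw strings. `strptime("%Y-%m-%d")` also accepts values that are not zero-padded, such as `2026-6-1`, and those no longer sort correctly as strings. The sketch below deliberately swaps in a comparison of the parsed dates instead; `validate_range` is a hypothetical helper, not code from the PR.

```python
from datetime import datetime
from typing import Optional


def validate_range(valid_from: str, valid_to: str) -> Optional[str]:
    """Return an error message, or None when the range is acceptable.

    Empty strings mean "no bound" and are skipped, as in the route.
    """
    parsed = {}
    for value, label in ((valid_from, "valid_from"), (valid_to, "valid_to")):
        if value:
            try:
                parsed[label] = datetime.strptime(value, "%Y-%m-%d").date()
            except ValueError:
                return f"scope.{label} must be YYYY-MM-DD"
    # Compare date objects, not strings, so unpadded input cannot mis-sort.
    if len(parsed) == 2 and parsed["valid_from"] > parsed["valid_to"]:
        return "scope.valid_from must be ≤ scope.valid_to"
    return None


ok = validate_range("2026-01-01", "2026-12-31")
reversed_range = validate_range("2026-12-31", "2026-01-01")
bad_format = validate_range("01/02/2026", "")
unpadded = validate_range("2026-6-1", "2026-12-01")
```

An alternative that keeps the string comparison would be to store the normalised `date.isoformat()` value in the scope rather than the raw input.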


@ -75,6 +75,12 @@ except ImportError:
FileScanner = None # type: ignore[assignment,misc] FileScanner = None # type: ignore[assignment,misc]
FILE_SCANNER_OK = False FILE_SCANNER_OK = False
try:
from sftp_connector import SFTPScanner, SFTP_OK as _SFTP_OK
except ImportError:
SFTPScanner = None # type: ignore[assignment,misc]
_SFTP_OK = False
try: try:
import document_scanner as ds import document_scanner as ds
SCANNER_OK = True SCANNER_OK = True
@ -104,8 +110,8 @@ AUDIO_EXTS: set = set()
SUPPORTED_EXTS: set = set() SUPPORTED_EXTS: set = set()
# cpr_detector helpers — injected by gdpr_scanner.py # cpr_detector helpers — injected by gdpr_scanner.py
def _scan_bytes(content, filename, poppler_path=None): return {"cprs": [], "dates": []} # type: ignore[misc] def _scan_bytes(content, filename, poppler_path=None, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes_timeout(content, filename, timeout=60): return {"cprs": [], "dates": []} # type: ignore[misc] def _scan_bytes_timeout(content, filename, timeout=60, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _detect_photo_faces(content, filename): return 0 # type: ignore[misc] def _detect_photo_faces(content, filename): return 0 # type: ignore[misc]
def _extract_exif(content, filename): return {} # type: ignore[misc] def _extract_exif(content, filename): return {} # type: ignore[misc]
def _extract_video_metadata(content, filename): return {} # type: ignore[misc] def _extract_video_metadata(content, filename): return {} # type: ignore[misc]
@ -119,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
# checkpoint helpers — injected by gdpr_scanner.py # checkpoint helpers — injected by gdpr_scanner.py
def _checkpoint_key(opts): return "" # type: ignore[misc] def _checkpoint_key(opts): return "" # type: ignore[misc]
def _save_checkpoint(*a, **kw): pass # type: ignore[misc] def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
def _load_checkpoint(key): return None # type: ignore[misc] def _load_checkpoint(key, **kw): return None # type: ignore[misc]
def _clear_checkpoint(): pass # type: ignore[misc] def _clear_checkpoint(**kw): pass # type: ignore[misc]
def _load_delta_tokens(): return {} # type: ignore[misc] def _load_delta_tokens(): return {} # type: ignore[misc]
def _save_delta_tokens(t): pass # type: ignore[misc] def _save_delta_tokens(t): pass # type: ignore[misc]
@ -151,18 +157,21 @@ def _with_disposition(card: dict, db) -> dict:
def run_file_scan(source: dict): def run_file_scan(source: dict):
"""Scan a single local or SMB file source for CPR numbers and PII. """Scan a single local, SMB, or SFTP file source for CPR numbers and PII.
Reuses _scan_bytes, _broadcast_card, _check_special_category, Reuses _scan_bytes, _broadcast_card, _check_special_category,
_detect_photo_faces and all other existing scan helpers. _detect_photo_faces and all other existing scan helpers.
Args: Args:
source: file source dict with keys: source: file source dict with keys:
path, label, smb_host, smb_user, smb_domain, keychain_key, source_type ("local"|"smb"|"sftp"), path, label,
smb_host, smb_user, smb_domain, keychain_key,
sftp_host, sftp_port, sftp_user, sftp_auth, sftp_key_path,
scan_photos (bool), max_file_mb (int) scan_photos (bool), max_file_mb (int)
""" """
# state vars accessed via _state module # state vars accessed via _state module
source_kind = source.get("source_type", "")
path = source.get("path", "") path = source.get("path", "")
label = source.get("label") or path label = source.get("label") or path
smb_host = source.get("smb_host") or None smb_host = source.get("smb_host") or None
@ -173,9 +182,17 @@ def run_file_scan(source: dict):
scan_photos = bool(source.get("scan_photos", False)) scan_photos = bool(source.get("scan_photos", False))
skip_gps_images = bool(source.get("skip_gps_images", False)) skip_gps_images = bool(source.get("skip_gps_images", False))
min_cpr_count = max(1, int(source.get("min_cpr_count", 1))) min_cpr_count = max(1, int(source.get("min_cpr_count", 1)))
scan_emails = bool(source.get("scan_emails", False))
scan_phones = bool(source.get("scan_phones", False))
cpr_only = bool(source.get("cpr_only", False))
ocr_lang = str(source.get("ocr_lang", "dan+eng")) or "dan+eng"
max_mb = int(source.get("max_file_mb", 50)) max_mb = int(source.get("max_file_mb", 50))
if not FILE_SCANNER_OK: if source_kind == "sftp":
if not _SFTP_OK:
broadcast("scan_error", {"file": label, "error": "paramiko not installed — run: pip install paramiko"})
return
elif not FILE_SCANNER_OK:
broadcast("scan_error", {"file": label, "error": "file_scanner.py not found"}) broadcast("scan_error", {"file": label, "error": "file_scanner.py not found"})
return return
@ -194,12 +211,44 @@ def run_file_scan(source: dict):
except Exception as e: except Exception as e:
logger.error("[db] start_scan failed: %s", e) logger.error("[db] start_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted file scan ─────────────
_ck_prefix = f"file_{source.get('id', 'local')}"
_ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
_ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
_file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
_file_flagged: list = [] # items found by this file scan run (for checkpoint)
_ck_resumed = len(_file_scanned_ids)
if _ck:
_file_flagged = list(_ck.get("flagged", []))
for card in _file_flagged:
_state.flagged_items.append(card)
broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
for card in _file_flagged:
broadcast("scan_file_flagged", _with_disposition(card, _db))
_CHECKPOINT_SAVE_EVERY_FILE = 25
_file_items_since_save = 0
total_scanned = 0 total_scanned = 0
total_flagged = 0 total_flagged = 0
broadcast("scan_phase", {"phase": f"Files \u2014 {label}"}) broadcast("scan_phase", {"phase": f"Files \u2014 {label}"})
try: try:
if source_kind == "sftp":
fs = SFTPScanner(
host=source.get("sftp_host", ""),
root_path=path,
username=source.get("sftp_user", ""),
port=int(source.get("sftp_port", 22)),
auth_type=source.get("sftp_auth", "password"),
password=source.get("sftp_password") or None,
key_path=source.get("sftp_key_path") or None,
passphrase=source.get("sftp_passphrase") or None,
keychain_key=keychain_key,
max_file_bytes=max_mb * 1_048_576,
label=label,
)
else:
fs = FileScanner( fs = FileScanner(
path=path, path=path,
smb_host=smb_host, smb_host=smb_host,
@ -217,6 +266,10 @@ def run_file_scan(source: dict):
if _state._scan_abort.is_set(): if _state._scan_abort.is_set():
break break
if rel_path in _file_scanned_ids:
total_scanned += 1
continue
total_scanned += 1 total_scanned += 1
broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"}) broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
@ -235,12 +288,14 @@ def run_file_scan(source: dict):
result: dict = {"cprs": [], "dates": []} result: dict = {"cprs": [], "dates": []}
if ext not in PHOTO_EXTS and ext not in VIDEO_EXTS and ext not in AUDIO_EXTS: if ext not in PHOTO_EXTS and ext not in VIDEO_EXTS and ext not in AUDIO_EXTS:
try: try:
result = _scan_bytes_timeout(content, rel_path) result = _scan_bytes_timeout(content, rel_path, lang=ocr_lang)
except Exception as e: except Exception as e:
broadcast("scan_error", {"file": rel_path, "error": str(e)}) broadcast("scan_error", {"file": rel_path, "error": str(e)})
continue continue
cprs = result.get("cprs", []) cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# Photo / biometric scan + EXIF/video/audio metadata extraction # Photo / biometric scan + EXIF/video/audio metadata extraction
_face_count = 0 _face_count = 0
@ -257,11 +312,13 @@ def run_file_scan(source: dict):
# Apply filters: distinct CPR threshold and GPS suppression # Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs)) _distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and ( _exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author")) not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
) )
if not (_cpr_qualifies and cprs) and _face_count == 0 and not _exif_has_pii: if not (_cpr_qualifies and cprs) and (cpr_only or (not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii)):
continue continue
# Build card metadata # Build card metadata
@ -297,6 +354,8 @@ def run_file_scan(source: dict):
"source": label, "source": label,
"source_type": source_type, "source_type": source_type,
"cpr_count": len(cprs), "cpr_count": len(cprs),
"email_count": len(_distinct_emails),
"phone_count": len(_distinct_phones),
"url": "", "url": "",
"size_kb": meta["size_kb"], "size_kb": meta["size_kb"],
"modified": meta["modified"], "modified": meta["modified"],
@ -317,6 +376,7 @@ def run_file_scan(source: dict):
} }
_state.flagged_items.append(card) _state.flagged_items.append(card)
_file_flagged.append(card)
total_flagged += 1 total_flagged += 1
broadcast("scan_file_flagged", _with_disposition(card, _db)) broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -326,10 +386,19 @@ def run_file_scan(source: dict):
except Exception as e: except Exception as e:
logger.error("[db] save_item failed: %s", e) logger.error("[db] save_item failed: %s", e)
_file_scanned_ids.add(rel_path)
_file_items_since_save += 1
if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
_save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
_file_items_since_save = 0
except Exception as e: except Exception as e:
import traceback import traceback
broadcast("scan_error", {"file": label, "error": str(e)}) broadcast("scan_error", {"file": label, "error": str(e)})
logger.error("[file_scan] error:\n%s", traceback.format_exc()) logger.error("[file_scan] error:\n%s", traceback.format_exc())
else:
if not _state._scan_abort.is_set():
_clear_checkpoint(prefix=_ck_prefix)
finally: finally:
if _db and _db_scan_id: if _db and _db_scan_id:
try: try:
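
The resume behaviour added to `run_file_scan` in the hunk above (skip paths already recorded, save a checkpoint every 25 files) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the project's implementation: `scan_with_checkpoint`, `scan_one`, and `save` are hypothetical names, and the real code also carries flagged cards and scan metadata in each checkpoint.

```python
from __future__ import annotations

from typing import Callable, Iterable


def scan_with_checkpoint(
    paths: Iterable[str],
    scanned_ids: set[str],
    scan_one: Callable[[str], None],
    save: Callable[[set[str]], None],
    save_every: int = 25,
) -> int:
    """Scan each path not already in scanned_ids; return how many were scanned now."""
    since_save = 0
    scanned_now = 0
    for rel_path in paths:
        if rel_path in scanned_ids:
            continue  # finished during an earlier, interrupted run
        scan_one(rel_path)
        scanned_ids.add(rel_path)
        scanned_now += 1
        since_save += 1
        if since_save >= save_every:
            save(set(scanned_ids))  # persist a snapshot, then restart the counter
            since_save = 0
    return scanned_now
```

Saving on a fixed interval rather than after every file trades a small amount of repeated work after a crash for far fewer writes during a normal run.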
@ -409,6 +478,10 @@ def run_scan(options: dict):
scan_photos = bool(scan_opts.get("scan_photos", False)) # biometric photo scan (#9) scan_photos = bool(scan_opts.get("scan_photos", False)) # biometric photo scan (#9)
skip_gps_images= bool(scan_opts.get("skip_gps_images", False)) skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1))) min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1)))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
# Delta token state — loaded once, updated per-source, saved on completion # Delta token state — loaded once, updated per-source, saved on completion
delta_tokens: dict = _load_delta_tokens() if delta_enabled else {} delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@ -462,6 +535,8 @@ def run_scan(options: dict):
"source": item_meta.get("_source", ""), "source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""), "source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs), "cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("webUrl", "") or item_meta.get("_url", ""), "url": item_meta.get("webUrl", "") or item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1), "size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10], "modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -478,6 +553,7 @@ def run_scan(options: dict):
"special_category": item_meta.get("_special_category", []), "special_category": item_meta.get("_special_category", []),
"face_count": item_meta.get("_face_count", 0), "face_count": item_meta.get("_face_count", 0),
"exif": item_meta.get("_exif", {}), "exif": item_meta.get("_exif", {}),
"body_excerpt": item_meta.get("_body_excerpt", ""),
} }
_state.flagged_items.append(card) _state.flagged_items.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db)) broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -1002,6 +1078,14 @@ def run_scan(options: dict):
if _check_abort(): if _check_abort():
# Save checkpoint so scan can be resumed later # Save checkpoint so scan can be resumed later
_save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta) _save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
# Finalise the DB scan record so items found before the stop stay
# visible — this early return otherwise skips finish_scan below,
# stranding them (invisible to get_session_items / get_open_items).
if _db and _db_scan_id:
try:
_db.finish_scan(_db_scan_id, resumed_count + idx + 1)
except Exception as _e:
logger.error("[db] finish_scan (aborted) failed: %s", _e)
return return
idx += 1 idx += 1
kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately
@ -1029,11 +1113,17 @@ def run_scan(options: dict):
# Scan body — use pre-extracted text (body HTML was stripped at # Scan body — use pre-extracted text (body HTML was stripped at
# collection time to keep work_items memory footprint small) # collection time to keep work_items memory footprint small)
all_cprs = [] all_cprs = []
all_emails = []
all_phones = []
body_text = "" body_text = ""
if scan_email_body: if scan_email_body:
body_text = meta.pop("_precomputed_body", "") body_text = meta.pop("_precomputed_body", "")
body_result = _scan_text_direct(body_text) body_result = _scan_text_direct(body_text)
all_cprs = list(body_result.get("cprs", [])) all_cprs = list(body_result.get("cprs", []))
if scan_emails:
all_emails = list(body_result.get("emails", []))
if scan_phones:
all_phones = list(body_result.get("phones", []))
# Scan attachments # Scan attachments
uid = meta.get("_account_id", "me") uid = meta.get("_account_id", "me")
@ -1053,21 +1143,31 @@ def run_scan(options: dict):
try: try:
att_bytes = (conn.download_attachment_for(uid, msg_id, att["id"]) att_bytes = (conn.download_attachment_for(uid, msg_id, att["id"])
if uid != "me" else conn.download_attachment(msg_id, att["id"])) if uid != "me" else conn.download_attachment(msg_id, att["id"]))
att_result = _scan_bytes(att_bytes, att_name) att_result = _scan_bytes(att_bytes, att_name, lang=ocr_lang)
att_cprs = att_result.get("cprs", []) att_cprs = att_result.get("cprs", [])
all_cprs.extend(att_cprs) all_cprs.extend(att_cprs)
if scan_emails:
all_emails.extend(att_result.get("emails", []))
if scan_phones:
all_phones.extend(att_result.get("phones", []))
att_results.append({"name": att_name, "cpr_count": len(att_cprs)}) att_results.append({"name": att_name, "cpr_count": len(att_cprs)})
except Exception as att_err: except Exception as att_err:
broadcast("scan_error", {"file": att_name, "error": str(att_err)}) broadcast("scan_error", {"file": att_name, "error": str(att_err)})
if all_cprs: _distinct_emails = list(dict.fromkeys(e["formatted"] for e in all_emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in all_phones))
if all_cprs or (not cpr_only and (_distinct_emails or _distinct_phones)):
meta["_thumb"] = _placeholder_svg(".eml", subject) meta["_thumb"] = _placeholder_svg(".eml", subject)
meta["_thumb_is_jpeg"] = False meta["_thumb_is_jpeg"] = False
meta["_attachments"] = att_results meta["_attachments"] = att_results
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_email_pii = _get_pii_counts(body_text) if scan_email_body else {} _email_pii = _get_pii_counts(body_text) if scan_email_body else {}
meta["_transfer_risk"] = _check_transfer_risk(meta) meta["_transfer_risk"] = _check_transfer_risk(meta)
meta["_special_category"] = _check_special_category( meta["_special_category"] = _check_special_category(
body_text if scan_email_body else "", all_cprs) body_text if scan_email_body else "", all_cprs)
# Store a short excerpt so preview still works if Graph is unavailable
meta["_body_excerpt"] = body_text[:500].strip() if body_text else ""
_broadcast_card(meta, all_cprs, pii_counts=_email_pii) _broadcast_card(meta, all_cprs, pii_counts=_email_pii)
del body_text # free email text — may be large for HTML-rich emails del body_text # free email text — may be large for HTML-rich emails
@ -1093,10 +1193,12 @@ def run_scan(options: dict):
else: else:
content = conn.download_item(meta) content = conn.download_item(meta)
# CPR scan — skip for video and audio (metadata-only; no text layer) # CPR/email/phone scan — skip for video and audio (metadata-only; no text layer)
_media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS _media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS
result = {"cprs": [], "dates": []} if _media_only else _scan_bytes(content, name) result = {"cprs": [], "dates": [], "emails": [], "phones": []} if _media_only else _scan_bytes(content, name, lang=ocr_lang)
cprs = result.get("cprs", []) cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─ # ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─
_face_count = 0 _face_count = 0
@ -1113,12 +1215,14 @@ def run_scan(options: dict):
# Apply filters: distinct CPR threshold and GPS suppression # Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs)) _distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and ( _exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author")) not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
) )
# Flag item if CPRs found (above threshold), faces detected, or EXIF PII found # Flag item if CPRs/emails/phones found, faces detected, or EXIF PII found
if (_cpr_qualifies and cprs) or _face_count > 0 or _exif_has_pii: if (_cpr_qualifies and cprs) or (not cpr_only and (_distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii)):
# Make thumbnail # Make thumbnail
if ext in {".jpg", ".jpeg", ".png"} and PIL_OK: if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
thumb = _make_thumb(content, name) thumb = _make_thumb(content, name)
@ -1154,6 +1258,8 @@ def run_scan(options: dict):
meta["_special_category"] = _sc meta["_special_category"] = _sc
meta["_face_count"] = _face_count meta["_face_count"] = _face_count
meta["_exif"] = _exif meta["_exif"] = _exif
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_broadcast_card(meta, cprs, pii_counts=_file_pii) _broadcast_card(meta, cprs, pii_counts=_file_pii)
else: else:
del content # no hits — free raw bytes immediately del content # no hits — free raw bytes immediately
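
The diff expresses the file-level flagging rule in two forms: `run_file_scan` uses a negated "skip" condition, while `run_scan` uses a positive "flag" condition. The sketch below restates both as pure functions and checks exhaustively that they are complements. The names `should_flag`, `should_skip`, and `predicates_agree` are hypothetical, and `cpr_ok` stands in for the `_cpr_qualifies and cprs` expression.

```python
from itertools import product


def should_flag(cpr_ok: bool, cpr_only: bool, emails: list,
                phones: list, faces: int, exif_pii: bool) -> bool:
    """Positive form, as in run_scan: flag on CPR hits, or on other PII unless cpr_only."""
    return bool(cpr_ok or (not cpr_only and (emails or phones or faces > 0 or exif_pii)))


def should_skip(cpr_ok: bool, cpr_only: bool, emails: list,
                phones: list, faces: int, exif_pii: bool) -> bool:
    """Negated form, as in run_file_scan: skip when nothing qualifies."""
    return bool((not cpr_ok) and (
        cpr_only or (not emails and not phones and faces == 0 and not exif_pii)))


def predicates_agree() -> bool:
    """Check every input combination: skip must equal not-flag (De Morgan)."""
    cases = product([False, True], [False, True],
                    [[], ["a@example.dk"]], [[], ["12345678"]],
                    [0, 2], [False, True])
    return all(should_skip(*c) == (not should_flag(*c)) for c in cases)
```

With `cpr_only` set, emails, phone numbers, faces, and EXIF data alone never flag an item; only a qualifying CPR match does.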


@ -43,6 +43,7 @@ _DEFAULT_JOB: dict[str, Any] = {
"profile_id": "", "profile_id": "",
"auto_email": False, "auto_email": False,
"auto_retention": False, "auto_retention": False,
"report_only": False,
"retention_years": None, "retention_years": None,
"fiscal_year_end": None, "fiscal_year_end": None,
} }
@ -270,6 +271,35 @@ class ScanScheduler:
}) })
from routes import state from routes import state
# ── Report-only path: skip scan, email latest DB results ──────────
if job_cfg.get("report_only"):
if not _m.flagged_items and _m.DB_OK:
try:
_db_inst = _m._get_db()
_db_rows = _db_inst.get_session_items() if _db_inst else []
if _db_rows:
_m.flagged_items[:] = _db_rows
except Exception:
pass
if not _m.flagged_items:
raise RuntimeError(
"No scan results available — run a scan first")
run["flagged"] = len(_m.flagged_items)
run["scanned"] = 0
run["status"] = "completed"
try:
self._send_email_report(job_cfg)
run["emailed"] = 1
except Exception as _re:
run["status"] = "failed"
run["error"] = f"Email failed: {_re}"
_m.broadcast("scheduler_done", {
"flagged": run["flagged"], "scanned": 0,
"emailed": run["emailed"], "job_name": job_cfg.get("name", ""),
})
return
# If connector not set, attempt to restore from saved config # If connector not set, attempt to restore from saved config
if not state.connector or not state.connector.is_authenticated(): if not state.connector or not state.connector.is_authenticated():
try: try:
@ -310,6 +340,16 @@ class ScanScheduler:
# Fire file scan for each file source in the profile # Fire file scan for each file source in the profile
# file_sources may be IDs (strings) or full dicts — resolve either # file_sources may be IDs (strings) or full dicts — resolve either
_all_file_sources = {s["id"]: s for s in (_m._load_file_sources() or []) if isinstance(s, dict)} _all_file_sources = {s["id"]: s for s in (_m._load_file_sources() or []) if isinstance(s, dict)}
# Merge per-scan options from the profile so the file scan honours
# cpr_only/ocr_lang/scan_photos/etc. (the browser does this in
# startScan(); the scheduler must mirror it).
_profile_opts = options.get("options", {}) or {}
_FS_OPT_KEYS = (
"scan_photos", "skip_gps_images", "min_cpr_count",
"scan_emails", "scan_phones", "cpr_only", "ocr_lang",
"max_file_mb",
)
_fs_extra = {k: _profile_opts[k] for k in _FS_OPT_KEYS if k in _profile_opts}
for fs in options.get("file_sources", []): for fs in options.get("file_sources", []):
# Resolve string IDs to full source dicts # Resolve string IDs to full source dicts
if isinstance(fs, str): if isinstance(fs, str):
@ -317,6 +357,7 @@ class ScanScheduler:
if not isinstance(fs, dict) or not fs.get("path"): if not isinstance(fs, dict) or not fs.get("path"):
logger.warning("[scheduler] skipping invalid file source: %r", fs) logger.warning("[scheduler] skipping invalid file source: %r", fs)
continue continue
fs = {**fs, **_fs_extra}
try: try:
_m.run_file_scan(fs) _m.run_file_scan(fs)
except Exception as _fse: except Exception as _fse:
@ -432,7 +473,7 @@ class ScanScheduler:
logger.info("[scheduler] Profile '%s': sources=%s, users=%d", logger.info("[scheduler] Profile '%s': sources=%s, users=%d",
p.get("name", pid), opts["sources"], len(opts.get("user_ids", []))) p.get("name", pid), opts["sources"], len(opts.get("user_ids", [])))
_m.broadcast("scheduler_debug", { _m.broadcast("scheduler_debug", {
"msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get("user_ids",[]))}"}) "msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get('user_ids',[]))}"})
return opts return opts
logger.info("[scheduler] Profile '%s' not found — using saved settings", pid) logger.info("[scheduler] Profile '%s' not found — using saved settings", pid)
_m.broadcast("scheduler_debug", {"msg": f"Profile id '{pid}' not found — falling back to saved settings"}) _m.broadcast("scheduler_debug", {"msg": f"Profile id '{pid}' not found — falling back to saved settings"})
@ -455,11 +496,15 @@ class ScanScheduler:
raise RuntimeError("No email recipients configured") raise RuntimeError("No email recipients configured")
job_name = job_cfg.get("name", "scheduled scan") job_name = job_cfg.get("name", "scheduled scan")
subject = f"GDPR Scanner — {job_name} {datetime.now().strftime('%Y-%m-%d %H:%M')}" subject = f"GDPR Scanner — {job_name} {datetime.now().strftime('%Y-%m-%d %H:%M')}"
if job_cfg.get("report_only"):
scan_line = f"Report on latest scan results. {len(_m.flagged_items)} item(s) flagged."
else:
scan_line = f"Scan completed. {len(_m.flagged_items)} item(s) flagged."
body = ( body = (
"<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>" "<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>"
"<h2 style='color:#1F3864'>&#128336; GDPR Scanner — scheduled scan report</h2>" "<h2 style='color:#1F3864'>&#128336; GDPR Scanner — scheduled scan report</h2>"
f"<p>Job: <strong>{job_name}</strong></p>" f"<p>Job: <strong>{job_name}</strong></p>"
f"<p>Scan completed. {len(_m.flagged_items)} item(s) flagged.</p>" f"<p>{scan_line}</p>"
f"<p>Report attached: {fname}</p></body></html>") f"<p>Report attached: {fname}</p></body></html>")
from routes.email import _send_email_graph from routes.email import _send_email_graph
from routes import state from routes import state
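
The scheduler change above copies a fixed set of per-scan options from the profile onto each file source before calling `run_file_scan`. A minimal sketch of that merge follows, assuming plain dicts. `merge_profile_options` is a hypothetical helper name; the key list is taken from `_FS_OPT_KEYS` in the diff.

```python
FS_OPT_KEYS = (
    "scan_photos", "skip_gps_images", "min_cpr_count",
    "scan_emails", "scan_phones", "cpr_only", "ocr_lang",
    "max_file_mb",
)


def merge_profile_options(file_source: dict, profile_opts: dict) -> dict:
    """Return a new source dict in which whitelisted profile options override the source."""
    extra = {k: profile_opts[k] for k in FS_OPT_KEYS if k in profile_opts}
    # The later mapping wins in a dict merge, so profile values take precedence.
    return {**file_source, **extra}
```

Because the merge builds a new dict, the stored file source is left untouched, and keys outside the whitelist never leak from the profile into the scan.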

sftp_connector.py Normal file

@ -0,0 +1,292 @@
"""
sftp_connector.py SFTP file iterator for GDPR Scanner.
Provides SFTPScanner.iter_files() which yields (relative_path, bytes, metadata)
for files on an SFTP/SSH server, using the same interface as FileScanner so that
run_file_scan() in scan_engine.py works identically for all three source types.
Optional dependency:
paramiko>=3.4 SSH/SFTP client (pip install paramiko)
If paramiko is not installed, SFTP_OK is False and callers must check before use.
"""
from __future__ import annotations
import stat
import time
from pathlib import PurePosixPath
from typing import Iterator
from file_scanner import SKIP_DIRS, MAX_FILE_BYTES, _skip, _error, KEYCHAIN_SERVICE
# ── Optional dependency ───────────────────────────────────────────────────────
try:
import paramiko
SFTP_OK = True
except ImportError:
SFTP_OK = False
try:
import keyring as _keyring
_KEYRING_OK = True
except ImportError:
_KEYRING_OK = False
# ── Credential helpers ────────────────────────────────────────────────────────
def get_sftp_password(host: str, user: str, keychain_key: str | None = None) -> str | None:
"""Return SFTP password or key passphrase from OS keychain."""
if not _KEYRING_OK:
return None
account = keychain_key or f"sftp:{user}@{host}"
try:
return _keyring.get_password(KEYCHAIN_SERVICE, account) or None
except Exception:
return None
def store_sftp_password(host: str, user: str, password: str,
keychain_key: str | None = None) -> bool:
"""Store SFTP password or passphrase in the OS keychain. Returns True on success."""
if not _KEYRING_OK:
return False
account = keychain_key or f"sftp:{user}@{host}"
try:
_keyring.set_password(KEYCHAIN_SERVICE, account, password)
return True
except Exception:
return False
# ── SFTPScanner ───────────────────────────────────────────────────────────────
class SFTPScanner:
"""SFTP file iterator — identical iter_files() interface to FileScanner."""
def __init__(
self,
host: str,
root_path: str,
username: str,
port: int = 22,
auth_type: str = "password", # "password" | "key"
password: str | None = None,
key_path: str | None = None,
passphrase: str | None = None,
keychain_key: str | None = None,
max_file_bytes: int = MAX_FILE_BYTES,
label: str = "",
):
self.host = host
self.port = port
self.root_path = root_path.rstrip("/") or "/"
self.username = username
self.auth_type = auth_type
self.key_path = key_path
self.keychain_key = keychain_key
self.max_file_bytes = max_file_bytes
self.label = label or f"{username}@{host}"
# Resolve credentials from keychain if not provided directly
self._password = password
self._passphrase = passphrase
if not self._password and auth_type == "password":
self._password = get_sftp_password(host, username, keychain_key)
if not self._passphrase and auth_type == "key" and key_path:
self._passphrase = get_sftp_password(host, username, keychain_key)
@staticmethod
def sftp_available() -> bool:
return SFTP_OK
@property
def source_type(self) -> str:
return "sftp"
# ── Public ────────────────────────────────────────────────────────────────
def iter_files(
self,
extensions: set[str] | None = None,
progress_cb=None,
) -> Iterator[tuple[str, bytes | None, dict]]:
"""Yield (relative_path, content_bytes, metadata) for every scannable file.
Same contract as FileScanner.iter_files() oversized and unreadable files
yield a sentinel with content=None and meta['skipped']=True.
"""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
exts = extensions or DEFAULT_EXTENSIONS
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
connect_kwargs: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
pkey = _load_pkey(self.key_path, self._passphrase)
connect_kwargs["pkey"] = pkey
else:
connect_kwargs["password"] = self._password or ""
# Disable agent and key lookup when using password so paramiko doesn't
# prompt interactively when the server advertises pubkey auth.
connect_kwargs["look_for_keys"] = False
connect_kwargs["allow_agent"] = False
ssh.connect(**connect_kwargs)
try:
sftp = ssh.open_sftp()
try:
yield from self._walk(sftp, self.root_path, exts, progress_cb)
finally:
sftp.close()
finally:
ssh.close()
def _ssh_connect(self):
"""Return a connected paramiko SSHClient. Caller must call .close()."""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
kw: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
kw["pkey"] = _load_pkey(self.key_path, self._passphrase)
else:
kw["password"] = self._password or ""
kw["look_for_keys"] = False
kw["allow_agent"] = False
ssh.connect(**kw)
return ssh
def read_file(self, remote_path: str) -> bytes:
"""Download and return the raw bytes of a single remote file."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "rb") as fh:
return fh.read()
finally:
sftp.close()
finally:
ssh.close()
def write_file(self, remote_path: str, content: bytes) -> None:
"""Write content to remote_path on the SFTP server, overwriting if it exists."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "wb") as fh:
fh.write(content)
finally:
sftp.close()
finally:
ssh.close()

    # ── Private walker ────────────────────────────────────────────────────────
    def _walk(
        self,
        sftp,
        directory: str,
        exts: set[str],
        progress_cb,
    ) -> Iterator[tuple[str, bytes | None, dict]]:
        source_root = f"sftp://{self.username}@{self.host}{self.root_path}"
        try:
            entries = sftp.listdir_attr(directory)
        except OSError as e:
            rel = _rel(directory, self.root_path) or "."
            yield _error(rel, str(e), "sftp", source_root)
            return
        for attr in entries:
            name = attr.filename
            if name.startswith("."):
                continue
            if name.lower() in SKIP_DIRS:
                continue
            full_remote = f"{directory}/{name}".replace("//", "/")
            rel = _rel(full_remote, self.root_path)
            if attr.st_mode is not None and stat.S_ISDIR(attr.st_mode):
                yield from self._walk(sftp, full_remote, exts, progress_cb)
                continue
            ext = PurePosixPath(name).suffix.lower()
            if ext not in exts:
                continue
            size = attr.st_size or 0
            if size > self.max_file_bytes:
                yield _skip(rel, size, "sftp", source_root)
                continue
            if progress_cb:
                progress_cb(rel)
            modified = (
                time.strftime("%Y-%m-%d", time.gmtime(attr.st_mtime))
                if attr.st_mtime else ""
            )
            meta = {
                "size_kb": round(size / 1024, 1),
                "modified": modified,
                "source_type": "sftp",
                "source_root": source_root,
                "full_path": full_remote,
                "skipped": False,
            }
            try:
                with sftp.open(full_remote, "rb") as fh:
                    content = fh.read(self.max_file_bytes)
                yield rel, content, meta
            except OSError as e:
                yield _error(rel, str(e), "sftp", source_root)


# ── Helpers ───────────────────────────────────────────────────────────────────
def _rel(full_path: str, root: str) -> str:
    """Return path relative to root, stripping leading slash."""
    if full_path.startswith(root):
        return full_path[len(root):].lstrip("/")
    return full_path.lstrip("/")


def _load_pkey(key_path: str, passphrase: str | None):
    """Load a private key from disk, trying RSA → Ed25519 → ECDSA → DSS."""
    for cls in (
        paramiko.RSAKey,
        paramiko.Ed25519Key,
        paramiko.ECDSAKey,
        paramiko.DSSKey,
    ):
        try:
            return cls.from_private_key_file(key_path, password=passphrase)
        except paramiko.ssh_exception.SSHException:
            continue
        except FileNotFoundError:
            raise
    raise ValueError(f"Unrecognised private key format: {key_path}")


@@ -29,8 +29,58 @@ Never revert to `!!window._googleConnected` / `_fileSources.length > 0` — thos
- **`user_ids = "all"` must be deferred** — if `S._allUsers` is empty when `_applyProfile()` runs, set `window._pendingProfileAllUsers = true` instead of calling `.forEach()` on an empty array. `loadUsers()` checks this flag after populating `S._allUsers` and selects everyone. Do not remove this — reverting will silently leave all accounts unchecked whenever a profile is chosen on a fast machine before the user list loads.
- **Source checkboxes may not exist yet**: `_applyProfile()` calls `renderSourcesPanel()` first if `#sourcesPanel` contains no `input[data-source-id]` nodes. Same guard used in `loadUsers()`. Without it, `querySelectorAll` returns nothing and the profile's source selection is discarded; the next `renderSourcesPanel()` call re-renders all sources as checked (their default).
## SSE teardown — scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. Close `S.es` only when all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans**: `S._userStartedScan` is false for scheduler-triggered runs, so SSE is never closed and future scheduler events continue to arrive.
- **Two separate abort events**: `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort`.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — `scan_cancelled` unconditionally closes the SSE; `google_scan_done` checks whether other scans are still running before closing.
- **`scan_phase` replay sets running flags — handled by `sse_replay_done`** — the `scan_phase` handler sets running flags to `true` whenever all flags are `false` and a source keyword is found in the phase text. On page refresh this fires during SSE replay of a completed scan, temporarily making the scan appear running. The `sse_replay_done` handler retries `loadHistorySession(null)` if no scan is running and `S._historyRefScanId` is still `null` after replay. Do not remove either the flag-setting logic or the retry.
- **Google Drive uses a lazy generator, not `list()`**: `iter_drive_files()` is iterated directly so `_check_abort()` fires between items. Wrapping it in `list()` blocks the thread for the entire enumeration.
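
The teardown rule in the first bullet of this section can be sketched as a small pure function. This is only an illustration under assumptions, not the code in `scan.js`: `shouldCloseStream`, `OTHER_FLAGS`, and the `flags` argument are hypothetical names; only the three done-event names and the three `_*ScanRunning` flags come from the notes above.

```javascript
// Hypothetical lookup: for each "done" event, the running flags of the
// OTHER scan types that must be false before the stream may be closed.
const OTHER_FLAGS = {
  scan_done: ['_googleScanRunning', '_fileScanRunning'],
  google_scan_done: ['_m365ScanRunning', '_fileScanRunning'],
  file_scan_done: ['_m365ScanRunning', '_googleScanRunning'],
};

// Returns true only when no other concurrent scan is still running.
// `flags` stands in for the shared state object (S in the notes above).
function shouldCloseStream(eventName, flags) {
  const others = OTHER_FLAGS[eventName];
  if (!others) return false; // unknown event: keep the stream open
  return others.every((name) => !flags[name]);
}
```

A caller would close the stream only when this returns `true`; the scheduler case (`S._userStartedScan` false) is deliberately left out of the sketch.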
## Scan history browser — history.js + results.js
- **`S._historyRefScanId`** — `null` = live/SSE mode **or** the default open-items view; positive int = viewing a past session. Set by `loadHistorySession()`; cleared by `exitHistoryMode()`.
- **`loadHistorySession(null)` → `loadOpenItems()`**: passing `null` no longer resolves to the latest session. It now loads **all open (unactioned) items across every scan** via `GET /api/db/flagged` (no `ref`), leaves `_historyRefScanId` null, and shows no history banner. The "Open items" banner button (`onclick="loadHistorySession(null)"`, key `history_btn_latest`) therefore returns to this open-items view. Specific sessions are still loaded with a positive `ref`, which keeps the re-scan resolved-diff. Do not revert `null` to "resolve latest ref" — that reintroduces the "only the last scan is shown" complaint.
- **Auto-load on page load**: `_sseWatchdog()` in `results.js` calls `window.loadHistorySession?.(null)` whenever `/api/scan/status` reports neither `running` (M365 + file lock) nor `google_running` (Google lock) **and** nothing is shown yet (`!S._historyRefScanId && !S.flaggedData.length`). This is **not one-shot** — it retries on every 4s poll until a session is restored, because (a) the replay buffer is empty after a server restart so `sse_replay_done` never fires, and (b) a completed scan's replayed `scan_phase` can leave a running flag set that would otherwise block the load forever. Because both locks are confirmed free, the watchdog clears the stale `_m365/_google/_fileScanRunning` flags before calling. Do not revert to a one-shot `_initialStatusChecked` gate — that reintroduces the "blank grid after refresh/restart" bug. `/api/scan/status` **must** report `google_running` separately; `running` alone misses live Google scans. The `sse_replay_done` handler in `scan.js` still retries for the non-empty-buffer (no-restart) case.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Do not hide/show from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` so the outside-click handler works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation**: `invalidateHistoryCache()` clears `_sessions` and `_latestRefScanId`. All three `*_done` SSE handlers call `window.invalidateHistoryCache?.()`.
- **Re-scan diff** — items present in the previous session but absent from the current one are tagged `_resolved: true`, rendered with `.card-resolved` and a green ✓ badge, and NOT added to `S.flaggedData` (grid-only, cannot be bulk-selected or exported).
- **Mode transitions**: `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid.
- **`renderGrid(files)` hides the landing cards** — whenever `files.length > 0` it hides `#emptyState` and `#lastScanSummary` and shows `#grid`. This is centralised here because the live `scan_file_flagged` handler (`scan.js`) shows the grid but does NOT clear those panels, so results would render *underneath* a still-visible landing/last-scan card until a manual refresh. Do not move this hiding back into individual callers — every render path (live SSE, `loadOpenItems`, history, filters) must clear the landing. The empty case (`files.length === 0`) is left untouched so callers still control the empty/landing state.
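
The re-scan diff bullet in this section amounts to a set difference on item ids. A minimal sketch, assuming each item carries a unique `id`; `findResolved` is a hypothetical helper name, and the real code in `history.js` may structure this differently:

```javascript
// Hypothetical helper: items present in the previous session but absent
// from the current one are returned as copies tagged `_resolved: true`.
function findResolved(currentItems, prevItems) {
  const currentIds = new Set(currentItems.map((f) => f.id));
  return prevItems
    .filter((f) => !currentIds.has(f.id))
    .map((f) => ({ ...f, _resolved: true }));
}
```

Per the note above, the returned items would be rendered in the grid only and kept out of `S.flaggedData`, so they cannot be bulk-selected or exported.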
## Card user/group badge — results.js
- **`_accountPill(f)`** builds the account/role pill for both card layouts (list + grid). The **group badge is driven by `f.user_role`** (`student`/`staff`) alone, so it renders even with no display name — items from scans saved before `account_name` was persisted (DB migration 11) have only `user_role` + `account_id`. The user label resolves best-effort: `f.account_name` → `S._allUsers` match (by `id` or `email`) → email-style `account_id` → omit. Do not re-nest the role badge inside an `account_name` check (the old bug) — that hides the group badge for legacy items. Both layouts call `_accountPill(f)`; keep them sharing the one helper.
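
The best-effort label chain described above could look roughly like this. It is a sketch under assumptions: `resolveAccountLabel` is a hypothetical name, the user objects are assumed to expose `id`, `email`, and a `displayName` field (the notes do not name the display field), and "email-style" is approximated as "contains `@`".

```javascript
// Hypothetical resolver for the user label; returns '' when the label
// should be omitted. The role badge is handled separately and does not
// depend on this result.
function resolveAccountLabel(f, allUsers) {
  if (f.account_name) return f.account_name;
  const match = allUsers.find(
    (u) => u.id === f.account_id || u.email === f.account_id
  );
  if (match) return match.displayName || match.email || '';
  if (typeof f.account_id === 'string' && f.account_id.includes('@')) {
    return f.account_id; // email-style id is readable enough to show
  }
  return '';
}
```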
## CPR cross-referencing — results.js
- **`_loadRelated(f)`** — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview`.
- **`window._openRelated(id, itemData)`** — looks up `id` in `S.flaggedData` first, falls back to `itemData` from the API response for items not yet in the grid.
## Sources panel resize — log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()`. Clears inline height, reads `scrollHeight`, then restores a saved preference from `localStorage` (`gdpr_sources_h`) or pins to `scrollHeight`.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. Captures `scrollHeight` as hard max on `pointerdown`; saves to `localStorage` on release.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered**: `scrollHeight` will be 0.
## Viewer mode — viewer.js
- **`window.VIEWER_MODE`** — injected by Jinja2. `auth.js` adds `viewer-mode` class to `<body>`; all hide rules are CSS (`body.viewer-mode …`) except `delBtn` which is also guarded in JS.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. If `VIEWER_SCOPE.role` is set, `auth.js` pre-sets `#filterRole` and hides the dropdown.
- **Token onclick attributes** — Copy/Revoke buttons pass the token as a single-quoted JS string literal, never via `JSON.stringify` (which produces double-quoted strings that break `onclick="…"` attributes).
- **Share link base URL**: `_getShareBaseUrl()` uses `window.location.origin` whenever the page is served over HTTPS or from a non-localhost host (a reverse-proxied hostname or LAN IP is already routable, and rewriting it to `http://<LAN-IP>` would bypass the proxy's TLS). Only when browsing at `localhost`/`127.0.0.1` over HTTP does it fetch `/api/local_ip` (LAN IP via UDP probe to `8.8.8.8`) so copied links work from other machines. The result is cached in `_shareBaseUrl` so Copy buttons stay within the click gesture. Both `createShareLink` and `copyTokenLink` are `async`. Do not make it return bare `window.location.origin` unconditionally — that reintroduces unusable `127.0.0.1` links.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`. `switchSettingsTab('security')` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`.
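
The share-link rule above reduces to one predicate on the page location. A minimal sketch; `needsLanIpLookup` is a hypothetical name, and `loc` is assumed to have the `protocol` and `hostname` fields of `window.location`:

```javascript
// Hypothetical predicate: true only for plain-HTTP loopback browsing,
// the one case where the origin is not routable from other machines.
function needsLanIpLookup(loc) {
  const isLoopback =
    loc.hostname === 'localhost' || loc.hostname === '127.0.0.1';
  return loc.protocol === 'http:' && isLoopback;
}
```

When it returns `false` the origin can be used as-is; when `true`, a caller would fetch `/api/local_ip` once and cache the result, as the note describes.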
## Gotchas
- **`navigator.clipboard` is `undefined` over plain HTTP** — the app is normally reached at `http://<LAN-IP>:5100`, a non-secure context where the Clipboard API does not exist, so calling `navigator.clipboard.writeText(...)` throws synchronously (a `.catch()` on it never runs). Always copy via `window._copyText(text, btn)` (defined in `viewer.js`) — it feature-detects the API and falls back to `document.execCommand('copy')`, then to a `prompt()`. Because `execCommand` needs a user gesture, don't `await` network calls between the click and the copy; `_getShareBaseUrl()` caches its result for this reason.
- **`scheduler.js` strings must use `t()`** — frequency labels, "Next", "Running...", "Disabled", empty-job text, and empty-history text all have translation keys. Do not hard-code English strings in `schedLoad()` or `schedRenderJobs()`.
- **Scheduler UI — `schedToggleReportOnly()`** — dims the Profile row, shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
- **Profile editor accounts** — default to unchecked. Only explicitly saved `user_ids` are checked.
- **Date presets** — stored as `years * 365` (integer days). Do not use `* 365.25`.
-- **`copyTokenLink` is async** — called from `onclick` attributes as a fire-and-forget (the Promise is unhandled, which is fine). It `await`s `_getShareBaseUrl()` to get the machine's LAN IP before building the URL. Do not make it synchronous or revert to `window.location.origin` directly.
+- **`copyTokenLink` is async** — called from `onclick` as fire-and-forget. Do not make it synchronous.
- **Escape scan-derived strings with `esc()`**: `results.js` defines `esc()` (escapes `& < > " '`). Every value that originates from scanned content (`f.name`, `f.account_name`, `f.folder`, `f.source`, `f.modified`, `label`, image `alt`, and the same fields on `item`/related rows) must pass through `esc()` before going into `innerHTML` or a `title=`/`alt=` attribute. These are attacker-influenceable (e.g. a file named with markup), so an unescaped interpolation is stored XSS — including in shared read-only viewer sessions. Numeric counts (`cpr_count`, `size_kb`) don't need it. When embedding an object in an `onclick` payload, also `.replace(/"/g,'&quot;')` the `JSON.stringify(...)`.
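
For reference, an escaper covering the five characters listed above might look like the following. This is an illustrative sketch, not the actual `esc()` from `results.js`; `escSketch` is a hypothetical name and the entity choices (for example `&#39;` for the single quote) are assumptions.

```javascript
// Hypothetical HTML escaper for & < > " '. The ampersand is replaced
// first so entities produced by later steps are not escaped again.
function escSketch(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

A scan-derived name such as `<img src=x onerror=alert(1)>` then reaches `innerHTML` as inert text rather than markup.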


@@ -378,6 +378,19 @@ function getGoogleScanOptions() {
 // ── File sources pane ─────────────────────────────────────────────────────────
+function _srcIcon(s) {
+  if (s.source_type === 'sftp') return '\uD83D\uDD12';
+  const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
+  return isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
+}
+function _srcSubtitle(s) {
+  if (s.source_type === 'sftp') {
+    return _esc((s.sftp_user||'')+'@'+(s.sftp_host||'')+(s.path||'/'));
+  }
+  return _esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'');
+}
 function srcFileRenderList() {
   const list = document.getElementById('srcFileList');
   if (!list) return;
@@ -386,8 +399,7 @@ function srcFileRenderList() {
     return;
   }
   list.innerHTML = S._fileSources.map(function(s) {
-    const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
-    const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
+    const icon = _srcIcon(s);
     const sid = _esc(s.id||'');
     const slabel = _esc(s.label||s.path||'');
     return '<div class="fsrc-row">'
@@ -398,11 +410,47 @@ function srcFileRenderList() {
       +'<button class="btn-edit" onclick="srcFileEdit(\''+sid+'\')" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer">'+t('m365_fsrc_edit_btn','Edit')+'</button>'
       +'<button class="btn-del" onclick="srcFileDelete(\''+sid+'\',\''+slabel+'\')">'+t('m365_profile_delete','Delete')+'</button>'
       +'</div></div>'
-      +'<div class="fsrc-row-path">'+_esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'')+'</div>'
+      +'<div class="fsrc-row-path">'+_srcSubtitle(s)+'</div>'
       +'</div>';
   }).join('');
 }
+function srcFileTypeSelect(type) {
+  document.getElementById('srcFileSourceType').value = type;
+  var pathRow = document.getElementById('srcFilePathRow');
+  var smbFields = document.getElementById('srcFileSmbFields');
+  var sftpFields= document.getElementById('srcFileSftpFields');
+  if (pathRow) pathRow.style.display = type === 'sftp' ? 'none' : '';
+  if (smbFields) smbFields.style.display = type === 'smb' ? 'flex' : 'none';
+  if (sftpFields)sftpFields.style.display= type === 'sftp' ? 'flex' : 'none';
+  ['srcTypeLocal','srcTypeSmb','srcTypeSftp'].forEach(function(id) {
+    var btn = document.getElementById(id);
+    if (!btn) return;
+    var active = (id === 'srcType' + type.charAt(0).toUpperCase() + type.slice(1));
+    btn.style.background = active ? 'var(--accent)' : 'none';
+    btn.style.color = active ? '#fff' : 'var(--muted)';
+  });
+}
+function srcFileAutoNameSftp() {
+  var labelEl = document.getElementById('srcFileLabel');
+  if (labelEl && labelEl._userEdited) return;
+  var host = (document.getElementById('srcFileSftpHost')||{}).value || '';
+  if (labelEl && host) labelEl.value = host;
+}
+function srcFileSftpAuthSelect(authType) {
+  document.getElementById('srcFileSftpAuth').value = authType;
+  var pwFields = document.getElementById('srcSftpPwFields');
+  var keyFields = document.getElementById('srcSftpKeyFields');
+  var btnPw = document.getElementById('srcSftpAuthPw');
+  var btnKey = document.getElementById('srcSftpAuthKey');
+  if (pwFields) pwFields.style.display = authType === 'password' ? '' : 'none';
+  if (keyFields) keyFields.style.display = authType === 'key' ? 'flex' : 'none';
+  if (btnPw) { btnPw.style.background = authType==='password'?'var(--accent)':'none'; btnPw.style.color = authType==='password'?'#fff':'var(--muted)'; }
+  if (btnKey) { btnKey.style.background = authType==='key'?'var(--accent)':'none'; btnKey.style.color = authType==='key'?'#fff':'var(--muted)'; }
+}
 function srcFileDetectSmb() {
   const p = document.getElementById('srcFilePath').value;
   const isSmb = p.startsWith('//') || p.startsWith('\\\\');
@@ -428,29 +476,79 @@ function srcFileAutoName() {
 async function srcFileAdd() {
   const label = document.getElementById('srcFileLabel').value.trim();
+  const sourceType = (document.getElementById('srcFileSourceType')||{}).value || 'local';
+  const stat = document.getElementById('srcFileStatus');
+  const editIdEl = document.getElementById('srcFileEditId');
+  const existingId = editIdEl ? editIdEl.value : '';
+  if (!label) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_name_required','Name is required.'); document.getElementById('srcFileLabel').focus(); return; }
+  stat.style.color='var(--muted)'; stat.textContent=t('m365_fsrc_saving','Saving...');
+  var body = {label, source_type: sourceType};
+  if (existingId) body.id = existingId;
+  if (sourceType === 'sftp') {
+    const sftpHost = document.getElementById('srcFileSftpHost').value.trim();
+    const sftpUser = document.getElementById('srcFileSftpUser').value.trim();
+    const sftpPath = document.getElementById('srcFileSftpPath').value.trim() || '/';
+    const sftpPort = parseInt(document.getElementById('srcFileSftpPort').value) || 22;
+    const sftpAuth = document.getElementById('srcFileSftpAuth').value || 'password';
+    if (!sftpHost) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_host_required','SFTP host is required.'); return; }
+    if (!sftpUser) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_user_required','SFTP username is required.'); return; }
+    Object.assign(body, {sftp_host:sftpHost, sftp_port:sftpPort, sftp_user:sftpUser, sftp_auth:sftpAuth, path:sftpPath});
+    if (sftpAuth === 'password') {
+      const sftpPw = document.getElementById('srcFileSftpPw').value;
+      if (sftpPw) {
+        try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:sftpPw})}); } catch(e){}
+      }
+    } else {
+      // Upload key file if one is selected
+      const keyFileEl = document.getElementById('srcFileSftpKeyFile');
+      const keyStatusEl = document.getElementById('srcFileSftpKeyStatus');
+      const keyPathEl = document.getElementById('srcFileSftpKeyPath');
+      if (keyFileEl && keyFileEl.files.length && !keyPathEl.value) {
+        try {
+          const fd = new FormData(); fd.append('key_file', keyFileEl.files[0]);
+          const kr = await fetch('/api/file_sources/upload_key',{method:'POST',body:fd});
+          const kd = await kr.json();
+          if (kd.error) { stat.style.color='var(--danger)'; stat.textContent=kd.error; return; }
+          keyPathEl.value = kd.key_path;
+          if (keyStatusEl) keyStatusEl.textContent = t('m365_fsrc_sftp_key_uploaded','Key uploaded');
+        } catch(e){ stat.style.color='var(--danger)'; stat.textContent=e.message; return; }
+      }
+      body.sftp_key_path = keyPathEl ? keyPathEl.value : '';
+      const passphrase = (document.getElementById('srcFileSftpPassphrase')||{}).value || '';
+      if (passphrase) {
+        const passphraseKey = sftpHost+':'+sftpUser+':passphrase';
+        try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:passphrase,keychain_key:passphraseKey})}); } catch(e){}
+        body.keychain_key = passphraseKey;
+      }
+    }
+  } else {
   const path = document.getElementById('srcFilePath').value.trim();
   const smbHost = document.getElementById('srcFileSmbHost').value.trim();
   const smbUser = document.getElementById('srcFileSmbUser').value.trim();
   const smbPw = document.getElementById('srcFileSmbPw').value;
-  const stat = document.getElementById('srcFileStatus');
-  if (!label) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_name_required','Name is required.'); document.getElementById('srcFileLabel').focus(); return; }
   if (!path) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_path_required','Path is required.'); return; }
-  stat.style.color='var(--muted)'; stat.textContent=t('m365_fsrc_saving','Saving...');
+  Object.assign(body, {path, smb_host:smbHost, smb_user:smbUser});
   if (smbPw && smbUser) {
-    try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
+    try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'smb',smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
   }
+  }
   try {
-    const editId = document.getElementById('srcFileEditId');
-    const existingId = editId ? editId.value : '';
-    const body = {label, path, smb_host:smbHost, smb_user:smbUser};
-    if (existingId) body.id = existingId;
     const r = await fetch('/api/file_sources/save',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)});
     const d = await r.json();
     if (d.error) { stat.style.color='var(--danger)'; stat.textContent=d.error; return; }
-    ['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';el._userEdited=false;}});
-    if (editId) editId.value='';
+    // Reset form
+    ['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw',
+     'srcFileSftpHost','srcFileSftpUser','srcFileSftpPw','srcFileSftpPassphrase','srcFileSftpKeyPath'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';if(el._userEdited!==undefined)el._userEdited=false;}});
+    var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value='22';
+    if (editIdEl) editIdEl.value='';
     const addBtn=document.getElementById('srcFileAddBtn'); if(addBtn) addBtn.textContent=t('m365_fsrc_add_btn','Add');
-    document.getElementById('srcFileSmbFields').style.display='none';
+    srcFileTypeSelect('local');
     stat.style.color='var(--accent)'; stat.textContent='\u2714 '+t('m365_fsrc_saved','Source saved');
     await _loadFileSources();
     srcFileRenderList();
@@ -462,20 +560,28 @@ function srcFileEdit(id) {
   const s = S._fileSources.find(function(x){return x.id===id;});
   if (!s) return;
   const labelEl = document.getElementById('srcFileLabel');
-  const pathEl = document.getElementById('srcFilePath');
-  const hostEl = document.getElementById('srcFileSmbHost');
-  const userEl = document.getElementById('srcFileSmbUser');
-  const pwEl = document.getElementById('srcFileSmbPw');
   const editId = document.getElementById('srcFileEditId');
   if (labelEl) { labelEl.value = s.label||''; labelEl._userEdited = true; }
-  if (pathEl) pathEl.value = s.path||'';
-  if (hostEl) hostEl.value = s.smb_host||'';
-  if (userEl) userEl.value = s.smb_user||'';
-  if (pwEl) pwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
   if (editId) editId.value = id;
-  const isSmb = (s.path||'').startsWith('//') || (s.path||'').startsWith('\\\\');
-  const smbFields = document.getElementById('srcFileSmbFields');
-  if (smbFields) smbFields.style.display = isSmb ? 'flex' : 'none';
+  var sourceType = s.source_type || (((s.path||'').startsWith('//')||(s.path||'').startsWith('\\\\')) ? 'smb' : 'local');
+  srcFileTypeSelect(sourceType);
+  if (sourceType === 'sftp') {
+    var hostEl = document.getElementById('srcFileSftpHost'); if(hostEl) hostEl.value = s.sftp_host||'';
+    var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value = s.sftp_port||22;
+    var userEl = document.getElementById('srcFileSftpUser'); if(userEl) userEl.value = s.sftp_user||'';
+    var pathEl = document.getElementById('srcFileSftpPath'); if(pathEl) pathEl.value = s.path||'/';
+    var authEl = document.getElementById('srcFileSftpAuth'); if(authEl) authEl.value = s.sftp_auth||'password';
+    srcFileSftpAuthSelect(s.sftp_auth||'password');
+    if (s.sftp_key_path) { var kp = document.getElementById('srcFileSftpKeyPath'); if(kp) kp.value=s.sftp_key_path; }
+  } else {
+    var pathEl2 = document.getElementById('srcFilePath'); if(pathEl2) pathEl2.value = s.path||'';
+    var smbHostEl = document.getElementById('srcFileSmbHost'); if(smbHostEl) smbHostEl.value = s.smb_host||'';
+    var smbUserEl = document.getElementById('srcFileSmbUser'); if(smbUserEl) smbUserEl.value = s.smb_user||'';
+    var smbPwEl = document.getElementById('srcFileSmbPw'); if(smbPwEl) smbPwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
+  }
   const btn = document.getElementById('srcFileAddBtn');
   if (btn) btn.textContent = t('m365_fsrc_save_changes','Save changes');
   const stat = document.getElementById('srcFileStatus');
@@ -547,9 +653,7 @@ function _renderFileSources() {
     return;
   }
   list.innerHTML = S._fileSources.map(function(s) {
-    const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
-    const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
-    const userPart = s.smb_user ? ' \u00b7 \uD83D\uDC64 ' + _esc(s.smb_user) : '';
+    const icon = _srcIcon(s);
     const sid = _esc(s.id || '');
     const slabel = _esc(s.label || s.path || '');
     return '<div class="fsrc-row">'
@@ -559,7 +663,7 @@ function _renderFileSources() {
       + '<button class="btn-scan" onclick="fsrcScan(\'' + sid + '\')">&#9654; ' + t('m365_fsrc_scan_btn','Scan') + '</button>'
       + '<button class="btn-del" onclick="fsrcDelete(\'' + sid + '\',\'' + slabel + '\')">' + t('m365_profile_delete','Delete') + '</button>'
       + '</div></div>'
-      + '<div class="fsrc-row-path">' + _esc(s.path || '') + userPart + '</div>'
+      + '<div class="fsrc-row-path">' + _srcSubtitle(s) + '</div>'
       + '</div>';
   }).join('');
 }
@@ -667,6 +771,9 @@ window.getGoogleScanOptions = getGoogleScanOptions;
 window.srcFileRenderList = srcFileRenderList;
 window.srcFileDetectSmb = srcFileDetectSmb;
 window.srcFileAutoName = srcFileAutoName;
+window.srcFileAutoNameSftp = srcFileAutoNameSftp;
+window.srcFileTypeSelect = srcFileTypeSelect;
+window.srcFileSftpAuthSelect = srcFileSftpAuthSelect;
 window.srcFileAdd = srcFileAdd;
 window.srcFileEdit = srcFileEdit;
 window.srcFileDelete = srcFileDelete;


@@ -38,22 +38,56 @@ function invalidateHistoryCache() {
 // ── Load a session into the results grid ──────────────────────────────────────
-async function loadHistorySession(refScanId) {
-  // refScanId: null → latest session, positive int → specific session
-  let resolvedRef = refScanId;
-  if (resolvedRef === null) {
-    const sessions = _sessions !== null ? _sessions : await _fetchSessions();
-    if (!sessions.length) {
-      // No scans in DB — nothing to show
+// Default landing view: every flagged item still awaiting action, across all
+// scans (not just the latest session). Leaves S._historyRefScanId null (live
+// mode) and shows no history banner — this is "now", not a past session.
+async function loadOpenItems() {
+  // Bail if a scan is running — live SSE owns the grid then.
+  if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
+  try {
+    const r = await fetch('/api/db/flagged');
+    const items = await r.json();
+    if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
+    closeHistoryPicker();
+    if (!Array.isArray(items) || items.length === 0) {
+      S._historyRefScanId = null;
+      _setHistoryBanner(false);
       window.loadLastScanSummary?.();
       return;
     }
-    resolvedRef = sessions[0].ref_scan_id;
+    S._historyRefScanId = null;
+    S.flaggedData = items;
+    S.filteredData = [];
+    const grid = document.getElementById('grid');
+    const emptyState = document.getElementById('emptyState');
+    const lastScan = document.getElementById('lastScanSummary');
+    if (emptyState) emptyState.style.display = 'none';
+    if (lastScan) lastScan.style.display = 'none';
+    if (grid) { grid.innerHTML = ''; grid.style.display = 'grid'; }
+    window.renderGrid(items);
+    try { window.markOverdueCards(); } catch(_) {}
+    try { window.loadTrend(); } catch(_) {}
+    _setHistoryBanner(false);
+  } catch(e) {
+    console.error('[history] failed to load open items:', e);
   }
+}
+async function loadHistorySession(refScanId) {
+  // refScanId: null → all open (unreviewed) items across every scan,
+  // positive int → a specific past session
+  if (refScanId === null) return loadOpenItems();
+  const resolvedRef = refScanId;
   try {
     const r = await fetch('/api/db/flagged?ref=' + resolvedRef);
     const items = await r.json();
+    // Bail if a scan started while we were fetching flagged items
+    if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
     closeHistoryPicker();
     if (!Array.isArray(items) || items.length === 0) {
@ -78,6 +112,31 @@ async function loadHistorySession(refScanId) {
try { window.markOverdueCards(); } catch(_) {} try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {} try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(true, resolvedRef); _setHistoryBanner(true, resolvedRef);
// ── Re-scan diff: append items from previous session no longer present ────
const allSessions = _sessions !== null ? _sessions : await _fetchSessions();
const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef);
if (idx !== -1 && idx + 1 < allSessions.length) {
const prevRef = allSessions[idx + 1].ref_scan_id;
try {
const pr = await fetch('/api/db/flagged?ref=' + prevRef);
const prevItems = await pr.json();
if (Array.isArray(prevItems) && prevItems.length) {
const currentIds = new Set(items.map(f => f.id));
const resolved = prevItems.filter(f => !currentIds.has(f.id));
if (resolved.length) {
const divider = document.createElement('div');
divider.className = 'resolved-divider';
divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present');
document.getElementById('grid')?.appendChild(divider);
resolved.forEach(f => { f._resolved = true; window.appendCard(f); });
_setHistoryBanner(true, resolvedRef, resolved.length);
}
}
} catch(e) {
console.warn('[history] diff failed:', e);
}
}
} catch(e) { } catch(e) {
console.error('[history] failed to load session:', e); console.error('[history] failed to load session:', e);
} }
@ -85,7 +144,7 @@ async function loadHistorySession(refScanId) {
// ── Banner ──────────────────────────────────────────────────────────────────── // ── Banner ────────────────────────────────────────────────────────────────────
function _setHistoryBanner(visible, resolvedRef) { function _setHistoryBanner(visible, resolvedRef, resolvedCount) {
const banner = document.getElementById('historyBanner'); const banner = document.getElementById('historyBanner');
const bannerTxt = document.getElementById('historyBannerText'); const bannerTxt = document.getElementById('historyBannerText');
const latestBtn = document.getElementById('historyLatestBtn'); const latestBtn = document.getElementById('historyLatestBtn');
@ -103,6 +162,7 @@ function _setHistoryBanner(visible, resolvedRef) {
label = date + ' ' + time label = date + ' ' + time
+ (srcStr ? ' · ' + srcStr : '') + (srcStr ? ' · ' + srcStr : '')
+ ' · ' + sess.flagged_count + ' ' + t('history_items', 'items'); + ' · ' + sess.flagged_count + ' ' + t('history_items', 'items');
if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved');
} else { } else {
label = S.flaggedData.length + ' ' + t('history_items', 'items'); label = S.flaggedData.length + ' ' + t('history_items', 'items');
} }
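
Reviewer note on the re-scan diff hunk above: the "items no longer present" list is a set difference between the previous session's flagged items and the current session's, keyed on `id`. A minimal sketch of that step in isolation, assuming items are plain objects with an `id` field; `findResolved` is a hypothetical name and does not appear in the diff.

```javascript
// Hypothetical helper (not in the diff): items flagged in the previous
// session that are absent from the current one.
function findResolved(prevItems, currentItems) {
  if (!Array.isArray(prevItems) || !Array.isArray(currentItems)) return [];
  const currentIds = new Set(currentItems.map(f => f.id));
  return prevItems
    .filter(f => !currentIds.has(f.id))
    // Copy each item rather than mutating it, unlike the hunk above,
    // which sets f._resolved directly on the fetched objects.
    .map(f => ({ ...f, _resolved: true }));
}

// Illustrative data only.
const prevSession = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const currSession = [{ id: 'b' }];
const resolvedItems = findResolved(prevSession, currSession);
console.log(resolvedItems.map(f => f.id)); // logs the ids a and c
```

Building the `Set` once keeps the lookup linear in the number of items, which is the same approach the hunk takes with `currentIds`.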


@ -161,10 +161,9 @@ function copyLog() {
document.querySelectorAll('#logPanel .log-line:not(#logLive)').forEach(function(d) { document.querySelectorAll('#logPanel .log-line:not(#logLive)').forEach(function(d) {
lines.push(d.textContent); lines.push(d.textContent);
}); });
navigator.clipboard.writeText(lines.join('\n')).then(function() {
const btn = document.querySelector('.log-copy-btn'); const btn = document.querySelector('.log-copy-btn');
if (btn) { btn.textContent = '✓ Copied'; setTimeout(function(){ btn.textContent = '⎘ Copy'; }, 1500); } // _copyText (viewer.js) handles HTTP contexts where navigator.clipboard is undefined.
}).catch(function() {}); if (btn) window._copyText(lines.join('\n'), btn);
} }
function _restoreLog() { function _restoreLog() {


@ -137,6 +137,26 @@ function _applyProfile(profile) {
if (el) el.value = opts.min_cpr_count; if (el) el.value = opts.min_cpr_count;
} }
if (opts.ocr_lang !== undefined) {
const el = document.getElementById('optOcrLang');
if (el) el.value = opts.ocr_lang;
}
if (opts.cpr_only !== undefined) {
const el = document.getElementById('optCprOnly');
if (el) el.checked = opts.cpr_only;
}
if (opts.scan_emails !== undefined) {
const el = document.getElementById('optScanEmails');
if (el) el.checked = opts.scan_emails;
}
if (opts.scan_phones !== undefined) {
const el = document.getElementById('optScanPhones');
if (el) el.checked = opts.scan_phones;
}
// ── Date filter ─────────────────────────────────────────────────────────── // ── Date filter ───────────────────────────────────────────────────────────
const days = opts.older_than_days; const days = opts.older_than_days;
if (days !== undefined) { if (days !== undefined) {
@ -417,6 +437,10 @@ function _openEditorForProfile(profile) {
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div> <div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div> <div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div> <div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_cpr_only','CPR-only mode')}</span><label class="toggle"><input type="checkbox" id="peOptCprOnly" ${opts.cpr_only ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_ocr_lang','OCR-sprog')}</span><select id="peOptOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px"><option value="dan+eng" ${(opts.ocr_lang||'dan+eng')==='dan+eng'?'selected':''}>dan+eng</option><option value="dan" ${opts.ocr_lang==='dan'?'selected':''}>dan</option><option value="eng" ${opts.ocr_lang==='eng'?'selected':''}>eng</option><option value="dan+eng+deu" ${opts.ocr_lang==='dan+eng+deu'?'selected':''}>dan+eng+deu</option><option value="dan+eng+swe" ${opts.ocr_lang==='dan+eng+swe'?'selected':''}>dan+eng+swe</option><option value="dan+eng+fra" ${opts.ocr_lang==='dan+eng+fra'?'selected':''}>dan+eng+fra</option></select></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_emails','Søg efter e-mailadresser')}</span><label class="toggle"><input type="checkbox" id="peOptEmails" ${opts.scan_emails ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_phones','Søg efter telefonnumre')}</span><label class="toggle"><input type="checkbox" id="peOptPhones" ${opts.scan_phones ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<hr style="border:none;border-top:1px solid var(--pmgmt-divider);margin:2px 0"> <hr style="border:none;border-top:1px solid var(--pmgmt-divider);margin:2px 0">
<div class="pmgmt-opt-row"><span>${t('m365_opt_retention','Opbevaringspolitik')}</span><label class="toggle"><input type="checkbox" id="peOptRetention" ${profile.retention_years ? 'checked' : ''}><span class="toggle-slider"></span></label></div> <div class="pmgmt-opt-row"><span>${t('m365_opt_retention','Opbevaringspolitik')}</span><label class="toggle"><input type="checkbox" id="peOptRetention" ${profile.retention_years ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div style="padding:7px 8px;background:var(--bg);border-radius:6px"> <div style="padding:7px 8px;background:var(--bg);border-radius:6px">
@ -633,6 +657,10 @@ async function _pmgmtSaveFullEdit() {
scan_photos: document.getElementById('peOptPhotos')?.checked ?? false, scan_photos: document.getElementById('peOptPhotos')?.checked ?? false,
skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false, skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false,
min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1, min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1,
ocr_lang: document.getElementById('peOptOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('peOptCprOnly')?.checked ?? false,
scan_emails: document.getElementById('peOptEmails')?.checked ?? false,
scan_phones: document.getElementById('peOptPhones')?.checked ?? false,
}, },
retention_years: document.getElementById('peOptRetention')?.checked ? (parseInt(document.getElementById('peOptRetYears')?.value) || 5) : null, retention_years: document.getElementById('peOptRetention')?.checked ? (parseInt(document.getElementById('peOptRetYears')?.value) || 5) : null,
fiscal_year_end: document.getElementById('peOptRetention')?.checked ? (document.getElementById('peOptFiscalYearEnd')?.value || '') : '', fiscal_year_end: document.getElementById('peOptRetention')?.checked ? (document.getElementById('peOptFiscalYearEnd')?.value || '') : '',


@ -1,4 +1,18 @@
import { S } from './state.js'; import { S } from './state.js';
// Escape untrusted strings (filenames, account/display names, folders) before
// embedding them in innerHTML / title attributes. Scan-derived values can come
// from attacker-controlled content (e.g. a OneDrive file named with markup),
// so every such field must pass through esc() to prevent stored XSS.
function esc(s) {
return String(s == null ? '' : s)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;');
}
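
Reviewer note on `esc()`: a small sketch of what the helper does to a hostile file name before it reaches `innerHTML` or a `title` attribute. The `esc` body is copied from the hunk above; `hostileName` and `cardNameHtml` are hypothetical names used only for illustration.

```javascript
// Same escaping rules as esc() in the diff. "&" is replaced first so the
// entities produced by the later replacements are not escaped a second time.
function esc(s) {
  return String(s == null ? '' : s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical hostile file name, e.g. a file in a scanned OneDrive.
const hostileName = '"><img src=x onerror=alert(1)>.docx';

// Hypothetical helper mirroring how the card name markup is assembled.
function cardNameHtml(fileName) {
  return '<div class="card-name" title="' + esc(fileName) + '">' +
         esc(fileName) + '</div>';
}

// The quote and angle brackets come out as entities, so the name can
// neither close the title attribute nor open a new tag.
console.log(cardNameHtml(hostileName));
```

This covers HTML text and quoted attribute values only. Values interpolated into inline `onclick` JavaScript or URLs would need their own handling, which this sketch does not attempt.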
// ── Cards ───────────────────────────────────────────────────────────────────── // ── Cards ─────────────────────────────────────────────────────────────────────
const SOURCE_BADGES = { const SOURCE_BADGES = {
email: ['📧', 'badge-email', 'Outlook'], email: ['📧', 'badge-email', 'Outlook'],
@ -11,6 +25,31 @@ const SOURCE_BADGES = {
smb: ['🌐', 'badge-smb', 'Network'], smb: ['🌐', 'badge-smb', 'Network'],
}; };
// Build the user/group pill for a card. The group (role) badge is driven by
// user_role alone so it shows even when no display name is available — e.g.
// items from earlier scans saved before account_name was persisted. For those
// the user label is resolved best-effort from the loaded user list (by id or
// email), falling back to an email-style account_id. Returns '' when there is
// neither a label nor a role to show.
function _accountPill(f) {
const roleBadge =
f.user_role === 'student' ? '<span class="role-badge">' + t('role_student', 'Elev') + '</span>' :
f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff', 'Ansat') + '</span>' : '';
let label = f.account_name || '';
if (!label && f.account_id) {
const aid = String(f.account_id);
const u = (S._allUsers || []).find(function(u) {
return u.id === f.account_id ||
(u.email && u.email.toLowerCase() === aid.toLowerCase());
});
if (u) label = u.displayName || '';
else if (aid.includes('@')) label = aid; // an email is already human-readable
}
if (!label && !roleBadge) return '';
const title = label || f.user_role || '';
return '<span class="account-pill" title="' + esc(title) + '">' + roleBadge + (label ? esc(label) : '') + '</span>';
}
function appendCard(f) { function appendCard(f) {
const search = document.getElementById('filterSearch').value.trim().toLowerCase(); const search = document.getElementById('filterSearch').value.trim().toLowerCase();
const srcVal = document.getElementById('filterSource').value; const srcVal = document.getElementById('filterSource').value;
@ -24,7 +63,7 @@ function appendCard(f) {
: '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type); : '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type);
const card = document.createElement('div'); const card = document.createElement('div');
card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : ''); card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + ((f._resolved || f._redacted || f._deleted) ? ' card-resolved' : '');
card.dataset.id = f.id; card.dataset.id = f.id;
card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } }; card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } };
@ -35,32 +74,46 @@ function appendCard(f) {
cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); }; cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); };
card.appendChild(cb); card.appendChild(cb);
const delBtn = window.VIEWER_MODE ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`; const delBtn = (window.VIEWER_MODE || f._resolved || f._redacted || f._deleted) ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const _cloudRedactExts = new Set(['.docx', '.xlsx', '.pdf']);
const _m365Types = new Set(['onedrive', 'sharepoint', 'teams']);
const _fileExt = (f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase();
const _redactable = !window.VIEWER_MODE && !f._resolved && !f._redacted && !f._deleted && f.cpr_count > 0 && (
f.source_type === 'local' ? _redactExts.has(_fileExt) :
_m365Types.has(f.source_type) ? _cloudRedactExts.has(_fileExt) :
f.source_type === 'gdrive' ? _cloudRedactExts.has(_fileExt) :
(f.source_type === 'smb' || f.source_type === 'sftp') ? _redactExts.has(_fileExt) : false
);
const redactBtn = _redactable ? `<button class="card-redact-btn" title="${t('redact_btn','Redact CPR')}" onclick="event.stopPropagation();redactItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">✏</button>` : '';
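
Reviewer note on `_redactable`: the nested ternary above encodes a source-type to extension matrix. A table-driven sketch of the same rules, assuming the extension sets and source types shown in the hunk; `fileExt`, `EXTS_BY_SOURCE` and `isRedactable` are hypothetical names, and `viewerMode` stands in for `window.VIEWER_MODE`.

```javascript
// Extension sets as listed in the hunk above.
const FILE_EXTS = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const CLOUD_EXTS = new Set(['.docx', '.xlsx', '.pdf']);

// Hypothetical lookup table. A Map is used so an unexpected source_type
// such as "constructor" cannot resolve to an inherited object property.
const EXTS_BY_SOURCE = new Map([
  ['local', FILE_EXTS], ['smb', FILE_EXTS], ['sftp', FILE_EXTS],
  ['onedrive', CLOUD_EXTS], ['sharepoint', CLOUD_EXTS],
  ['teams', CLOUD_EXTS], ['gdrive', CLOUD_EXTS],
]);

// Lower-cased extension including the dot, or '' when the name has none.
function fileExt(name) {
  const n = name || '';
  const i = n.lastIndexOf('.');
  return i === -1 ? '' : n.substring(i).toLowerCase();
}

// Hypothetical predicate equivalent to the _redactable expression.
function isRedactable(f, viewerMode) {
  if (viewerMode || f._resolved || f._redacted || f._deleted) return false;
  if (!(f.cpr_count > 0)) return false;
  const allowed = EXTS_BY_SOURCE.get(f.source_type);
  return allowed ? allowed.has(fileExt(f.name)) : false;
}

console.log(isRedactable({ name: 'list.PDF', source_type: 'onedrive', cpr_count: 2 }, false));
console.log(isRedactable({ name: 'notes.txt', source_type: 'gdrive', cpr_count: 1 }, false));
```

Adding a new source type then means adding one table row rather than another ternary branch.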
const acctPill = _accountPill(f);
if (S.isListView) { if (S.isListView) {
card.innerHTML = ` card.innerHTML = `
<div style="font-size:24px; flex-shrink:0">${icon}</div> <div style="font-size:24px; flex-shrink:0">${icon}</div>
<div class="card-info list-info"> <div class="card-info list-info">
<div class="card-name" title="${f.name}">${f.name}</div> <div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}${f.folder ? ' · 📂 ' + f.folder : ''}</div> <div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}${f.folder ? ' · 📂 ' + esc(f.folder) : ''}</div>
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span> ${f.source || ''}${f.account_name ? ' · <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === 'student' ? '<span class="role-badge">' + t('role_student','Elev') + '</span>' : f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff','Ansat') + '</span>' : '') + f.account_name + '</span>' : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div> <div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span> ${esc(f.source || '')}${acctPill ? ' · ' + acctPill : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
</div> </div>
<span class="cpr-badge">${f.cpr_count} CPR</span> <span class="cpr-badge">${f.cpr_count} CPR</span>
${f.email_count > 0 ? '<span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span> ' : ''}
${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''} ${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''} ${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''} ${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f._deleted ? '<span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span> ' : ''}${f._redacted ? '<span class="resolved-badge">✏ ' + t('redact_badge', 'Redacted') + '</span> ' : ''}${f._resolved ? '<span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
${delBtn}`; ${delBtn}${redactBtn}`;
} else { } else {
card.innerHTML = ` card.innerHTML = `
<div class="thumb-wrap"><img src="${src}" alt="${f.name}" loading="lazy"></div> <div class="thumb-wrap"><img src="${src}" alt="${esc(f.name)}" loading="lazy"></div>
<div class="card-info"> <div class="card-info">
<div class="card-name" title="${f.name}">${f.name}</div> <div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div> <div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''} ${f.folder ? `<div class="card-meta" style="font-size:10px" title="${esc(f.folder)}">📂 ${esc(f.folder)}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div> <div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span>${acctPill ? ' ' + acctPill : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''} <span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f._deleted ? ' <span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span>' : ''}${f._redacted ? ' <span class="resolved-badge"> ' + t('redact_badge', 'Redacted') + '</span>' : ''}${f._resolved ? ' <span class="resolved-badge"> ' + t('history_resolved_badge', 'Resolved') + '</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
</div> </div>
${delBtn}`; ${delBtn}${redactBtn}`;
} }
grid.appendChild(card); grid.appendChild(card);
} }
@ -69,6 +122,17 @@ function renderGrid(files) {
const grid = document.getElementById('grid'); const grid = document.getElementById('grid');
grid.innerHTML = ''; grid.innerHTML = '';
files.forEach(f => appendCard(f)); files.forEach(f => appendCard(f));
// Whenever results are rendered, the landing/last-scan cards must be hidden —
// the live scan_file_flagged path shows the grid but does not clear them, so
// results would otherwise appear underneath the still-visible landing page
// until a manual refresh. Centralised here so every render path is covered.
if (files && files.length) {
const es = document.getElementById('emptyState');
if (es) es.style.display = 'none';
const ls = document.getElementById('lastScanSummary');
if (ls) ls.style.display = 'none';
if (grid) grid.style.display = S.isListView ? 'block' : 'grid';
}
_updateBulkBar(); _updateBulkBar();
updateDispositionStats(); updateDispositionStats();
} }
@ -91,22 +155,30 @@ async function openPreview(f) {
panel.classList.remove('hidden'); panel.classList.remove('hidden');
const _savedW = sessionStorage.getItem('gdpr_preview_width'); const _savedW = sessionStorage.getItem('gdpr_preview_width');
if (_savedW) panel.style.width = _savedW + 'px'; if (_savedW) panel.style.width = _savedW + 'px';
// Opening the panel narrows .grid-area and reflows the grid to fewer columns,
// moving the selected card to a new row. Defer the scroll by two frames so it
// runs against the settled layout, and centre the card so it stays visible.
if (cardEl) requestAnimationFrame(() => requestAnimationFrame(() =>
cardEl.scrollIntoView({ behavior: 'smooth', block: 'center' })));
title.textContent = f.name; title.textContent = f.name;
frame.style.display = 'none'; frame.style.display = 'none';
loading.style.display = 'flex'; loading.style.display = 'flex';
loading.textContent = 'Loading preview…'; loading.textContent = 'Loading preview…';
meta.innerHTML = [ meta.innerHTML = [
f.account_name ? `<span style="font-weight:500">👤 ${f.account_name}</span>` : '', f.account_name ? `<span style="font-weight:500">👤 ${esc(f.account_name)}</span>` : '',
f.source ? `<span>${f.source}</span>` : '', f.source ? `<span>${esc(f.source)}</span>` : '',
f.size_kb ? `<span>${f.size_kb} KB</span>` : '', f.size_kb ? `<span>${f.size_kb} KB</span>` : '',
f.modified ? `<span>${f.modified}</span>` : '', f.modified ? `<span>${esc(f.modified)}</span>` : '',
f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '', f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '',
f.email_count ? `<span style="color:#7ec8f0">${f.email_count} ${t('m365_badge_emails','e-mail')}</span>` : '',
f.phone_count ? `<span style="color:#7eeac0">${f.phone_count} ${t('m365_badge_phones','tlf.')}</span>` : '',
f.url ? `<button class="preview-open-btn" onclick="window.open('${f.url}','_blank')">${t("m365_preview_open","Open in M365 ↗")}</button>` : '', f.url ? `<button class="preview-open-btn" onclick="window.open('${f.url}','_blank')">${t("m365_preview_open","Open in M365 ↗")}</button>` : '',
].filter(Boolean).join(''); ].filter(Boolean).join('');
_previewItemId = f.id; _previewItemId = f.id;
loadDisposition(f.id); // load disposition for this item (#6) loadDisposition(f.id);
_loadRelated(f);
try { try {
const r = await fetch('/api/preview/' + encodeURIComponent(f.id) const r = await fetch('/api/preview/' + encodeURIComponent(f.id)
@ -172,6 +244,44 @@ async function openPreview(f) {
} }
} }
// ── Related documents (CPR cross-reference) ───────────────────────────────────
async function _loadRelated(f) {
const el = document.getElementById('previewRelated');
if (!el) return;
if (!f.cpr_count) { el.style.display = 'none'; return; }
const ref = S._historyRefScanId ? `&ref=${S._historyRefScanId}` : '';
try {
const r = await fetch(`/api/db/related/${encodeURIComponent(f.id)}?${ref}`);
const items = await r.json();
if (f.id !== _previewItemId) return; // stale
if (!items.length) { el.style.display = 'none'; return; }
const rows = items.map(item => {
const shared = item.shared_cprs ?? '';
const badge = shared ? `<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--danger);color:#fff;font-weight:500;flex-shrink:0">${shared} CPR</span>` : '';
const src = item.source ? `<span style="color:var(--muted);font-size:10px;flex-shrink:0">${esc(item.source)}</span>` : '';
return `<div onclick="window._openRelated('${item.id.replace(/'/g,"\\'")}',${JSON.stringify(item).replace(/"/g,'&quot;')})"
style="display:flex;align-items:center;gap:6px;padding:4px 0;cursor:pointer;border-radius:4px"
onmouseover="this.style.background='var(--surface)'" onmouseout="this.style.background=''">
<span style="flex:1;font-size:11px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(item.name)}">${esc(item.name)}</span>
${src}${badge}
</div>`;
}).join('');
el.innerHTML = `<div style="font-size:10px;font-weight:600;color:var(--muted);margin-bottom:4px;text-transform:uppercase;letter-spacing:.04em">${t('m365_related_docs','Related documents')} <span style="font-weight:400">(${items.length})</span></div>${rows}`;
el.style.display = 'block';
} catch(e) {
el.style.display = 'none';
}
}
window._openRelated = function(id, itemData) {
const cached = (S.flaggedData || []).find(x => x.id === id);
openPreview(cached || itemData);
};
// ── Retention policy (#1) ──────────────────────────────────────────────────── // ── Retention policy (#1) ────────────────────────────────────────────────────
function toggleRetentionPanel() { function toggleRetentionPanel() {
@ -296,9 +406,9 @@ async function runSubjectLookup() {
_dsubItems = d.items; _dsubItems = d.items;
resultsEl.innerHTML = d.items.map(item => ` resultsEl.innerHTML = d.items.map(item => `
<div class="dsub-result-row"> <div class="dsub-result-row">
<div class="dsub-result-name" title="${item.name}">${item.name}</div> <div class="dsub-result-name" title="${esc(item.name)}">${esc(item.name)}</div>
<div class="dsub-result-meta">${item.source_type || ""}</div> <div class="dsub-result-meta">${esc(item.source_type || "")}</div>
<div class="dsub-result-meta">${item.modified || ""}</div> <div class="dsub-result-meta">${esc(item.modified || "")}</div>
<div class="dsub-result-meta" style="color:var(--danger)">${item.cpr_count} CPR</div> <div class="dsub-result-meta" style="color:var(--danger)">${item.cpr_count} CPR</div>
</div> </div>
`).join(""); `).join("");
@ -326,10 +436,13 @@ async function deleteSubjectItems() {
document.getElementById("dsubDeleteBtn").style.display = "none"; document.getElementById("dsubDeleteBtn").style.display = "none";
document.getElementById("dsubResults").innerHTML = ""; document.getElementById("dsubResults").innerHTML = "";
_dsubItems = []; _dsubItems = [];
// Refresh grid // Keep the deleted items in the grid (marked, greyed, buttons hidden)
S.flaggedData = S.flaggedData.filter(f => !ids.includes(f.id)); // until the next scan run — only those the server actually deleted.
S.filteredData = S.filteredData.filter(f => !ids.includes(f.id)); const deletedSet = new Set(d.deleted_ids || ids);
renderGrid(); const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats(); updateStats();
} catch(e) { } catch(e) {
statusEl.textContent = "Delete failed: " + e.message; statusEl.textContent = "Delete failed: " + e.message;
@ -536,9 +649,13 @@ async function deleteItem(f, cardEl) {
}); });
const d = await r.json(); const d = await r.json();
if (d.ok) { if (d.ok) {
S.flaggedData = S.flaggedData.filter(x => x.id !== f.id); // Keep the deleted item in the grid (marked, greyed, action buttons
S.filteredData = S.filteredData.filter(x => x.id !== f.id); // hidden) until the next scan run, so the operator can see what was
if (cardEl) cardEl.remove(); // handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats(); updateStats();
log(t('m365_log_deleted', 'Deleted:') + ' ' + f.name, 'ok'); log(t('m365_log_deleted', 'Deleted:') + ' ' + f.name, 'ok');
if (_previewItemId === f.id) closePreview(); if (_previewItemId === f.id) closePreview();
@ -550,6 +667,36 @@ async function deleteItem(f, cardEl) {
} }
} }
async function redactItem(f, cardEl) {
if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. This cannot be undone.'))) return;
if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; }
try {
const r = await fetch('/api/redact_item', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({id: f.id, source_type: f.source_type})
});
const d = await r.json();
if (d.ok) {
// Keep the redacted item in the grid (marked, greyed, action buttons
// hidden) until the next scan run, so the operator can see what was
// handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._redacted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok');
if (_previewItemId === f.id) closePreview();
} else {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err');
}
} catch(e) {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err');
}
}
// ── Bulk delete modal ───────────────────────────────────────────────────────── // ── Bulk delete modal ─────────────────────────────────────────────────────────
function openBulkDelete() { function openBulkDelete() {
@ -573,6 +720,7 @@ function _bdFilters() {
function _bdMatches() { function _bdMatches() {
const f = _bdFilters(); const f = _bdFilters();
return S.flaggedData.filter(x => { return S.flaggedData.filter(x => {
if (x._deleted || x._redacted) return false; // already handled this session
if (f.source_type && x.source_type !== f.source_type) return false; if (f.source_type && x.source_type !== f.source_type) return false;
if (x.cpr_count < f.min_cpr) return false; if (x.cpr_count < f.min_cpr) return false;
if (f.older_than_date && x.modified > f.older_than_date) return false; if (f.older_than_date && x.modified > f.older_than_date) return false;
@ -625,25 +773,34 @@ function _ensureSSE() {
function _sseWatchdog() { function _sseWatchdog() {
fetch('/api/scan/status').then(function(r) { return r.json(); }).then(function(status) { fetch('/api/scan/status').then(function(r) { return r.json(); }).then(function(status) {
- if (status.running) {
+ var anyRunning = status.running || status.google_running;
+ if (anyRunning) {
  // A scan is in progress — make sure SSE is connected and progress UI is visible
  _ensureSSE();
- if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
+ if (status.running && !S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
  document.getElementById('scanBtn').disabled = true;
  document.getElementById('stopBtn').style.display = 'inline-block';
- // /api/scan/status checks the M365 lock — if running=true it's an M365 scan
+ // status.running reflects the M365 + file lock; treat as an M365 reconnect
  S._m365ScanRunning = true; _renderProgressSegments();
  document.getElementById('progressFile').textContent = t('m365_sse_reconnecting', 'Reconnecting to running scan…');
  log(t('m365_sse_reconnecting', 'Reconnecting to running scan…'));
  }
+ } else if (!S._historyRefScanId && !(S.flaggedData && S.flaggedData.length)) {
+ // No scan of any kind is running (authoritative, both locks free) and
+ // nothing is shown yet — restore the last saved session from the DB.
+ // Retried on every poll, not one-shot: the initial attempt can be blocked
+ // by running flags that SSE replay of a *completed* scan set but never
+ // cleared, and sse_replay_done only fires for a non-empty buffer (so it
+ // never retries after a server restart clears the replay buffer).
+ // Both locks are confirmed free, so clear any stale flags first.
+ S._m365ScanRunning = false;
+ S._googleScanRunning = false;
+ S._fileScanRunning = false;
+ window.loadHistorySession?.(null);
  }
- if (!_initialStatusChecked) {
  _initialStatusChecked = true;
- if (!status.running) window.loadHistorySession?.(null);
- }
- // When no scan is running, we still keep polling — the SSE connection
- // may have died and we need to detect the *next* scheduled scan.
- // The SSE itself is only opened/reopened when a scan is detected.
+ // Keep polling even when idle — the SSE connection may have died and we
+ // need to detect the next scheduled scan (SSE is only opened on demand).
}).catch(function(err) { }).catch(function(err) {
// Status endpoint unavailable — server might be restarting // Status endpoint unavailable — server might be restarting
console.warn('[SSE] status poll failed:', err); console.warn('[SSE] status poll failed:', err);
@ -778,9 +935,12 @@ async function executeBulkDelete() {
}); });
const d = await r.json(); const d = await r.json();
if (d.ok) { if (d.ok) {
- const deletedSet = new Set(matches.map(x => x.id));
- S.flaggedData = S.flaggedData.filter(x => !deletedSet.has(x.id));
- S.filteredData = S.filteredData.filter(x => !deletedSet.has(x.id));
+ // Keep the deleted items in the grid (marked, greyed, buttons hidden)
+ // until the next scan run — only those the server actually deleted.
+ const deletedSet = new Set(d.deleted_ids || matches.map(x => x.id));
+ const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
+ S.flaggedData.forEach(_mark);
+ S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData); renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats(); updateStats();
prog.innerHTML = `<span style="color:var(--ok,#4c4)">✓ ${d.deleted} ${t('m365_bulk_deleted', 'deleted')}</span>` + prog.innerHTML = `<span style="color:var(--ok,#4c4)">✓ ${d.deleted} ${t('m365_bulk_deleted', 'deleted')}</span>` +
@ -1005,6 +1165,7 @@ window.loadDisposition = loadDisposition;
window.saveDisposition = saveDisposition; window.saveDisposition = saveDisposition;
window.closePreview = closePreview; window.closePreview = closePreview;
window.deleteItem = deleteItem; window.deleteItem = deleteItem;
window.redactItem = redactItem;
window.openBulkDelete = openBulkDelete; window.openBulkDelete = openBulkDelete;
window.closeBulkDelete = closeBulkDelete; window.closeBulkDelete = closeBulkDelete;
window._bdFilters = _bdFilters; window._bdFilters = _bdFilters;
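Review note: the hunks above replace "remove the item from the arrays" with "flag the item in place", and `_bdMatches` then skips flagged items. A minimal standalone sketch of that pattern follows. The helper names `markHandled` and `selectableForBulk` are hypothetical and do not appear in the diff, which inlines the same logic with a local `_mark` closure.

```javascript
// Hypothetical helpers illustrating the mark-in-place pattern from this diff.
// Items stay in the list; a boolean flag records that they were handled.
function markHandled(items, ids, flag) {
  const idSet = new Set(ids);
  for (const item of items) {
    if (idSet.has(item.id)) item[flag] = true; // mutate in place, keep the row
  }
  return items;
}

// Mirrors the new guard in _bdMatches: handled items are no longer candidates.
function selectableForBulk(items) {
  return items.filter((x) => !x._deleted && !x._redacted);
}

const flagged = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
markHandled(flagged, ['a'], '_deleted');
markHandled(flagged, ['b'], '_redacted');
const remaining = selectableForBulk(flagged).map((x) => x.id);
```

Because the rows are never removed, the grid can render them greyed out, while a later bulk delete only sees items that have not been handled yet.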


@ -67,7 +67,7 @@ async function doImportDB() {
} }
if (mode === 'replace') { if (mode === 'replace') {
if (!confirm(t('m365_db_import_replace_confirm', if (!confirm(t('m365_db_import_replace_confirm',
- 'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\n\nProceed?'))) return;
+ 'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\n\nProceed?'))) return;
} }
btn.disabled = true; btn.disabled = true;
stat.style.color = 'var(--muted)'; stat.style.color = 'var(--muted)';
@ -127,6 +127,10 @@ function buildScanPayload() {
scan_photos: document.getElementById('optScanPhotos') ? document.getElementById('optScanPhotos').checked : false, scan_photos: document.getElementById('optScanPhotos') ? document.getElementById('optScanPhotos').checked : false,
skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false, skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false,
min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1, min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1,
ocr_lang: document.getElementById('optOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('optCprOnly') ? document.getElementById('optCprOnly').checked : false,
scan_emails: document.getElementById('optScanEmails') ? document.getElementById('optScanEmails').checked : false,
scan_phones: document.getElementById('optScanPhones') ? document.getElementById('optScanPhones').checked : false,
retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false, retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false,
retention_years: parseInt(document.getElementById('optRetentionYears')?.value) || 5, retention_years: parseInt(document.getElementById('optRetentionYears')?.value) || 5,
fiscal_year_end: document.getElementById('optFiscalYearEnd')?.value || '', fiscal_year_end: document.getElementById('optFiscalYearEnd')?.value || '',
@ -134,26 +138,39 @@ function buildScanPayload() {
return { sources, fileSources, allSources, googleSources, user_ids, options }; return { sources, fileSources, allSources, googleSources, user_ids, options };
} }
- async function checkCheckpoint() {
+ async function checkCheckpoint(onNoCheckpoint) {
  const payload = buildScanPayload();
- if (!payload.sources.length && !payload.fileSources.length) return;
- if (payload.sources.length && !payload.user_ids.length) return;
+ const banner = document.getElementById('resumeBanner');
+ const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
+ if (!hasSources) {
+ if (banner) banner.style.display = 'none';
+ onNoCheckpoint?.(); return;
+ }
+ // M365 sources without users — scan button will handle the alert
+ if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
+ if (banner) banner.style.display = 'none';
+ onNoCheckpoint?.(); return;
+ }
+ // Collect Google user emails for server-side checkpoint key computation
+ const googleUserEmails = payload.googleSources.length > 0
+ ? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
+ : [];
  try {
  const r = await fetch('/api/scan/checkpoint', {
  method: 'POST', headers: {'Content-Type':'application/json'},
- body: JSON.stringify(payload)
+ body: JSON.stringify({...payload, googleUserEmails})
  });
  const d = await r.json();
- const banner = document.getElementById('resumeBanner');
  if (d.exists) {
  const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
  document.getElementById('resumeBannerText').textContent =
  t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
- banner.style.display = 'flex';
+ if (banner) banner.style.display = 'flex';
  } else {
- banner.style.display = 'none';
+ if (banner) banner.style.display = 'none';
+ onNoCheckpoint?.();
  }
- } catch(e) { /* ignore */ }
+ } catch(e) { onNoCheckpoint?.(); }
  }
async function clearCheckpointAndScan() { async function clearCheckpointAndScan() {
@ -171,8 +188,7 @@ async function checkDeltaStatus() {
const row = document.getElementById('deltaStatusRow'); const row = document.getElementById('deltaStatusRow');
const txt = document.getElementById('deltaStatusText'); const txt = document.getElementById('deltaStatusText');
if (d.exists) { if (d.exists) {
- const src = d.count === 1 ? '1 source' : `${d.count} sources`;
- txt.textContent = t('m365_delta_tokens_saved', `Tokens saved for ${src}`);
+ txt.textContent = t('m365_delta_tokens_saved', 'Tokens saved for {n} source(s)').replace('{n}', d.count);
row.style.display = 'flex'; row.style.display = 'flex';
row.style.alignItems = 'center'; row.style.alignItems = 'center';
} else { } else {
@ -467,9 +483,15 @@ function _attachScanListeners(source) {
window.invalidateHistoryCache?.(); window.invalidateHistoryCache?.();
}); });
// sse_replay_done marks end of buffer replay — log a note so the user knows // sse_replay_done marks end of buffer replay — log a note so the user knows
// earlier events above were replayed from an already-running scan // earlier events above were replayed from an already-running scan.
// Also retry loadHistorySession if it bailed during replay: scan_phase events
// from a completed scan's replay temporarily set running flags to true, causing
// the watchdog's loadHistorySession call to bail before scan_done clears them.
source.addEventListener('sse_replay_done', function() { source.addEventListener('sse_replay_done', function() {
log(t('m365_sse_replay_note', 'Live log resumed \u2014 earlier entries replayed from running scan.')); log(t('m365_sse_replay_note', 'Live log resumed \u2014 earlier entries replayed from running scan.'));
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning && !S._historyRefScanId) {
window.loadHistorySession?.(null);
}
}); });
} }
@ -562,6 +584,22 @@ function startScan(resume) {
S._userStartedScan = true; S._userStartedScan = true;
_ensureSSE(); _ensureSSE();
// Revert to idle if every scan type that was supposed to start got rejected.
// Called after each 409 so we don't leave the UI stuck in "running" state
// while the previous scan's thread finishes winding down.
function _onScanConflict(label) {
log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err');
if (label === 'm365') S._m365ScanRunning = false;
if (label === 'file') S._fileScanRunning = false;
if (label === 'google') S._googleScanRunning = false;
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
if (S.es) { S.es.close(); S.es = null; }
S._userStartedScan = false;
}
}
setTimeout(() => { setTimeout(() => {
// Fire M365 scan if any M365 sources are selected // Fire M365 scan if any M365 sources are selected
if (sources.length > 0) { if (sources.length > 0) {
@ -570,7 +608,7 @@ function startScan(resume) {
body: JSON.stringify({sources, user_ids, options, resume: !!resume, body: JSON.stringify({sources, user_ids, options, resume: !!resume,
profile_id: S._activeProfileId || null}) profile_id: S._activeProfileId || null})
}).then(r => { }).then(r => {
- if (r.status === 409) { log('Scan already running', 'err'); }
+ if (r.status === 409) { _onScanConflict('m365'); }
}).catch(e => { log('Scan start failed: ' + e, 'err'); }); }).catch(e => { log('Scan start failed: ' + e, 'err'); });
} }
@ -588,7 +626,13 @@ function startScan(resume) {
scan_photos: options.scan_photos || false, scan_photos: options.scan_photos || false,
skip_gps_images: options.skip_gps_images || false, skip_gps_images: options.skip_gps_images || false,
min_cpr_count: options.min_cpr_count || 1, min_cpr_count: options.min_cpr_count || 1,
scan_emails: options.scan_emails || false,
scan_phones: options.scan_phones || false,
cpr_only: options.cpr_only || false,
ocr_lang: options.ocr_lang || 'dan+eng',
})) }))
}).then(r => {
if (r.status === 409) { _onScanConflict('file'); }
}).catch(e => { log('File scan error: ' + e, 'err'); }); }).catch(e => { log('File scan error: ' + e, 'err'); });
}); });
@ -611,7 +655,7 @@ function startScan(resume) {
options: options options: options
}) })
}).then(r => { }).then(r => {
- if (r.status === 409) { log('Google scan already running', 'err'); }
+ if (r.status === 409) { _onScanConflict('google'); }
}).catch(e => { log('Google scan error: ' + e, 'err'); }); }).catch(e => { log('Google scan error: ' + e, 'err'); });
} }
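Review note: `_onScanConflict` only reverts the UI to idle once every scan type that was started has been rejected. The sketch below restates that rule as a pure function so the state transition can be checked in isolation. The function name `applyScanConflict` and the `{m365, google, file}` state shape are assumptions made for illustration; the diff itself mutates `S._m365ScanRunning`, `S._googleScanRunning` and `S._fileScanRunning` directly.

```javascript
// Hypothetical pure version of the 409 handling in startScan().
// `running` stands in for the three S._*ScanRunning flags.
function applyScanConflict(running, label) {
  const next = { ...running };
  if (label in next) next[label] = false; // the rejected scan type is not running
  // Idle only when no scan type is left running, as in _onScanConflict.
  const idle = !next.m365 && !next.google && !next.file;
  return { running: next, idle };
}

// M365 and file scans were both requested; the server rejects them one by one.
const afterFirst = applyScanConflict({ m365: true, google: false, file: true }, 'm365');
const afterSecond = applyScanConflict(afterFirst.running, 'file');
```

After the first rejection the file scan is still considered running, so the stop button stays visible. Only the second rejection would re-enable the scan button and close the SSE connection.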


@ -18,19 +18,19 @@ function schedLoad() {
var descEl = document.getElementById('schedDesc_' + js.id); var descEl = document.getElementById('schedDesc_' + js.id);
if (!descEl) return; if (!descEl) return;
var j2 = _schedJobs.find(function(x){ return x.id === js.id; }); var j2 = _schedJobs.find(function(x){ return x.id === js.id; });
var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? 'Weekly' : j2.frequency === 'monthly' ? 'Monthly' : 'Daily'); var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j2.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily'));
var timeStr = !j2 ? '' : String(j2.hour||0).padStart(2,'0') + ':' + String(j2.minute||0).padStart(2,'0'); var timeStr = !j2 ? '' : String(j2.hour||0).padStart(2,'0') + ':' + String(j2.minute||0).padStart(2,'0');
var base = freqLabel + ' ' + timeStr; var base = freqLabel + ' ' + timeStr;
var runBtn = document.getElementById('schedRunBtn_' + js.id); var runBtn = document.getElementById('schedRunBtn_' + js.id);
if (js.is_running) { if (js.is_running) {
descEl.textContent = base + ' \u00b7 Running...'; descEl.textContent = base + ' \u00b7 ' + t('m365_sched_running','Running...');
if (runBtn) { runBtn.style.borderColor='#22c55e'; runBtn.style.color='#22c55e'; } if (runBtn) { runBtn.style.borderColor='#22c55e'; runBtn.style.color='#22c55e'; }
} else if (js.next_run) { } else if (js.next_run) {
var dt = new Date(js.next_run); var dt = new Date(js.next_run);
descEl.textContent = base + ' \u00b7 Next: ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'}); descEl.textContent = base + ' \u00b7 ' + t('m365_sched_next','Next') + ': ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'});
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; } if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
} else { } else {
descEl.textContent = base + (js.enabled ? '' : ' \u00b7 Disabled'); descEl.textContent = base + (js.enabled ? '' : ' \u00b7 ' + t('m365_sched_disabled','Disabled'));
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; } if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
} }
}); });
@ -41,20 +41,23 @@ function schedRenderJobs() {
var list = document.getElementById('schedJobList'); var list = document.getElementById('schedJobList');
if (!list) return; if (!list) return;
if (!_schedJobs.length) { if (!_schedJobs.length) {
list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">No scheduled scans yet.</div>'; list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">' + t('m365_sched_no_jobs','No scheduled scans yet.') + '</div>';
return; return;
} }
list.innerHTML = _schedJobs.map(function(j) { list.innerHTML = _schedJobs.map(function(j) {
var sid = _esc(j.id); var sid = _esc(j.id);
var sname = _esc(j.name || 'Unnamed'); var sname = _esc(j.name || 'Unnamed');
var freqLabel = j.frequency === 'weekly' ? 'Weekly' : j.frequency === 'monthly' ? 'Monthly' : 'Daily'; var freqLabel = j.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily');
var timeStr = String(j.hour||0).padStart(2,'0') + ':' + String(j.minute||0).padStart(2,'0'); var timeStr = String(j.hour||0).padStart(2,'0') + ':' + String(j.minute||0).padStart(2,'0');
var desc = freqLabel + ' ' + timeStr; var desc = freqLabel + ' ' + timeStr;
var chk = j.enabled ? ' checked' : ''; var chk = j.enabled ? ' checked' : '';
var roBadge = j.report_only
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#E8F4FD;color:#2980B9;border:1px solid #AED6F1;margin-left:4px">' + t('m365_sched_report_only','Report only') + '</span>'
: '';
return '<div style="display:flex;align-items:center;gap:6px;padding:5px 6px;border:1px solid var(--border);border-radius:6px;background:var(--surface)">' return '<div style="display:flex;align-items:center;gap:6px;padding:5px 6px;border:1px solid var(--border);border-radius:6px;background:var(--surface)">'
+ '<label class="toggle" style="flex:unset;margin:0"><input type="checkbox"'+chk+' onchange="schedToggleEnabled(\''+sid+'\',this.checked)"><span class="toggle-slider"></span></label>' + '<label class="toggle" style="flex:unset;margin:0"><input type="checkbox"'+chk+' onchange="schedToggleEnabled(\''+sid+'\',this.checked)"><span class="toggle-slider"></span></label>'
+ '<div style="flex:1;min-width:0">' + '<div style="flex:1;min-width:0">'
+ '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+'</div>' + '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+roBadge+'</div>'
+ '<div id="schedDesc_'+sid+'" style="font-size:10px;color:var(--muted)">'+desc+'</div>' + '<div id="schedDesc_'+sid+'" style="font-size:10px;color:var(--muted)">'+desc+'</div>'
+ '</div>' + '</div>'
+ '<button onclick="schedRunJob(\''+sid+'\')" id="schedRunBtn_'+sid+'" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer" title="Run now">&#9654;</button>' + '<button onclick="schedRunJob(\''+sid+'\')" id="schedRunBtn_'+sid+'" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer" title="Run now">&#9654;</button>'
@ -89,6 +92,8 @@ function schedAddJob() {
document.getElementById('schedMinute').value = 0; document.getElementById('schedMinute').value = 0;
document.getElementById('schedAutoEmail').checked = false; document.getElementById('schedAutoEmail').checked = false;
document.getElementById('schedAutoRetention').checked = false; document.getElementById('schedAutoRetention').checked = false;
document.getElementById('schedReportOnly').checked = false;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle'); var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_new', 'New scheduled scan'); if (titleEl) titleEl.textContent = t('m365_sched_editor_new', 'New scheduled scan');
schedPopulateProfiles(''); schedPopulateProfiles('');
@ -111,6 +116,8 @@ function schedEditJob(id) {
document.getElementById('schedMinute').value = j.minute != null ? j.minute : 0; document.getElementById('schedMinute').value = j.minute != null ? j.minute : 0;
document.getElementById('schedAutoEmail').checked = !!j.auto_email; document.getElementById('schedAutoEmail').checked = !!j.auto_email;
document.getElementById('schedAutoRetention').checked = !!j.auto_retention; document.getElementById('schedAutoRetention').checked = !!j.auto_retention;
document.getElementById('schedReportOnly').checked = !!j.report_only;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle'); var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_edit', 'Edit scheduled scan'); if (titleEl) titleEl.textContent = t('m365_sched_editor_edit', 'Edit scheduled scan');
schedPopulateProfiles(j.profile_id || ''); schedPopulateProfiles(j.profile_id || '');
@ -123,6 +130,19 @@ function schedCancelEdit() {
document.getElementById('schedJobEditor').style.display = 'none'; document.getElementById('schedJobEditor').style.display = 'none';
} }
function schedToggleReportOnly() {
var ro = !!(document.getElementById('schedReportOnly') || {}).checked;
var profileRow = document.getElementById('schedProfileRow');
var hint = document.getElementById('schedReportOnlyHint');
if (profileRow) profileRow.style.opacity = ro ? '0.4' : '';
if (hint) hint.style.display = ro ? 'block' : 'none';
// Enforce auto_email when switching to report-only
if (ro) {
var ae = document.getElementById('schedAutoEmail');
if (ae) ae.checked = true;
}
}
function schedSaveJob() { function schedSaveJob() {
var name = document.getElementById('schedName').value.trim(); var name = document.getElementById('schedName').value.trim();
if (!name) { if (!name) {
@ -144,6 +164,7 @@ function schedSaveJob() {
profile_id: document.getElementById('schedProfile').value, profile_id: document.getElementById('schedProfile').value,
auto_email: document.getElementById('schedAutoEmail').checked, auto_email: document.getElementById('schedAutoEmail').checked,
auto_retention: document.getElementById('schedAutoRetention').checked, auto_retention: document.getElementById('schedAutoRetention').checked,
report_only: document.getElementById('schedReportOnly').checked,
}; };
var st = document.getElementById('schedSaveStatus'); var st = document.getElementById('schedSaveStatus');
st.style.color = 'var(--muted)'; st.textContent = 'Saving...'; st.style.color = 'var(--muted)'; st.textContent = 'Saving...';
@ -217,7 +238,7 @@ function schedLoadHistory() {
if (!el) return; if (!el) return;
fetch('/api/scheduler/history?limit=10').then(function(r){ return r.json(); }).then(function(d) { fetch('/api/scheduler/history?limit=10').then(function(r){ return r.json(); }).then(function(d) {
var runs = d.runs || []; var runs = d.runs || [];
if (!runs.length) { el.innerHTML = '<em>No scheduled runs yet</em>'; return; } if (!runs.length) { el.innerHTML = '<em>' + t('m365_sched_no_runs','No scheduled runs yet') + '</em>'; return; }
var html = ''; var html = '';
runs.forEach(function(r) { runs.forEach(function(r) {
var ts = r.started_at ? new Date(r.started_at * 1000).toLocaleString() : '-'; var ts = r.started_at ? new Date(r.started_at * 1000).toLocaleString() : '-';
@ -293,15 +314,17 @@ function stLoadSmtp() {
const set = function(id, val) { const el=document.getElementById(id); if(el) el.value=val||''; }; const set = function(id, val) { const el=document.getElementById(id); if(el) el.value=val||''; };
set('st-smtpHost', d.host); set('st-smtpHost', d.host);
set('st-smtpPort', d.port || 587); set('st-smtpPort', d.port || 587);
- set('st-smtpUser', d.user);
+ set('st-smtpUser', d.username);
  set('st-smtpFrom', d.from_addr);
  set('st-smtpTo', Array.isArray(d.recipients) ? d.recipients.join(', ') : (d.recipients||''));
  const tls = document.getElementById('st-smtpTls');
- if (tls) tls.checked = d.starttls !== false;
+ if (tls) tls.checked = d.use_tls !== false;
const pw = document.getElementById('st-smtpPw'); const pw = document.getElementById('st-smtpPw');
if (pw) pw.value = d.has_password ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : ''; if (pw) pw.value = d.has_password ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
const ae = document.getElementById('st-smtpAutoEmail'); const ae = document.getElementById('st-smtpAutoEmail');
if (ae) ae.checked = !!d.auto_email_manual; if (ae) ae.checked = !!d.auto_email_manual;
const ps = document.getElementById('st-smtpPreferSmtp');
if (ps) ps.checked = !!d.prefer_smtp;
}).catch(function(){}); }).catch(function(){});
} }
@ -312,11 +335,15 @@ async function stSmtpSave() {
const body = { const body = {
host: document.getElementById('st-smtpHost').value.trim(), host: document.getElementById('st-smtpHost').value.trim(),
port: parseInt(document.getElementById('st-smtpPort').value) || 587, port: parseInt(document.getElementById('st-smtpPort').value) || 587,
- user: document.getElementById('st-smtpUser').value.trim(),
+ // Backend (routes/email.py) reads these exact keys — `username`/`use_tls`,
+ // not `user`/`starttls`. Sending the wrong keys leaves username empty so
+ // server.login() is skipped and the SMTP server rejects the send.
+ username: document.getElementById('st-smtpUser').value.trim(),
  from_addr: document.getElementById('st-smtpFrom').value.trim(),
  recipients: document.getElementById('st-smtpTo').value.split(/[,;]/).map(function(s){return s.trim();}).filter(Boolean),
- starttls: document.getElementById('st-smtpTls').checked,
+ use_tls: document.getElementById('st-smtpTls').checked,
  auto_email_manual: !!(document.getElementById('st-smtpAutoEmail') || {}).checked,
+ prefer_smtp: !!(document.getElementById('st-smtpPreferSmtp') || {}).checked,
}; };
if (pw !== null) body.password = pw; if (pw !== null) body.password = pw;
st.style.color = 'var(--muted)'; st.textContent = t('m365_smtp_saving','Saving...'); st.style.color = 'var(--muted)'; st.textContent = t('m365_smtp_saving','Saving...');
@ -437,6 +464,7 @@ window.schedSaveJob = schedSaveJob;
window.schedDeleteJob = schedDeleteJob; window.schedDeleteJob = schedDeleteJob;
window.schedRunJob = schedRunJob; window.schedRunJob = schedRunJob;
window.schedToggleFreqRows = schedToggleFreqRows; window.schedToggleFreqRows = schedToggleFreqRows;
window.schedToggleReportOnly = schedToggleReportOnly;
window.schedPopulateProfiles = schedPopulateProfiles; window.schedPopulateProfiles = schedPopulateProfiles;
window.schedLoadHistory = schedLoadHistory; window.schedLoadHistory = schedLoadHistory;
window.schedUpdateSidebarIndicator = schedUpdateSidebarIndicator; window.schedUpdateSidebarIndicator = schedUpdateSidebarIndicator;
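Review note: the SMTP fix in this file comes down to sending the keys the backend reads (`username`, `use_tls`) instead of `user` and `starttls`. The sketch below builds the same request body from a plain object so the key names can be checked without a DOM. The function name `buildSmtpPayload` and the shape of the `form` argument are assumptions for illustration; `stSmtpSave` reads the values from the `st-smtp*` inputs instead.

```javascript
// Hypothetical DOM-free version of the body built in stSmtpSave().
// `form` fields are illustrative stand-ins for the st-smtp* input values.
function buildSmtpPayload(form, password) {
  const body = {
    host: form.host.trim(),
    port: parseInt(form.port, 10) || 587, // fall back to 587 on empty/invalid input
    username: form.user.trim(),           // key read by the backend, not `user`
    from_addr: form.from.trim(),
    recipients: form.to.split(/[,;]/).map((s) => s.trim()).filter(Boolean),
    use_tls: !!form.tls,                  // key read by the backend, not `starttls`
    auto_email_manual: !!form.autoEmail,
    prefer_smtp: !!form.preferSmtp,
  };
  // As in the diff, the password is only sent when the user changed it.
  if (password !== null) body.password = password;
  return body;
}

const payload = buildSmtpPayload(
  { host: ' smtp.example.com ', port: '', user: 'scanner', from: 'gdpr@example.com',
    to: 'a@example.com; b@example.com,', tls: true, autoEmail: false, preferSmtp: true },
  null
);
```

Keeping the payload construction separate from the DOM lookups makes a key mismatch like the one fixed here easier to catch in a unit test.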


@ -62,13 +62,14 @@ function renderSourcesPanel() {
S._pendingGoogleSources = null; S._pendingGoogleSources = null;
} }
- // File sources (local / SMB) — one entry per saved source
+ // File sources (local / SMB / SFTP) — one entry per saved source
if (S._fileSources.length > 0) { if (S._fileSources.length > 0) {
html += '<div style="margin:6px 0 2px;font-size:10px;color:var(--muted);text-transform:uppercase;letter-spacing:.04em">' html += '<div style="margin:6px 0 2px;font-size:10px;color:var(--muted);text-transform:uppercase;letter-spacing:.04em">'
+ '<hr style="border:none;border-top:1px solid var(--border);margin:1px 0 2px">'; + '<hr style="border:none;border-top:1px solid var(--border);margin:1px 0 2px">';
S._fileSources.forEach(function(s) { S._fileSources.forEach(function(s) {
- const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
- const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
+ const isSftp = s.source_type === 'sftp';
+ const isSmb = !isSftp && s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
+ const icon = isSftp ? '\uD83D\uDD12' : (isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1');
const label = s.label || s.path || s.id; const label = s.label || s.path || s.id;
const isChecked = (s.id in checked) ? checked[s.id] : true; const isChecked = (s.id in checked) ? checked[s.id] : true;
html += '<label class="source-check">' html += '<label class="source-check">'
@ -236,17 +237,209 @@ function closeSettings() {
} }
function switchSettingsTab(tab) { function switchSettingsTab(tab) {
- ['general','security','scheduler','email','database'].forEach(function(t) {
+ ['general','security','scheduler','email','database','auditlog','ai'].forEach(function(t) {
var cap = t.charAt(0).toUpperCase() + t.slice(1); var cap = t.charAt(0).toUpperCase() + t.slice(1);
var pane = document.getElementById('stPane' + cap); var pane = document.getElementById('stPane' + cap);
var btn = document.getElementById('stTab' + cap); var btn = document.getElementById('stTab' + cap);
if (pane) pane.classList.toggle('active', t === tab); if (pane) pane.classList.toggle('active', t === tab);
if (btn) btn.classList.toggle('active', t === tab); if (btn) btn.classList.toggle('active', t === tab);
}); });
if (tab === 'general') stLoadUpdateSettings();
if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); if (typeof stLoadInterfacePinStatus === 'function') stLoadInterfacePinStatus(); } if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); if (typeof stLoadInterfacePinStatus === 'function') stLoadInterfacePinStatus(); }
if (tab === 'email') stLoadSmtp(); if (tab === 'email') stLoadSmtp();
if (tab === 'database') stLoadDbStats(); if (tab === 'database') stLoadDbStats();
if (tab === 'scheduler') schedLoad(); if (tab === 'scheduler') schedLoad();
if (tab === 'auditlog') stLoadAuditLog();
if (tab === 'ai') stLoadAiSettings();
}
async function stLoadAuditLog() {
const tbody = document.getElementById('stAuditTableBody');
if (!tbody) return;
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_loading')}</td></tr>`;
try {
const rows = await fetch('/api/audit_log?limit=200').then(r => r.json());
if (!Array.isArray(rows) || !rows.length) {
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_empty')}</td></tr>`;
return;
}
tbody.innerHTML = rows.map(function(r) {
const d = new Date(r.ts * 1000);
const ts = d.toLocaleDateString() + ' ' + d.toLocaleTimeString();
return '<tr style="border-bottom:1px solid var(--border)">'
+ '<td style="padding:4px 8px;white-space:nowrap;color:var(--muted);font-size:11px">' + window._escHtml(ts) + '</td>'
+ '<td style="padding:4px 8px"><span style="font-family:monospace;background:var(--bg);border:1px solid var(--border);border-radius:3px;padding:1px 4px;font-size:11px">' + window._escHtml(r.action) + '</span></td>'
+ '<td style="padding:4px 8px;color:var(--text);font-size:12px">' + window._escHtml(r.detail) + '</td>'
+ '<td style="padding:4px 8px;color:var(--muted);font-size:11px">' + window._escHtml(r.ip) + '</td>'
+ '</tr>';
}).join('');
} catch(e) {
tbody.innerHTML = '<tr><td colspan="4" style="padding:8px;color:var(--danger)">' + window._escHtml(String(e)) + '</td></tr>';
}
}
// ── AI / Claude NER settings ─────────────────────────────────────────────────
async function stLoadAiSettings() {
try {
const cfg = await fetch('/api/settings/claude').then(r => r.json());
const cb = document.getElementById('aiEnabled');
if (cb) cb.checked = !!cfg.enabled;
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = cfg.api_key_set
? t('m365_ai_key_set', 'API key saved')
: t('m365_ai_key_not_set', 'No API key saved');
} catch(e) { /* ignore */ }
}
async function stAiSave() {
const enabled = !!(document.getElementById('aiEnabled') || {}).checked;
const keyVal = (document.getElementById('aiApiKey') || {}).value || '';
const status = document.getElementById('aiStatus');
const payload = { enabled };
if (keyVal) payload.api_key = keyVal;
try {
await fetch('/api/settings/claude', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(payload),
});
if (status) { status.textContent = t('m365_ai_saved', 'Saved'); status.style.color = 'var(--success)'; }
if (keyVal) {
const inp = document.getElementById('aiApiKey');
if (inp) inp.value = '';
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = t('m365_ai_key_set', 'API key saved');
}
setTimeout(function() { if (status) status.textContent = ''; }, 2000);
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stAiTest() {
const status = document.getElementById('aiStatus');
if (status) { status.textContent = t('m365_ai_testing', 'Testing…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/settings/claude/test', { method: 'POST' }).then(r => r.json());
if (status) {
status.textContent = res.ok
? t('m365_ai_test_ok', 'API key valid')
: (t('m365_ai_test_fail', 'Test failed') + ': ' + (res.error || ''));
status.style.color = res.ok ? 'var(--success)' : 'var(--danger)';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
// ── Software updates ─────────────────────────────────────────────────────────
async function stLoadUpdateSettings() {
try {
const cfg = await fetch('/api/update/settings').then(r => r.json());
const grp = document.getElementById('stUpdateGroup');
if (grp) grp.style.display = cfg.supported ? '' : 'none';
const cb = document.getElementById('stAutoUpdate');
if (cb) cb.checked = !!cfg.auto_update;
} catch(e) { /* ignore */ }
}
async function stSaveAutoUpdate() {
const cb = document.getElementById('stAutoUpdate');
try {
await fetch('/api/update/settings', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({ auto_update: !!(cb && cb.checked) }),
});
} catch(e) { /* ignore */ }
}
async function stCheckUpdate() {
const status = document.getElementById('stUpdateStatus');
const commits = document.getElementById('stUpdateCommits');
const applyBtn = document.getElementById('stApplyUpdateBtn');
if (status) { status.textContent = t('m365_update_checking', 'Checking…'); status.style.color = 'var(--muted)'; }
if (commits) commits.style.display = 'none';
if (applyBtn) applyBtn.style.display = 'none';
try {
const res = await fetch('/api/update/check').then(r => r.json());
if (!status) return;
if (res.error) {
status.textContent = t('m365_update_failed', 'Update check failed') + ': ' + res.error;
status.style.color = 'var(--danger)';
} else if (res.up_to_date) {
status.textContent = t('m365_update_uptodate', 'You are running the latest version.') + ' (' + res.current + ')';
status.style.color = 'var(--success)';
} else {
status.textContent = t('m365_update_available', 'Update available') + ': ' + res.current + ' → ' + res.latest;
status.style.color = 'var(--accent)';
if (commits && res.commits && res.commits.length) {
commits.innerHTML = res.commits.map(function(c) { return window._escHtml(c); }).join('<br>');
commits.style.display = '';
}
if (applyBtn) applyBtn.style.display = '';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stApplyUpdate() {
const status = document.getElementById('stUpdateStatus');
const applyBtn = document.getElementById('stApplyUpdateBtn');
const checkBtn = document.getElementById('stCheckUpdateBtn');
if (applyBtn) applyBtn.disabled = true;
if (checkBtn) checkBtn.disabled = true;
if (status) { status.textContent = t('m365_update_installing', 'Installing update — the app will restart…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/update/apply', { method: 'POST' }).then(r => r.json());
if (!res.ok) {
const msg = res.code === 'scan_running'
? t('m365_update_scan_running', 'Cannot update while a scan is running.')
: (res.error || 'Update failed');
if (status) { status.textContent = msg; status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
return;
}
if (!res.updated) { // already up to date
if (status) { status.textContent = t('m365_update_uptodate', 'You are running the latest version.'); status.style.color = 'var(--success)'; }
if (applyBtn) { applyBtn.disabled = false; applyBtn.style.display = 'none'; }
if (checkBtn) checkBtn.disabled = false;
return;
}
_stWaitForRestart();
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
}
}
// Poll until the server has gone down and come back, then reload the page.
function _stWaitForRestart() {
let tries = 0, sawDown = false;
const iv = setInterval(async function() {
tries++;
try {
await fetch('/api/about', { cache: 'no-store' }).then(r => { if (!r.ok) throw new Error(); });
if (sawDown || tries >= 5) { clearInterval(iv); location.reload(); }
} catch(e) {
sawDown = true;
}
if (tries > 90) clearInterval(iv); // give up after ~3 minutes
}, 2000);
}
function stAiToggleKey() {
const inp = document.getElementById('aiApiKey');
const btn = document.getElementById('aiShowKeyBtn');
if (!inp) return;
const show = inp.type === 'password';
inp.type = show ? 'text' : 'password';
if (btn) btn.textContent = show ? t('m365_ai_hide_key', 'Hide') : t('m365_ai_show_key', 'Show');
} }
// ── Window exports (HTML handlers + cross-module calls) ───────────────────── // ── Window exports (HTML handlers + cross-module calls) ─────────────────────
@ -265,5 +458,14 @@ window.confirmPinPrompt = confirmPinPrompt;
window.openSettings = openSettings; window.openSettings = openSettings;
window.closeSettings = closeSettings; window.closeSettings = closeSettings;
window.switchSettingsTab = switchSettingsTab; window.switchSettingsTab = switchSettingsTab;
window.stLoadAuditLog = stLoadAuditLog;
window.stLoadAiSettings = stLoadAiSettings;
window.stAiSave = stAiSave;
window.stAiTest = stAiTest;
window.stAiToggleKey = stAiToggleKey;
window.stLoadUpdateSettings = stLoadUpdateSettings;
window.stSaveAutoUpdate = stSaveAutoUpdate;
window.stCheckUpdate = stCheckUpdate;
window.stApplyUpdate = stApplyUpdate;
window._M365_SOURCES = _M365_SOURCES; window._M365_SOURCES = _M365_SOURCES;
window._pinCallback = _pinCallback; window._pinCallback = _pinCallback;
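
Review note: the restart wait in `_stWaitForRestart` is easier to reason about when the decision is pulled out of the timer. The sketch below restates the same rules as a pure step function. `stepRestartPoll` and the `action` strings are hypothetical names used only for illustration and are not part of this change.

```javascript
// Hypothetical helper (not in the diff): one polling step of the restart wait.
// state   = { tries, sawDown } carried between ticks
// probeOk = true when the /api/about probe succeeded on this tick
function stepRestartPoll(state, probeOk) {
  const tries = state.tries + 1;
  // A failed probe means the server has gone down at least once.
  const sawDown = state.sawDown || !probeOk;
  let action = 'wait';
  if (probeOk && (state.sawDown || tries >= 5)) {
    // Server is back after an outage, or it never visibly went down
    // and five ticks have passed: reload the page.
    action = 'reload';
  } else if (tries > 90) {
    // 90 ticks at 2 s each is roughly three minutes: stop polling.
    action = 'give_up';
  }
  return { tries, sawDown, action };
}
```

The `tries >= 5` branch covers a restart quick enough that no probe ever fails, so the page still reloads after about ten seconds.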


@ -176,7 +176,7 @@ async function loadLastScanSummary() {
try { try {
const r = await fetch('/api/db/stats'); const r = await fetch('/api/db/stats');
const d = await r.json(); const d = await r.json();
if (!d.scan_id || S.flaggedData.length > 0) return; if (!d.scan_id || S.flaggedData.length > 0 || S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
const panel = document.getElementById('lastScanSummary'); const panel = document.getElementById('lastScanSummary');
const empty = document.getElementById('emptyState'); const empty = document.getElementById('emptyState');
if (!panel || !empty) return; if (!panel || !empty) return;
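
Review note: the widened guard in `loadLastScanSummary` can be read as a single predicate. This is a minimal sketch, assuming the state object carries the three running flags shown in the diff; `shouldShowLastScanSummary` is a hypothetical name and does not appear in the change.

```javascript
// Hypothetical predicate (not in the diff): should the last-scan card be shown?
// stats is the /api/db/stats payload, state is the shared S object.
function shouldShowLastScanSummary(stats, state) {
  if (!stats.scan_id) return false;                // no previous scan recorded
  if (state.flaggedData.length > 0) return false;  // results are already on screen
  // Any running scan (M365, Google or file) suppresses the card.
  return !(state._m365ScanRunning || state._googleScanRunning || state._fileScanRunning);
}
```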


@ -2,18 +2,32 @@
// Share button → modal to create, copy, and revoke read-only viewer links. // Share button → modal to create, copy, and revoke read-only viewer links.
import { S } from './state.js'; import { S } from './state.js';
let _shareBaseUrl = null; // cached so Copy buttons can build the URL synchronously
async function _getShareBaseUrl() { async function _getShareBaseUrl() {
// Use the machine's LAN IP so links work for remote users, not just localhost. if (_shareBaseUrl) return _shareBaseUrl;
// The LAN-IP probe exists only to fix links when the operator browses the
// app at localhost — those would be unusable for remote users. Any other
// origin (LAN IP, or a reverse-proxied HTTPS hostname) is already routable,
// and rewriting it to http://<LAN-IP> would bypass the proxy's TLS.
const host = window.location.hostname;
if (window.location.protocol === 'https:' ||
(host !== 'localhost' && host !== '127.0.0.1' && host !== '[::1]')) {
_shareBaseUrl = window.location.origin;
return _shareBaseUrl;
}
try { try {
const r = await fetch('/api/local_ip'); const r = await fetch('/api/local_ip');
if (r.ok) { if (r.ok) {
const d = await r.json(); const d = await r.json();
if (d.ip && d.ip !== '127.0.0.1') { if (d.ip && d.ip !== '127.0.0.1') {
return 'http://' + d.ip + ':' + window.location.port; _shareBaseUrl = 'http://' + d.ip + ':' + window.location.port;
return _shareBaseUrl;
} }
} }
} catch(e) {} } catch(e) {}
return window.location.origin; _shareBaseUrl = window.location.origin;
return _shareBaseUrl;
} }
// ── User autocomplete for Share modal ──────────────────────────────────────── // ── User autocomplete for Share modal ────────────────────────────────────────
@ -124,9 +138,7 @@ function _shareScopeTypeChanged() {
if (type === 'user') _initUserAutocomplete(); if (type === 'user') _initUserAutocomplete();
} }
function openShareModal() { function _resetShareForm() {
document.getElementById('shareBackdrop').classList.add('open');
document.getElementById('shareNewLinkRow').style.display = 'none';
document.getElementById('shareLabel').value = ''; document.getElementById('shareLabel').value = '';
document.getElementById('shareExpiry').value = '30'; document.getElementById('shareExpiry').value = '30';
const scopeType = document.getElementById('shareScopeType'); const scopeType = document.getElementById('shareScopeType');
@ -136,6 +148,13 @@ function openShareModal() {
if (scopeUser) scopeUser.value = ''; if (scopeUser) scopeUser.value = '';
const scopeDrop = document.getElementById('shareScopeUserDropdown'); const scopeDrop = document.getElementById('shareScopeUserDropdown');
if (scopeDrop) scopeDrop.style.display = 'none'; if (scopeDrop) scopeDrop.style.display = 'none';
const vf = document.getElementById('shareValidFrom'); if (vf) vf.value = '';
const vt = document.getElementById('shareValidTo'); if (vt) vt.value = '';
}
function openShareModal() {
document.getElementById('shareBackdrop').classList.add('open');
_resetShareForm();
_renderTokenList(); _renderTokenList();
fetch('/api/viewer/pin').then(function(r){ return r.json(); }).then(function(d) { fetch('/api/viewer/pin').then(function(r){ return r.json(); }).then(function(d) {
const el = document.getElementById('sharePinStatus'); const el = document.getElementById('sharePinStatus');
@ -147,7 +166,7 @@ function closeShareModal() {
document.getElementById('shareBackdrop').classList.remove('open'); document.getElementById('shareBackdrop').classList.remove('open');
} }
async function _renderTokenList() { async function _renderTokenList(highlightToken) {
const list = document.getElementById('shareTokenList'); const list = document.getElementById('shareTokenList');
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('lbl_loading', 'Loading…') + '</div>'; list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('lbl_loading', 'Loading…') + '</div>';
try { try {
@ -180,11 +199,18 @@ async function _renderTokenList() {
const userBadge = userLbl const userBadge = userLbl
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle;max-width:140px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;display:inline-block">' + userLbl + '</span>' ? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle;max-width:140px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;display:inline-block">' + userLbl + '</span>'
: ''; : '';
const dateFrom = tok.scope?.valid_from || '';
const dateTo = tok.scope?.valid_to || '';
const dateBadge = (dateFrom || dateTo)
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:rgba(80,160,80,.25);color:var(--text);margin-left:5px;font-weight:600;vertical-align:middle">' +
(dateFrom || '…') + ' ' + (dateTo || '…') +
'</span>'
: '';
row.innerHTML = row.innerHTML =
'<div style="flex:1;min-width:0">' + '<div style="flex:1;min-width:0">' +
'<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' + '<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' +
(tok.label || '<span style="color:var(--muted);font-style:italic">' + t('share_unlabelled', 'Unlabelled') + '</span>') + (tok.label || '<span style="color:var(--muted);font-style:italic">' + t('share_unlabelled', 'Unlabelled') + '</span>') +
roleBadge + userBadge + roleBadge + userBadge + dateBadge +
'</div>' + '</div>' +
'<div style="font-size:10px;color:var(--muted);margin-top:1px">' + '<div style="font-size:10px;color:var(--muted);margin-top:1px">' +
t('share_expires_prefix', 'Expires:') + ' ' + expires + ' &nbsp;·&nbsp; ' + t('share_last_used', 'Last used:') + ' ' + lastUsed + t('share_expires_prefix', 'Expires:') + ' ' + expires + ' &nbsp;·&nbsp; ' + t('share_last_used', 'Last used:') + ' ' + lastUsed +
@ -195,6 +221,17 @@ async function _renderTokenList() {
'<button title="' + t('share_revoke', 'Revoke') + '" onclick="revokeToken(\'' + tok.token + '\',this.closest(\'div[style]\'))" ' + '<button title="' + t('share_revoke', 'Revoke') + '" onclick="revokeToken(\'' + tok.token + '\',this.closest(\'div[style]\'))" ' +
'style="height:24px;padding:0 8px;background:none;border:1px solid var(--danger);color:var(--danger);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0">' + t('share_revoke', 'Revoke') + '</button>'; 'style="height:24px;padding:0 8px;background:none;border:1px solid var(--danger);color:var(--danger);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0">' + t('share_revoke', 'Revoke') + '</button>';
list.appendChild(row); list.appendChild(row);
// Briefly highlight a freshly created link so it is easy to find and copy.
if (highlightToken && tok.token === highlightToken) {
row.style.transition = 'border-color .3s, background .3s';
row.style.borderColor = 'var(--accent)';
row.style.background = 'rgba(80,160,80,.18)';
setTimeout(function() { row.scrollIntoView({block: 'nearest'}); }, 0);
setTimeout(function() {
row.style.borderColor = 'var(--border)';
row.style.background = 'var(--bg)';
}, 2500);
}
}); });
} catch(e) { } catch(e) {
list.innerHTML = '<div style="font-size:12px;color:var(--danger);padding:4px 0">' + t('share_load_error', 'Failed to load links.') + '</div>'; list.innerHTML = '<div style="font-size:12px;color:var(--danger);padding:4px 0">' + t('share_load_error', 'Failed to load links.') + '</div>';
@ -205,6 +242,8 @@ async function createShareLink() {
const label = document.getElementById('shareLabel').value.trim(); const label = document.getElementById('shareLabel').value.trim();
const expiry = document.getElementById('shareExpiry').value; const expiry = document.getElementById('shareExpiry').value;
const scopeType = document.getElementById('shareScopeType')?.value || ''; const scopeType = document.getElementById('shareScopeType')?.value || '';
const validFrom = document.getElementById('shareValidFrom')?.value || '';
const validTo = document.getElementById('shareValidTo')?.value || '';
const body = {label}; const body = {label};
if (expiry) body.expires_days = parseInt(expiry); if (expiry) body.expires_days = parseInt(expiry);
if (scopeType === 'role') { if (scopeType === 'role') {
@ -223,6 +262,11 @@ async function createShareLink() {
body.scope = { user: [email], display_name: email }; body.scope = { user: [email], display_name: email };
} }
} }
if (validFrom || validTo) {
if (!body.scope) body.scope = {};
if (validFrom) body.scope.valid_from = validFrom;
if (validTo) body.scope.valid_to = validTo;
}
try { try {
const r = await fetch('/api/viewer/tokens', { const r = await fetch('/api/viewer/tokens', {
method: 'POST', headers: {'Content-Type':'application/json'}, method: 'POST', headers: {'Content-Type':'application/json'},
@ -230,48 +274,51 @@ async function createShareLink() {
}); });
if (!r.ok) throw new Error('Server error ' + r.status); if (!r.ok) throw new Error('Server error ' + r.status);
const entry = await r.json(); const entry = await r.json();
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(entry.token); // The new link appears in the active-links list below (each row has its
const urlInput = document.getElementById('shareNewLinkUrl'); // own Copy button) — reset the form and highlight the just-created row
urlInput.value = url; // rather than leaving a stale link preview in the create box.
document.getElementById('shareNewLinkRow').style.display = 'block'; _resetShareForm();
document.getElementById('shareCopyBtn').textContent = t('log_copy', 'Copy'); _renderTokenList(entry.token);
document.getElementById('shareLabel').value = '';
_renderTokenList();
} catch(e) { } catch(e) {
alert(t('share_create_error', 'Failed to create link:') + ' ' + e.message); alert(t('share_create_error', 'Failed to create link:') + ' ' + e.message);
} }
} }
function copyShareLink() {
const url = document.getElementById('shareNewLinkUrl').value;
_copyText(url, document.getElementById('shareCopyBtn'));
}
async function copyTokenLink(token, btn) { async function copyTokenLink(token, btn) {
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(token); const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(token);
_copyText(url, btn); _copyText(url, btn);
} }
function _copyText(text, btn) { function _copyText(text, btn) {
navigator.clipboard.writeText(text).then(() => { const done = () => {
const orig = btn.textContent; const orig = btn.textContent;
btn.textContent = t('share_copied', 'Copied!'); btn.textContent = t('share_copied', 'Copied!');
setTimeout(() => { btn.textContent = orig; }, 1800); setTimeout(() => { btn.textContent = orig; }, 1800);
}).catch(() => { };
// Fallback for HTTP contexts // Fallback for HTTP contexts, where navigator.clipboard is undefined
// (the Clipboard API only exists in secure contexts — HTTPS or localhost).
const fallback = () => {
let ok = false;
try { try {
const ta = document.createElement('textarea'); const ta = document.createElement('textarea');
ta.value = text; ta.value = text;
ta.style.position = 'fixed'; ta.style.opacity = '0'; ta.style.position = 'fixed'; ta.style.opacity = '0';
ta.setAttribute('readonly', '');
document.body.appendChild(ta); document.body.appendChild(ta);
ta.focus();
ta.select(); ta.select();
document.execCommand('copy'); ok = document.execCommand('copy');
document.body.removeChild(ta); document.body.removeChild(ta);
const orig = btn.textContent; } catch(_) { ok = false; }
btn.textContent = t('share_copied', 'Copied!'); if (ok) done();
setTimeout(() => { btn.textContent = orig; }, 1800); // Last resort: show the link in a prompt so it can be copied manually.
} catch(_) {} else prompt(t('share_copy_link_prompt', 'Copy link:'), text);
}); };
if (navigator.clipboard && navigator.clipboard.writeText) {
navigator.clipboard.writeText(text).then(done).catch(fallback);
} else {
fallback();
}
} }
async function revokeToken(token, rowEl) { async function revokeToken(token, rowEl) {
@ -284,12 +331,6 @@ async function revokeToken(token, rowEl) {
if (!list.children.length) { if (!list.children.length) {
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('share_no_links', 'No active links.') + '</div>'; list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('share_no_links', 'No active links.') + '</div>';
} }
// Hide the copy row if the just-revoked token was the last created
const newRow = document.getElementById('shareNewLinkRow');
if (newRow) {
const shownUrl = document.getElementById('shareNewLinkUrl')?.value || '';
if (shownUrl.includes(token)) newRow.style.display = 'none';
}
} catch(e) { } catch(e) {
alert(t('share_revoke_error', 'Failed to revoke:') + ' ' + e.message); alert(t('share_revoke_error', 'Failed to revoke:') + ' ' + e.message);
} }
@ -458,7 +499,7 @@ window._shareScopeTypeChanged = _shareScopeTypeChanged;
window.openShareModal = openShareModal; window.openShareModal = openShareModal;
window.closeShareModal = closeShareModal; window.closeShareModal = closeShareModal;
window.createShareLink = createShareLink; window.createShareLink = createShareLink;
window.copyShareLink = copyShareLink; window._copyText = _copyText;
window.copyTokenLink = copyTokenLink; window.copyTokenLink = copyTokenLink;
window.revokeToken = revokeToken; window.revokeToken = revokeToken;
window.stLoadViewerPinStatus = stLoadViewerPinStatus; window.stLoadViewerPinStatus = stLoadViewerPinStatus;
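
Review note: the origin-selection rule added to `_getShareBaseUrl` can be checked without a browser by separating it from `fetch` and `window`. The sketch below is an illustration under that assumption. `pickShareBaseUrl` is a hypothetical name, `loc` stands in for `window.location`, and `lanIp` stands in for the `ip` field returned by `/api/local_ip`.

```javascript
// Hypothetical pure helper (not in the diff): choose the base URL for share links.
function pickShareBaseUrl(loc, lanIp) {
  const loopbackHosts = ['localhost', '127.0.0.1', '[::1]'];
  // HTTPS or any non-loopback host is already routable for remote users;
  // rewriting it to http://<LAN-IP> would bypass a reverse proxy's TLS.
  if (loc.protocol === 'https:' || !loopbackHosts.includes(loc.hostname)) {
    return loc.origin;
  }
  // Browsing at localhost: prefer the machine's LAN address when one is known.
  if (lanIp && lanIp !== '127.0.0.1') {
    return 'http://' + lanIp + ':' + loc.port;
  }
  // No usable LAN address: fall back to the current origin.
  return loc.origin;
}
```

For example, an operator browsing at `http://localhost:5100` with a hypothetical LAN address `192.168.1.20` would get links starting with `http://192.168.1.20:5100`, while a proxied `https://` origin is returned unchanged.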


@ -197,7 +197,7 @@
.filter-clear:hover { border-color: var(--danger); color: var(--danger); } .filter-clear:hover { border-color: var(--danger); color: var(--danger); }
/* Grid */ /* Grid */
.grid-area { flex: 1; overflow-y: auto; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; } .grid-area { flex: 1; overflow-y: auto; overflow-anchor: none; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
.grid-area::-webkit-scrollbar { width: 4px; } .grid-area::-webkit-scrollbar { width: 4px; }
.grid-area::-webkit-scrollbar-track { background: transparent; } .grid-area::-webkit-scrollbar-track { background: transparent; }
.grid-area::-webkit-scrollbar-thumb { background: var(--border); border-radius: 2px; } .grid-area::-webkit-scrollbar-thumb { background: var(--border); border-radius: 2px; }
@ -234,7 +234,7 @@
.preview-meta { padding: 10px 14px; border-top: 1px solid var(--border); font-size: 11px; color: var(--muted); display: flex; gap: 10px; flex-wrap: wrap; flex-shrink: 0; } .preview-meta { padding: 10px 14px; border-top: 1px solid var(--border); font-size: 11px; color: var(--muted); display: flex; gap: 10px; flex-wrap: wrap; flex-shrink: 0; }
.preview-open-btn { margin-left: auto; background: var(--accent); color: #fff; border: none; border-radius: 5px; padding: 4px 10px; font-size: 11px; cursor: pointer; white-space: nowrap; } .preview-open-btn { margin-left: auto; background: var(--accent); color: #fff; border: none; border-radius: 5px; padding: 4px 10px; font-size: 11px; cursor: pointer; white-space: nowrap; }
.card.selected { outline: 2px solid var(--accent); outline-offset: 2px; } .card.selected { outline: 2px solid var(--accent); outline-offset: 2px; }
.card { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; } .card { position: relative; background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; }
.card:hover { border-color: var(--accent); box-shadow: 0 0 0 1px var(--accent); } .card:hover { border-color: var(--accent); box-shadow: 0 0 0 1px var(--accent); }
.card.list-view { display: flex; align-items: center; gap: 12px; padding: 10px 14px; border-radius: 8px; } .card.list-view { display: flex; align-items: center; gap: 12px; padding: 10px 14px; border-radius: 8px; }
.thumb-wrap { aspect-ratio: 7/9; overflow: hidden; background: var(--bg); } .thumb-wrap { aspect-ratio: 7/9; overflow: hidden; background: var(--bg); }
@ -253,6 +253,9 @@
.card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; } .card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-delete-btn { opacity:1; } .card:hover .card-delete-btn { opacity:1; }
.card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; } .card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; }
.card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-redact-btn { opacity:1; }
.card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; }
/* Per-card checkbox (select mode) */ /* Per-card checkbox (select mode) */
.card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2; .card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2;
@ -358,17 +361,17 @@
.settings-backdrop.open { display:flex; } .settings-backdrop.open { display:flex; }
.settings-modal { .settings-modal {
background:var(--surface); border:1px solid var(--border); background:var(--surface); border:1px solid var(--border);
border-radius:10px; width:min(540px,96vw); border-radius:10px; width:min(720px,96vw);
display:flex; flex-direction:column; overflow:hidden; display:flex; flex-direction:column; overflow:hidden;
font-size:12px; color:var(--text); font-size:12px; color:var(--text);
} }
.settings-header { padding:16px 20px 0; display:flex; align-items:center; justify-content:space-between; } .settings-header { padding:16px 20px 0; display:flex; align-items:center; justify-content:space-between; }
.settings-header h2 { font-size:14px; font-weight:700; margin:0; } .settings-header h2 { font-size:14px; font-weight:700; margin:0; }
.settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; } .settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; flex-wrap:wrap; }
.settings-tab { .settings-tab {
height:36px; padding:0 14px; font-size:12px; cursor:pointer; border:none; height:36px; padding:0 14px; font-size:12px; cursor:pointer; border:none;
background:none; color:var(--muted); border-bottom:2px solid transparent; background:none; color:var(--muted); border-bottom:2px solid transparent;
margin-bottom:-1px; font-weight:500; margin-bottom:-1px; font-weight:500; white-space:nowrap;
} }
.settings-tab.active { color:var(--accent); border-bottom-color:var(--accent); font-weight:600; } .settings-tab.active { color:var(--accent); border-bottom-color:var(--accent); font-weight:600; }
.settings-body { padding:16px 20px; overflow-y:auto; max-height:65vh; display:flex; flex-direction:column; gap:14px; } .settings-body { padding:16px 20px; overflow-y:auto; max-height:65vh; display:flex; flex-direction:column; gap:14px; }
@ -491,6 +494,18 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px; .overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; } background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; } [data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; }
.card-resolved { opacity: 0.6; }
.resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px;
color: var(--muted); border-top: 1px dashed var(--border); text-align: center; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }
.phone-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a4030; color: #7eeac0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .phone-badge { background: #d0f5ea; color: #005a3a; }
.badge-email { background: rgba(139,68,173,.2); color: #b87fd8; } .badge-email { background: rgba(139,68,173,.2); color: #b87fd8; }
.badge-onedrive { background: rgba(0,120,212,.2); color: #5ba4e8; } .badge-onedrive { background: rgba(0,120,212,.2); color: #5ba4e8; }
.badge-sharepoint { background: rgba(0,160,100,.2); color: #2ecc71; } .badge-sharepoint { background: rgba(0,160,100,.2); color: #2ecc71; }


@ -110,6 +110,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div id="deltaStatusRow" style="display:none;font-size:10px;padding:3px 0 2px;color:var(--muted)"> <div id="deltaStatusRow" style="display:none;font-size:10px;padding:3px 0 2px;color:var(--muted)">
<span id="deltaStatusText"></span> <span id="deltaStatusText"></span>
<button onclick="clearDeltaTokens()" style="background:none;border:none;color:var(--danger);font-size:10px;cursor:pointer;padding:0 0 0 6px" data-i18n="m365_delta_clear">Clear tokens</button> <button onclick="clearDeltaTokens()" style="background:none;border:none;color:var(--danger);font-size:10px;cursor:pointer;padding:0 0 0 6px" data-i18n="m365_delta_clear">Clear tokens</button>
<span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_delta_tokens_hint">Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.</span></span>
</div> </div>
<!-- Photo / biometric scan (#9) --> <!-- Photo / biometric scan (#9) -->
@ -137,6 +138,45 @@ document.addEventListener('DOMContentLoaded', applyI18n);
style="width:46px;padding:3px 6px;font-size:11px;text-align:right"> style="width:46px;padding:3px 6px;font-size:11px;text-align:right">
</div> </div>
<!-- OCR language -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_ocr_lang">OCR language</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_ocr_lang_hint">Tesseract language pack(s) used when scanning scanned PDFs and images. Must match installed language packs.</span></span>
</span>
<select id="optOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px">
<option value="dan+eng">dan+eng</option>
<option value="dan">dan</option>
<option value="eng">eng</option>
<option value="dan+eng+deu">dan+eng+deu</option>
<option value="dan+eng+swe">dan+eng+swe</option>
<option value="dan+eng+fra">dan+eng+fra</option>
</select>
</div>
<!-- CPR-only mode -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_cpr_only">CPR-only mode</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_cpr_only_hint">Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, faces, or EXIF metadata are ignored.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optCprOnly"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for email addresses -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_emails">Scan for email addresses</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_emails_hint">Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanEmails"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for phone numbers -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_phones">Scan for phone numbers</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_phones_hint">Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanPhones"><span class="toggle-slider"></span></label>
</div>
<!-- Retention policy (suggestion #1) --> <!-- Retention policy (suggestion #1) -->
<div class="toggle-row"> <div class="toggle-row">
<span class="toggle-label" style="flex:1"> <span class="toggle-label" style="flex:1">
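Note on the hunk above: the `m365_opt_scan_phones_hint` text describes the target as Danish phone numbers of 8 digits. The detector itself is not part of this diff. As a rough sketch only, assuming an optional `+45` prefix and optional single spaces between digit pairs (both assumptions, as are the names `DK_PHONE` and `findDanishPhones`), such a matcher might look like this:

```javascript
// Hypothetical sketch: the real detector is not shown in this diff.
// Matches 8 digits, optionally grouped in pairs by single spaces and
// optionally prefixed with +45. Lookarounds reject longer digit runs.
const DK_PHONE = /(?<!\d)(?:\+45[ ]?)?(\d{2}(?:[ ]?\d{2}){3})(?!\d)/g;

function findDanishPhones(text) {
  const found = [];
  for (const match of text.matchAll(DK_PHONE)) {
    // Normalise "12 34 56 78" to "12345678".
    found.push(match[1].replace(/ /g, ""));
  }
  return found;
}
```

The digit-boundary lookarounds are what keep a 10-digit run (for example an unhyphenated CPR number) from being reported as a phone number.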
@ -286,7 +326,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<!-- Topbar --> <!-- Topbar -->
<div class="topbar"> <div class="topbar">
<span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span> <span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
<button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button> <button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
<button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button> <button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
<!-- Profile selector (15c) --> <!-- Profile selector (15c) -->
@ -335,7 +375,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button id="historyPickerBtn" type="button" onclick="openHistoryPicker()" style="height:24px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:4px;font-size:11px;cursor:pointer" data-i18n="history_btn_sessions">Sessions</button> <button id="historyPickerBtn" type="button" onclick="openHistoryPicker()" style="height:24px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:4px;font-size:11px;cursor:pointer" data-i18n="history_btn_sessions">Sessions</button>
<div id="historyDropdown" style="display:none;position:absolute;right:0;top:calc(100% + 4px);background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;width:300px;max-height:260px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.25)"></div> <div id="historyDropdown" style="display:none;position:absolute;right:0;top:calc(100% + 4px);background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;width:300px;max-height:260px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.25)"></div>
</div> </div>
<button id="historyLatestBtn" type="button" onclick="loadHistorySession(null)" style="display:none;height:24px;padding:0 10px;background:none;border:1px solid var(--accent);color:var(--accent);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="history_btn_latest">Latest scan</button> <button id="historyLatestBtn" type="button" onclick="loadHistorySession(null)" style="display:none;height:24px;padding:0 10px;background:none;border:1px solid var(--accent);color:var(--accent);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="history_btn_latest">Open items</button>
</div> </div>
<!-- Filter bar — full width, above grid + preview --> <!-- Filter bar — full width, above grid + preview -->
@ -462,6 +502,8 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<iframe id="previewFrame" sandbox="allow-scripts allow-same-origin allow-forms allow-popups" style="display:none"></iframe> <iframe id="previewFrame" sandbox="allow-scripts allow-same-origin allow-forms allow-popups" style="display:none"></iframe>
</div> </div>
<div class="preview-meta" id="previewMeta"></div> <div class="preview-meta" id="previewMeta"></div>
<!-- Related documents -->
<div id="previewRelated" style="display:none;padding:8px 14px 4px;border-top:1px solid var(--border)"></div>
<!-- Disposition widget (#6) --> <!-- Disposition widget (#6) -->
<div class="disposition-row" id="dispositionRow" style="display:none"> <div class="disposition-row" id="dispositionRow" style="display:none">
<span class="disposition-label" data-i18n="m365_disposition_label">Disposition</span> <span class="disposition-label" data-i18n="m365_disposition_label">Disposition</span>
@ -574,6 +616,8 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button class="settings-tab" id="stTabScheduler" onclick="switchSettingsTab('scheduler')" data-i18n="m365_settings_tab_scheduler">Scheduler</button> <button class="settings-tab" id="stTabScheduler" onclick="switchSettingsTab('scheduler')" data-i18n="m365_settings_tab_scheduler">Scheduler</button>
<button class="settings-tab" id="stTabEmail" onclick="switchSettingsTab('email')" data-i18n="m365_settings_tab_email">Email report</button> <button class="settings-tab" id="stTabEmail" onclick="switchSettingsTab('email')" data-i18n="m365_settings_tab_email">Email report</button>
<button class="settings-tab" id="stTabDatabase" onclick="switchSettingsTab('database')" data-i18n="m365_settings_tab_database">Database</button> <button class="settings-tab" id="stTabDatabase" onclick="switchSettingsTab('database')" data-i18n="m365_settings_tab_database">Database</button>
<button class="settings-tab" id="stTabAuditlog" onclick="switchSettingsTab('auditlog')" data-i18n="m365_settings_tab_auditlog">Audit Log</button>
<button class="settings-tab" id="stTabAi" onclick="switchSettingsTab('ai')" data-i18n="m365_settings_tab_ai">AI / NER</button>
</div> </div>
<div class="settings-body"> <div class="settings-body">
@ -598,6 +642,19 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="settings-about-row"><span>Requests</span><span id="st-about-requests" style="color:var(--muted)"></span></div> <div class="settings-about-row"><span>Requests</span><span id="st-about-requests" style="color:var(--muted)"></span></div>
<div class="settings-about-row"><span>openpyxl</span><span id="st-about-openpyxl" style="color:var(--muted)"></span></div> <div class="settings-about-row"><span>openpyxl</span><span id="st-about-openpyxl" style="color:var(--muted)"></span></div>
</div> </div>
<div class="settings-group" id="stUpdateGroup" style="display:none">
<div class="settings-group-title" data-i18n="m365_settings_updates">Software update</div>
<div id="stUpdateStatus" style="font-size:11px;color:var(--muted);margin-bottom:8px" data-i18n="m365_update_idle">Check whether a newer version is available.</div>
<div id="stUpdateCommits" style="display:none;font-size:11px;color:var(--muted);font-family:monospace;line-height:1.6;background:var(--bg);border:1px solid var(--border);border-radius:6px;padding:6px 10px;margin-bottom:8px;max-height:120px;overflow-y:auto"></div>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:10px">
<label class="toggle" style="flex:unset"><input type="checkbox" id="stAutoUpdate" onchange="stSaveAutoUpdate()"><span class="toggle-slider"></span></label>
<span style="font-size:12px" data-i18n="m365_update_auto">Install updates automatically (checked daily — the app restarts itself)</span>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px">
<button type="button" onclick="stCheckUpdate()" id="stCheckUpdateBtn" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="m365_update_check">Check for updates</button>
<button type="button" onclick="stApplyUpdate()" id="stApplyUpdateBtn" style="display:none;height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_update_install">Install update</button>
</div>
</div>
</div> </div>
<!-- ── Security pane ─────────────────────────────────────────────────── --> <!-- ── Security pane ─────────────────────────────────────────────────── -->
@ -715,12 +772,19 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<input id="schedMinute" type="number" min="0" max="59" value="0" style="width:50px"> <input id="schedMinute" type="number" min="0" max="59" value="0" style="width:50px">
</div> </div>
</div> </div>
<div class="settings-row"> <div class="settings-row" id="schedProfileRow">
<label data-i18n="m365_sched_profile">Profile</label> <label data-i18n="m365_sched_profile">Profile</label>
<select id="schedProfile" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:5px;background:var(--surface);color:var(--text);font-size:12px;box-sizing:border-box"> <select id="schedProfile" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:5px;background:var(--surface);color:var(--text);font-size:12px;box-sizing:border-box">
<option value="" data-i18n="m365_sched_profile_last">Last saved settings</option> <option value="" data-i18n="m365_sched_profile_last">Last saved settings</option>
</select> </select>
</div> </div>
<div class="settings-row">
<label data-i18n="m365_sched_report_only">Report only</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedReportOnly" onchange="schedToggleReportOnly()"><span class="toggle-slider"></span></label>
</div>
<div class="settings-row" id="schedReportOnlyHint" style="display:none">
<span style="font-size:10px;color:var(--muted);line-height:1.4" data-i18n="m365_sched_report_only_hint">Email the latest scan results without running a new scan. Requires scan results in the database.</span>
</div>
<div class="settings-row"> <div class="settings-row">
<label data-i18n="m365_sched_auto_email">Email report automatically</label> <label data-i18n="m365_sched_auto_email">Email report automatically</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedAutoEmail"><span class="toggle-slider"></span></label> <label class="toggle" style="flex:unset"><input type="checkbox" id="schedAutoEmail"><span class="toggle-slider"></span></label>
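Note on the hunk above: the new `schedReportOnly` toggle calls `schedToggleReportOnly()`, whose body is not in this diff. The sketch below is one plausible shape for it, not the actual handler. It assumes the hint row is shown while the toggle is on and that the newly added `schedProfileRow` id exists so the profile row can be hidden when no scan will run. The helper names `reportOnlyRows` and `applyReportOnly` are hypothetical.

```javascript
// Pure helper: maps the checkbox state to row visibility, keyed by element id.
function reportOnlyRows(reportOnly) {
  return {
    schedReportOnlyHint: reportOnly, // hint only matters while the toggle is on
    schedProfileRow: !reportOnly,    // assumption: no scan runs, so no profile is needed
  };
}

// Applies the visibility map to any object exposing getElementById
// (in the browser this would be `document`).
function applyReportOnly(doc, reportOnly) {
  for (const [id, visible] of Object.entries(reportOnlyRows(reportOnly))) {
    const el = doc.getElementById(id);
    if (el) el.style.display = visible ? "" : "none";
  }
}
```

Keeping the state mapping separate from the DOM update makes the rule testable without a browser.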
@ -781,6 +845,10 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<label data-i18n="m365_smtp_auto_email_manual">Email report after manual scan</label> <label data-i18n="m365_smtp_auto_email_manual">Email report after manual scan</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpAutoEmail"><span class="toggle-slider"></span></label> <label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpAutoEmail"><span class="toggle-slider"></span></label>
</div> </div>
<div class="settings-row">
<label data-i18n="m365_smtp_prefer_smtp">Always send via SMTP (skip Microsoft Graph)</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpPreferSmtp"><span class="toggle-slider"></span></label>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px"> <div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px">
<div id="st-smtpStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div> <div id="st-smtpStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div>
<button onclick="stSmtpSave()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_save">Save</button> <button onclick="stSmtpSave()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_save">Save</button>
@ -808,6 +876,56 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div> </div>
</div> </div>
<!-- ── Audit Log pane ─────────────────────────────────────────────────── -->
<div class="settings-pane" id="stPaneAuditlog">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_audit_title">Compliance Audit Log</div>
<div style="overflow-x:auto">
<table id="stAuditTable" style="width:100%;border-collapse:collapse;font-size:12px">
<thead>
<tr style="text-align:left">
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_time">Time</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_action">Action</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_detail">Detail</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_ip">IP</th>
</tr>
</thead>
<tbody id="stAuditTableBody">
<tr><td colspan="4" style="padding:8px;color:var(--muted)" data-i18n="m365_audit_loading">Loading…</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="settings-pane" id="stPaneAi">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_ai_title">AI-Enhanced NER</div>
<p style="margin:0 0 12px;font-size:12px;color:var(--muted)" data-i18n="m365_ai_desc">Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.</p>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:14px">
<label class="toggle" style="flex-shrink:0">
<input type="checkbox" id="aiEnabled">
<span class="toggle-track"></span>
</label>
<span style="font-size:13px" data-i18n="m365_ai_enable">Enable Claude NER</span>
</div>
<div style="margin-bottom:12px">
<label style="font-size:12px;color:var(--muted);display:block;margin-bottom:4px" data-i18n="m365_ai_api_key_label">Anthropic API key</label>
<div style="display:flex;gap:6px">
<input type="password" id="aiApiKey" placeholder="sk-ant-…" autocomplete="off" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:6px;background:var(--bg);color:var(--text);font-size:12px;box-sizing:border-box">
<button type="button" onclick="stAiToggleKey()" id="aiShowKeyBtn" style="height:26px;padding:0 10px;border:1px solid var(--border);background:none;color:var(--muted);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_show_key">Show</button>
</div>
<span id="aiKeyStatus" style="font-size:11px;color:var(--muted);margin-top:4px;display:block"></span>
</div>
<div style="display:flex;gap:8px;align-items:center;flex-wrap:wrap">
<button type="button" onclick="stAiSave()" style="height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_save">Save</button>
<button type="button" onclick="stAiTest()" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_test">Test key</button>
<span id="aiStatus" style="font-size:12px"></span>
</div>
<p style="margin:14px 0 0;font-size:11px;color:var(--muted)" data-i18n="m365_ai_model_note">Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.</p>
</div>
</div>
</div><!-- /.settings-body --> </div><!-- /.settings-body -->
<div class="settings-footer"> <div class="settings-footer">
<button onclick="closeSettings()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_close">Close</button> <button onclick="closeSettings()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_close">Close</button>
@ -958,6 +1076,16 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<input id="shareScopeUser" type="text" autocomplete="off" data-i18n-placeholder="share_scope_user_placeholder" placeholder="alice@school.dk" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)"> <input id="shareScopeUser" type="text" autocomplete="off" data-i18n-placeholder="share_scope_user_placeholder" placeholder="alice@school.dk" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<div id="shareScopeUserDropdown" style="display:none;position:absolute;top:100%;left:0;right:0;margin-top:2px;background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;max-height:220px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.3)"></div> <div id="shareScopeUserDropdown" style="display:none;position:absolute;top:100%;left:0;right:0;margin-top:2px;background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;max-height:220px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.3)"></div>
</div> </div>
<div style="display:flex;gap:6px;flex:1.5;min-width:200px">
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_from">Items from</div>
<input id="shareValidFrom" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_to">Items until</div>
<input id="shareValidTo" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
</div>
<div style="width:100px"> <div style="width:100px">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_expires_in">Expires in</div> <div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_expires_in">Expires in</div>
<select id="shareExpiry" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)"> <select id="shareExpiry" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
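Note on the hunk above: `shareValidFrom` and `shareValidTo` are `<input type="date">` fields labelled "Items from" and "Items until". How the server applies them is not shown in this diff. The sketch below assumes inclusive bounds and treats an empty field as an open bound; the function names `isValidWindow` and `inShareWindow` are hypothetical.

```javascript
// <input type="date"> reports its value as "YYYY-MM-DD", or "" when empty.
// Strings in that format sort in date order, so plain comparison is enough.

// True when the two bounds do not contradict each other.
function isValidWindow(validFrom, validTo) {
  return !validFrom || !validTo || validFrom <= validTo;
}

// True when an item dated itemDate ("YYYY-MM-DD") falls inside the window.
// Assumption: both bounds are inclusive and an empty bound means "no limit".
function inShareWindow(itemDate, validFrom, validTo) {
  if (validFrom && itemDate < validFrom) return false;
  if (validTo && itemDate > validTo) return false;
  return true;
}
```

A check like `isValidWindow` would typically run before `createShareLink()` sends the request, so a reversed range is caught in the form.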
@ -970,13 +1098,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div> </div>
<button onclick="createShareLink()" style="height:30px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:5px;font-size:12px;cursor:pointer;flex-shrink:0" data-i18n="share_create">Create</button> <button onclick="createShareLink()" style="height:30px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:5px;font-size:12px;cursor:pointer;flex-shrink:0" data-i18n="share_create">Create</button>
</div> </div>
<div id="shareNewLinkRow" style="display:none;margin-top:10px">
<div style="font-size:11px;color:var(--muted);margin-bottom:4px" data-i18n="share_copy_link_prompt">Copy link:</div>
<div style="display:flex;gap:6px;align-items:center">
<input id="shareNewLinkUrl" type="text" readonly style="flex:1;font-size:11px;padding:5px 8px;background:var(--bg2,var(--bg));border:1px solid var(--border);border-radius:5px;color:var(--text);min-width:0">
<button onclick="copyShareLink()" id="shareCopyBtn" style="height:26px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:5px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="log_copy">Copy</button>
</div>
</div>
</div> </div>
<!-- Existing tokens --> <!-- Existing tokens -->
@ -1219,30 +1340,93 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="srcmgmt-group"> <div class="srcmgmt-group">
<div class="srcmgmt-group-title" data-i18n="m365_file_sources_add">Add source</div> <div class="srcmgmt-group-title" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form" style="border-color:var(--border)"> <div class="fsrc-form" style="border-color:var(--border)">
<!-- Source type selector -->
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label>Name <span style="color:var(--accent)">*</span></label> <label>Type</label>
<input id="srcFileLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off"> <div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcTypeLocal" onclick="srcFileTypeSelect('local')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_type_local">Local folder</button>
<button type="button" id="srcTypeSmb" onclick="srcFileTypeSelect('smb')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_smb">Network (SMB)</button>
<button type="button" id="srcTypeSftp" onclick="srcFileTypeSelect('sftp')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_sftp">SFTP</button>
</div> </div>
</div>
<input type="hidden" id="srcFileSourceType" value="local">
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="srcFileLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<!-- Local / SMB path field -->
<div id="srcFilePathRow" class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label> <label data-i18n="m365_fsrc_path">Path</label>
<input id="srcFilePath" type="text" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()"> <input id="srcFilePath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()">
</div> </div>
<div id="srcFileSmbFields" style="display:none;flex-direction:column;gap:6px"> <div id="srcFileSmbFields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent)" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div> <div style="font-size:10px;color:var(--accent)" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label> <label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="srcFileSmbHost" type="text" placeholder="nas.school.dk"> <input id="srcFileSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
</div> </div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label> <label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="srcFileSmbUser" type="text" placeholder="DOMAIN\\username"> <input id="srcFileSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_placeholder" placeholder="DOMAIN\\username">
</div> </div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label> <label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="srcFileSmbPw" type="password" placeholder="Stored in OS keychain"> <input id="srcFileSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div> </div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Saved to OS keychain — never stored in a file.</div> <div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Saved to OS keychain — never stored in a file.</div>
</div> </div>
<!-- SFTP fields -->
<div id="srcFileSftpFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_host">SFTP host</label>
<input id="srcFileSftpHost" type="text" data-i18n-placeholder="m365_fsrc_sftp_host_placeholder" placeholder="sftp.school.dk" oninput="srcFileAutoNameSftp()">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_port">Port</label>
<input id="srcFileSftpPort" type="number" value="22" min="1" max="65535" style="width:70px">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_user">Username</label>
<input id="srcFileSftpUser" type="text" data-i18n-placeholder="m365_fsrc_sftp_user_placeholder" placeholder="backup_user">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_remote_path">Remote path</label>
<input id="srcFileSftpPath" type="text" data-i18n-placeholder="m365_fsrc_sftp_path_placeholder" placeholder="/var/data" value="/">
</div>
<!-- Auth type toggle -->
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_auth">Auth</label>
<div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcSftpAuthPw" onclick="srcFileSftpAuthSelect('password')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_sftp_auth_password">Password</button>
<button type="button" id="srcSftpAuthKey" onclick="srcFileSftpAuthSelect('key')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_sftp_auth_key">SSH key</button>
</div>
</div>
<input type="hidden" id="srcFileSftpAuth" value="password">
<!-- Password auth -->
<div id="srcSftpPwFields">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_pw">Password</label>
<input id="srcFileSftpPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div>
<!-- Key auth -->
<div id="srcSftpKeyFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_key_upload">Private key</label>
<div style="display:flex;gap:6px;align-items:center">
<input id="srcFileSftpKeyFile" type="file" accept=".pem,.key,.pub,*" style="flex:1;font-size:11px">
<span id="srcFileSftpKeyStatus" style="font-size:10px;color:var(--muted)"></span>
</div>
</div>
<input type="hidden" id="srcFileSftpKeyPath" value="">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_passphrase">Passphrase</label>
<input id="srcFileSftpPassphrase" type="password" data-i18n-placeholder="m365_fsrc_sftp_passphrase_placeholder" placeholder="Leave blank if key has no passphrase">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_passphrase_hint">Passphrase is saved to the OS keychain — never stored in a file.</div>
</div>
</div>
<div style="display:flex;align-items:center;gap:8px"> <div style="display:flex;align-items:center;gap:8px">
<input type="hidden" id="srcFileEditId" value=""> <input type="hidden" id="srcFileEditId" value="">
<div id="srcFileStatus" style="flex:1;font-size:11px;color:var(--muted)"></div> <div id="srcFileStatus" style="flex:1;font-size:11px;color:var(--muted)"></div>
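Note on the hunk above: the form now has a three-way type selector (`srcFileTypeSelect`) and an SFTP auth toggle (`srcFileSftpAuthSelect`), but neither handler appears in this diff. The sketch below shows one way the field groups could be switched. It follows the `<!-- Local / SMB path field -->` comment in assuming the path row is shared by the local and SMB types; the function name `sourceFormLayout` and its return shape are hypothetical.

```javascript
// Returns which field groups (keyed by element id) should be visible for a
// given source type and SFTP auth mode. Hypothetical helper, not from the diff.
function sourceFormLayout(sourceType, sftpAuth = "password") {
  if (!["local", "smb", "sftp"].includes(sourceType)) {
    throw new Error("Unknown source type: " + sourceType);
  }
  const sftp = sourceType === "sftp";
  return {
    srcFilePathRow: !sftp,                        // path field serves local and SMB
    srcFileSmbFields: sourceType === "smb",       // host, username, password
    srcFileSftpFields: sftp,                      // host, port, user, remote path
    srcSftpPwFields: sftp && sftpAuth === "password",
    srcSftpKeyFields: sftp && sftpAuth === "key", // key upload and passphrase
  };
}
```

A handler could then walk the returned object and set `style.display` on each element, which keeps the visibility rules in one place.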
@ -1273,26 +1457,26 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="fsrc-form" id="fsrcForm"> <div class="fsrc-form" id="fsrcForm">
<div style="font-size:11px;font-weight:600;color:var(--text)" data-i18n="m365_file_sources_add">Add source</div> <div style="font-size:11px;font-weight:600;color:var(--text)" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_label">Name <span style="color:var(--accent)">*</span></label> <label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="fsrcLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off"> <input id="fsrcLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div> </div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label> <label data-i18n="m365_fsrc_path">Path</label>
<input id="fsrcPath" type="text" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()"> <input id="fsrcPath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()">
</div> </div>
<div id="fsrcSmbFields" class="fsrc-smb-fields" style="display:none;flex-direction:column;gap:6px"> <div id="fsrcSmbFields" class="fsrc-smb-fields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent);margin:-2px 0 2px" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div> <div style="font-size:10px;color:var(--accent);margin:-2px 0 2px" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label> <label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="fsrcSmbHost" type="text" placeholder="nas.school.dk"> <input id="fsrcSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
</div> </div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label> <label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="fsrcSmbUser" type="text" placeholder="DOMAIN\\username or username"> <input id="fsrcSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_edit_placeholder" placeholder="DOMAIN\\username or username">
</div> </div>
<div class="fsrc-form-row"> <div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label> <label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="fsrcSmbPw" type="password" placeholder="Stored in OS keychain"> <input id="fsrcSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div> </div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Password is saved to the OS keychain — never stored in a file.</div> <div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div> </div>
@ -1351,7 +1535,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<option value="replace" data-i18n="m365_db_import_replace">Replace (full restore)</option> <option value="replace" data-i18n="m365_db_import_replace">Replace (full restore)</option>
</select> </select>
</div> </div>
<div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.</div> <div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.</div>
<div id="importDbStatus" style="min-height:16px;font-size:11px;color:var(--muted)"></div> <div id="importDbStatus" style="min-height:16px;font-size:11px;color:var(--muted)"></div>
<div style="display:flex;justify-content:flex-end;gap:8px;padding-top:4px;border-top:1px solid var(--border)"> <div style="display:flex;justify-content:flex-end;gap:8px;padding-top:4px;border-top:1px solid var(--border)">
<button onclick="closeImportDBModal()" style="background:none;border:1px solid var(--border);color:var(--muted);padding:5px 14px;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_close">Close</button> <button onclick="closeImportDBModal()" style="background:none;border:1px solid var(--border);color:var(--muted);padding:5px 14px;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_close">Close</button>


@ -252,3 +252,36 @@ class TestFernet:
    def test_decrypt_empty_returns_empty(self):
        result = app_config._decrypt_password("")
        assert result == ""


class TestSmtpConfigLegacyKeys:
    """SMTP config saved by the older settings tab used `user`/`starttls`;
    readers expect `username`/`use_tls`. _load_smtp_config must normalise them."""

    def test_legacy_keys_normalised_on_load(self, tmp_path, monkeypatch):
        import json
        p = tmp_path / "smtp.json"
        p.write_text(json.dumps({
            "host": "smtp.gmail.com", "port": 587,
            "user": "netadmin@adm.example.dk",  # legacy key
            "starttls": True,                   # legacy key
            "from_addr": "netadmin@adm.example.dk",
            "recipients": ["a@example.dk"],
        }), encoding="utf-8")
        monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
        cfg = app_config._load_smtp_config()
        assert cfg["username"] == "netadmin@adm.example.dk"
        assert cfg["use_tls"] is True

    def test_canonical_keys_take_precedence(self, tmp_path, monkeypatch):
        import json
        p = tmp_path / "smtp.json"
        p.write_text(json.dumps({
            "username": "canonical@example.dk",
            "user": "legacy@example.dk",
        }), encoding="utf-8")
        monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
        cfg = app_config._load_smtp_config()
        assert cfg["username"] == "canonical@example.dk"
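
Reviewer note: the two tests above pin down the behaviour expected of `_load_smtp_config` (legacy keys are mapped to canonical ones, and a canonical key wins when both are present). The diff does not show the implementation, so the following is only a minimal sketch of one way to satisfy them. The helper name `normalise_smtp_config`, the mapping table, and the choice to drop the legacy key after mapping are all assumptions for illustration.

```python
from __future__ import annotations

# Hypothetical mapping: legacy key -> canonical key (key names taken from the tests).
_LEGACY_SMTP_KEYS = {"user": "username", "starttls": "use_tls"}


def normalise_smtp_config(raw: dict) -> dict:
    """Return a copy of raw with legacy keys renamed to their canonical form.

    A canonical key that is already present wins over its legacy twin.
    """
    cfg = dict(raw)
    for legacy, canonical in _LEGACY_SMTP_KEYS.items():
        if legacy in cfg:
            value = cfg.pop(legacy)          # assumption: the legacy key is dropped
            cfg.setdefault(canonical, value)  # keep an existing canonical value
    return cfg


legacy_cfg = normalise_smtp_config({"user": "netadmin@adm.example.dk", "starttls": True})
mixed_cfg = normalise_smtp_config({"username": "canonical@example.dk",
                                   "user": "legacy@example.dk"})
```

Using `setdefault` keeps the precedence rule in one place: the legacy value is only written when no canonical value exists.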


@ -22,7 +22,7 @@ import checkpoint
@pytest.fixture(autouse=True) @pytest.fixture(autouse=True)
def _isolate(tmp_path, monkeypatch): def _isolate(tmp_path, monkeypatch):
"""Redirect all disk writes to a temp dir for each test.""" """Redirect all disk writes to a temp dir for each test."""
monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json") monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json") monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")


@ -265,3 +265,71 @@ class TestExportImport:
        tgt.import_db(str(export_path), mode="replace")
        results = tgt.lookup_data_subject("290472-1234")
        assert len(results) >= 1


# ─────────────────────────────────────────────────────────────────────────────
# Orphan-scan recovery (crash / kill / mid-scan restart)
# ─────────────────────────────────────────────────────────────────────────────

class TestOrphanScanRecovery:
    def _start_unfinished_scan(self, db, item_id):
        """Begin a scan and save an item but never call finish_scan."""
        sid = db.begin_scan({"sources": ["email"], "user_ids": []})
        db.save_item(sid, _make_card(item_id=item_id))
        return sid

    def test_unfinished_scan_items_hidden_until_recovery(self, tmp_db):
        self._start_unfinished_scan(tmp_db, "orphan-1")
        # Not finalised → invisible to the open-items view
        assert tmp_db.get_open_items() == []

    def test_recovery_finalises_and_reveals_items(self, tmp_db):
        self._start_unfinished_scan(tmp_db, "orphan-1")
        self._start_unfinished_scan(tmp_db, "orphan-2")
        recovered = tmp_db.finalize_orphan_scans()
        assert recovered == 2
        ids = {row["id"] for row in tmp_db.get_open_items()}
        assert ids == {"orphan-1", "orphan-2"}

    def test_recovery_leaves_finished_scans_untouched(self, tmp_db):
        sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
        tmp_db.save_item(sid, _make_card(item_id="done-1"))
        tmp_db.finish_scan(sid, total_scanned=1)
        before = tmp_db._connect().execute(
            "SELECT finished_at FROM scans WHERE id=?", (sid,)
        ).fetchone()[0]
        assert tmp_db.finalize_orphan_scans() == 0  # nothing to recover
        after = tmp_db._connect().execute(
            "SELECT finished_at FROM scans WHERE id=?", (sid,)
        ).fetchone()[0]
        assert after == before  # finished_at not rewritten

    def test_recovery_is_idempotent(self, tmp_db):
        self._start_unfinished_scan(tmp_db, "orphan-1")
        assert tmp_db.finalize_orphan_scans() == 1
        assert tmp_db.finalize_orphan_scans() == 0


# ─────────────────────────────────────────────────────────────────────────────
# account_name persistence (user/group badge data)
# ─────────────────────────────────────────────────────────────────────────────

class TestAccountNamePersistence:
    def test_account_name_round_trips(self, tmp_db):
        sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
        tmp_db.save_item(sid, _make_card(item_id="an-1"))  # account_name="Test User"
        tmp_db.finish_scan(sid, total_scanned=1)
        row = [r for r in tmp_db.get_open_items() if r["id"] == "an-1"][0]
        assert row.get("account_name") == "Test User"

    def test_account_name_column_exists(self, tmp_db):
        cols = [r[1] for r in tmp_db._connect().execute(
            "PRAGMA table_info(flagged_items)").fetchall()]
        assert "account_name" in cols
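
Reviewer note: the orphan-recovery tests above require `finalize_orphan_scans()` to return the number of recovered scans, leave finished scans untouched, and be idempotent. The real method is not part of this diff; the sketch below shows one way those three properties could fall out of a single `UPDATE`, assuming an unfinished scan is stored with `finished_at` set to `NULL`. The free function and the in-memory schema are illustrative only.

```python
import sqlite3
import time


def finalize_orphan_scans(conn: sqlite3.Connection) -> int:
    """Stamp finished_at on every scan that never finished; return how many."""
    cur = conn.execute(
        # Assumption: an interrupted scan is one whose finished_at is still NULL.
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (time.time(),),
    )
    conn.commit()
    return cur.rowcount  # rows touched by the UPDATE


# Minimal stand-in schema with one orphaned scan and one finished scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
)
conn.execute("INSERT INTO scans (started_at, finished_at) VALUES (1.0, NULL)")
conn.execute("INSERT INTO scans (started_at, finished_at) VALUES (2.0, 3.0)")

first_pass = finalize_orphan_scans(conn)
second_pass = finalize_orphan_scans(conn)
finished_row = conn.execute("SELECT finished_at FROM scans WHERE id = 2").fetchone()[0]
```

Because the `WHERE` clause only matches unfinished rows, a second call finds nothing to update, which is what makes the recovery safe to run on every startup.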

tests/test_google_scan.py Normal file

@ -0,0 +1,311 @@
"""
Route and engine tests for the Google Workspace scan module.

Covers:
  - GET  /api/google/scan/users: auth guard, user list, error propagation
  - POST /api/google/scan/start: auth guard, concurrency lock, successful start, lock release
  - POST /api/google/scan/cancel: abort signal
  - _run_google_scan: no-connector broadcast, CPR hit flagging, source_type tagging
"""
from __future__ import annotations
import threading
import time
from unittest.mock import MagicMock
import pytest
# ── Fixtures ──────────────────────────────────────────────────────────────────
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
@pytest.fixture()
def mock_google_connector(monkeypatch):
from routes import state
conn = MagicMock()
conn.list_users.return_value = []
monkeypatch.setattr(state, "google_connector", conn)
return conn
@pytest.fixture(autouse=True)
def clean_google_state():
yield
from routes import state
# Release the Google scan lock if a test left it acquired
acquired = state._google_scan_lock.acquire(blocking=False)
if acquired:
state._google_scan_lock.release()
state._google_scan_abort.clear()
# ── GET /api/google/scan/users ────────────────────────────────────────────────
class TestGoogleScanUsers:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.get("/api/google/scan/users")
assert r.status_code == 401
assert r.json["error"] == "not connected"
def test_returns_user_list(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = [
{"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"},
]
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert len(r.json["users"]) == 1
assert r.json["users"][0]["email"] == "alice@test.dk"
def test_returns_empty_list_when_no_users(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = []
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert r.json["users"] == []
def test_connector_error_returns_500(self, client, mock_google_connector):
mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable")
r = client.get("/api/google/scan/users")
assert r.status_code == 500
assert "error" in r.json
# ── POST /api/google/scan/start ───────────────────────────────────────────────
class TestGoogleScanStart:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 401
assert "not connected" in r.json["error"]
def test_already_running_returns_409(self, client, mock_google_connector):
from routes import state
state._google_scan_lock.acquire()
try:
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 409
assert "already running" in r.json["error"]
finally:
state._google_scan_lock.release()
def test_starts_successfully(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert r.json["status"] == "started"
def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
state._google_scan_abort.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
client.post("/api/google/scan/start", json={})
assert not state._google_scan_abort.is_set()
def test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _fake_scan(opts):
time.sleep(0.02)
done.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05) # allow finally block to run
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan completed"
state._google_scan_lock.release()
@pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning")
def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _failing_scan(opts):
done.set()
raise RuntimeError("simulated crash")
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05)
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan raised an exception"
state._google_scan_lock.release()
# ── POST /api/google/scan/cancel ─────────────────────────────────────────────
class TestGoogleScanCancel:
def test_sets_abort_event(self, client):
from routes import state
state._google_scan_abort.clear()
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
assert state._google_scan_abort.is_set()
def test_idempotent_when_not_running(self, client):
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
# ── _run_google_scan engine ───────────────────────────────────────────────────
class TestRunGoogleScan:
"""
Unit-tests for _run_google_scan() called synchronously with all heavy
dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O.
"""
def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None):
import gdpr_scanner
import checkpoint
import scan_engine
import gdpr_db
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", conn)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
monkeypatch.setattr(gdpr_scanner, "_scan_bytes",
lambda data, name, **kw: scan_bytes_result or {
"cprs": [], "pii_counts": None, "emails": [], "phones": []
})
monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {})
monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None)
monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card)
monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None)
gdpr_scanner.flagged_items.clear()
return events
def _run(self, monkeypatch, conn, options, scan_bytes_result=None):
import gdpr_scanner
import routes.google_scan as gs
events = self._setup_mocks(monkeypatch, conn, scan_bytes_result)
gs._run_google_scan(options)
gdpr_scanner.flagged_items.clear()
return events
def test_no_connector_broadcasts_error_and_done(self, monkeypatch):
import gdpr_scanner
import routes.google_scan as gs
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", None)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
gs._run_google_scan({"sources": ["gmail"], "user_emails": ["a@b.dk"], "options": {}})
assert any(evt == "scan_error" for evt, _ in events)
assert any(evt == "google_scan_done" for evt, _ in events)
def test_gmail_item_with_cpr_is_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"),
]
cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert len(flagged) == 1
def test_gmail_item_source_type_is_gmail(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert flagged[0]["source_type"] == "gmail"
def test_gmail_item_without_pii_not_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"),
]
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
assert not any(evt == "scan_file_flagged" for evt, _ in events)
def test_gdrive_item_source_type_is_gdrive(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
conn.iter_drive_files.return_value = [
({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"]
assert len(gdrive) == 1
def test_scan_done_always_broadcast(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
done = [d for evt, d in events if evt == "google_scan_done"]
assert len(done) == 1
assert "flagged_count" in done[0]
assert "total_scanned" in done[0]
def test_scan_done_counts_are_correct(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "m1", "name": "a.txt", "size": 100}, b"x"),
({"id": "m2", "name": "b.txt", "size": 100}, b"y"),
]
cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
done = next(d for evt, d in events if evt == "google_scan_done")
assert done["total_scanned"] == 2
assert done["flagged_count"] == 2
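
Reviewer note: `TestGoogleScanStart` checks that a second start request gets a 409 while a scan holds the lock, and that the lock is released whether the scan finishes or raises. The route itself is not shown in this diff. The following is a minimal sketch of that pattern, with `start_scan` and `_scan_lock` as hypothetical stand-ins for the route handler and `state._google_scan_lock`.

```python
import threading

_scan_lock = threading.Lock()  # stand-in for state._google_scan_lock


def start_scan(run, options):
    """Start run(options) on a worker thread unless a scan is already active."""
    if not _scan_lock.acquire(blocking=False):
        return {"error": "scan already running"}, 409

    def _worker():
        try:
            run(options)
        finally:
            # Runs on success and on exception, so the lock never leaks.
            _scan_lock.release()

    threading.Thread(target=_worker, daemon=True).start()
    return {"status": "started"}, 200


# Usage: the second request is rejected while the first scan is in flight.
gate = threading.Event()
first = start_scan(lambda opts: gate.wait(timeout=2), {})
second = start_scan(lambda opts: None, {})
gate.set()
released = _scan_lock.acquire(timeout=2)  # succeeds once the worker exits
if released:
    _scan_lock.release()
```

A plain `threading.Lock` may be released by a thread other than the one that acquired it, which is what lets the request thread acquire and the worker thread release.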


@ -270,6 +270,49 @@ class TestFlaggedScopeEnforcement:
        ids = {row["id"] for row in r.get_json()}
        assert "ci1" in ids

    def test_no_ref_returns_open_items_across_all_sessions(self, client, db_patch):
        # Two scans in separate session windows. The default (no-ref) view must
        # surface unactioned items from BOTH, not just the latest session.
        old_id = _seed_scan(db_patch, [_item("o1")])
        db_patch._connect().execute(
            "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
        )
        db_patch._connect().commit()
        _seed_scan(db_patch, [_item("o2")])
        r = client.get("/api/db/flagged")
        ids = {row["id"] for row in r.get_json()}
        assert ids == {"o1", "o2"}

    def test_no_ref_excludes_items_with_a_disposition(self, client, db_patch):
        _seed_scan(db_patch, [_item("d1"), _item("d2")])
        db_patch.set_disposition("d1", "kept")
        r = client.get("/api/db/flagged")
        ids = {row["id"] for row in r.get_json()}
        assert "d2" in ids      # untouched → still open
        assert "d1" not in ids  # action taken → hidden

    def test_no_ref_unreviewed_disposition_stays_open(self, client, db_patch):
        _seed_scan(db_patch, [_item("u1")])
        db_patch.set_disposition("u1", "unreviewed")
        r = client.get("/api/db/flagged")
        ids = {row["id"] for row in r.get_json()}
        assert "u1" in ids  # 'unreviewed' status is not an action

    def test_no_ref_dedupes_rescanned_item_to_latest(self, client, db_patch):
        # Same item flagged by two scans → appears once.
        old_id = _seed_scan(db_patch, [_item("k1")])
        db_patch._connect().execute(
            "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
        )
        db_patch._connect().commit()
        _seed_scan(db_patch, [_item("k1")])
        rows = [row for row in client.get("/api/db/flagged").get_json() if row["id"] == "k1"]
        assert len(rows) == 1

    def test_ref_param_loads_historical_session(self, client, db_patch):
        # Push first scan >300 s into the past so it occupies its own session window.
        old_id = _seed_scan(db_patch, [_item("h1")])
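
Reviewer note: the four `test_no_ref_*` tests describe the default open-items view: items from all sessions, one row per item (the most recent scan wins), and anything with a real disposition hidden while `unreviewed` stays visible. The query behind `/api/db/flagged` is not in this diff, so the sketch below expresses the same rules over plain dictionaries. The row shape (`id`, `scan_started_at`, `disposition`) and the function name are assumptions.

```python
def open_items(rows):
    """Return one row per item id (latest scan wins), keeping only open items.

    Each row is assumed to be a dict with "id", "scan_started_at" and an
    optional "disposition" key.
    """
    latest = {}
    for row in rows:
        kept = latest.get(row["id"])
        if kept is None or row["scan_started_at"] > kept["scan_started_at"]:
            latest[row["id"]] = row
    # An item counts as open when no action was taken on it;
    # the "unreviewed" status is not treated as an action.
    return [r for r in latest.values() if r.get("disposition") in (None, "unreviewed")]


rows = [
    {"id": "k1", "scan_started_at": 100, "disposition": None},
    {"id": "k1", "scan_started_at": 500, "disposition": None},  # rescanned later
    {"id": "d1", "scan_started_at": 500, "disposition": "kept"},
    {"id": "u1", "scan_started_at": 500, "disposition": "unreviewed"},
]
result = open_items(rows)
result_ids = sorted(r["id"] for r in result)
```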


@ -97,6 +97,22 @@ class TestScanStatus:
        assert "scan_id" in data
        assert data["scan_id"] is None

    def test_idle_reports_google_not_running(self, client):
        # The refresh/restore path relies on google_running being reported
        # separately — running alone misses live Google scans.
        data = client.get("/api/scan/status").get_json()
        assert data["google_running"] is False

    def test_google_lock_held_reports_google_running(self, client):
        from routes import state
        assert state._google_scan_lock.acquire(blocking=False)
        try:
            data = client.get("/api/scan/status").get_json()
            assert data["google_running"] is True
            assert data["running"] is False  # M365/file lock still free
        finally:
            state._google_scan_lock.release()


# ---------------------------------------------------------------------------
# /api/scan/start
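
Reviewer note: the two status tests added above expect `/api/scan/status` to report the M365/file lock and the Google lock independently. How the route reads the locks is not shown here. One simple way, sketched under the assumption that both are plain `threading.Lock` objects, is to report `Lock.locked()` for each; the function name `scan_status` is hypothetical.

```python
import threading

_scan_lock = threading.Lock()         # stand-in for state._scan_lock
_google_scan_lock = threading.Lock()  # stand-in for state._google_scan_lock


def scan_status() -> dict:
    """Report each scan lock separately without acquiring either one."""
    return {
        "running": _scan_lock.locked(),
        "google_running": _google_scan_lock.locked(),
    }


idle = scan_status()
with _google_scan_lock:
    busy = scan_status()
```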

tests/test_updates.py Normal file

@ -0,0 +1,222 @@
"""
Tests for the software-update routes (routes/updates.py).
All git interaction is mocked; no test touches the real repository,
the network, or restarts the process.
"""
from __future__ import annotations
import subprocess
import pytest
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
def _cp(returncode=0, stdout="", stderr=""):
    return subprocess.CompletedProcess(args=[], returncode=returncode,
                                       stdout=stdout, stderr=stderr)


def _fake_git(*, local="aaaaaaa1", remote="aaaaaaa1", branch="main",
              fetch_rc=0, dirty=False, reqs_changed=False, merge_rc=0,
              commits=""):
    """Build a _git() replacement dispatching on the git subcommand."""
    calls = []

    def fake(*args, timeout=None):
        calls.append(args)
        if args[:2] == ("rev-parse", "--abbrev-ref"):
            return _cp(stdout=branch + "\n")
        if args == ("rev-parse", "HEAD"):
            return _cp(stdout=local + "\n")
        if args[0] == "rev-parse":
            return _cp(stdout=remote + "\n")
        if args[0] == "fetch":
            return _cp(returncode=fetch_rc, stderr="fetch failed" if fetch_rc else "")
        if args[0] == "log":
            return _cp(stdout=commits)
        if args[0] == "diff-index":
            return _cp(returncode=1 if dirty else 0)
        if args[0] == "diff":
            return _cp(returncode=1 if reqs_changed else 0)
        if args[0] == "merge":
            return _cp(returncode=merge_rc, stderr="not a fast-forward" if merge_rc else "")
        if args[0] == "stash":
            return _cp()
        raise AssertionError(f"unexpected git call: {args}")

    fake.calls = calls
    return fake
@pytest.fixture(autouse=True)
def supported(monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_supported", lambda: True)
@pytest.fixture(autouse=True)
def no_audit(monkeypatch):
import gdpr_db
monkeypatch.setattr(gdpr_db, "log_audit_event", lambda *a, **k: None)
# ── /api/update/check ─────────────────────────────────────────────────────────
def test_check_unsupported(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_supported", lambda: False)
r = client.get("/api/update/check")
assert r.status_code == 200
assert r.get_json() == {"supported": False}
def test_check_up_to_date(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_git", _fake_git())
d = client.get("/api/update/check").get_json()
assert d["supported"] and d["up_to_date"]
assert d["commits"] == []
def test_check_update_available(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_git", _fake_git(
local="aaaaaaa1", remote="bbbbbbb2",
commits="bbbbbbb2 Fix thing\nccccccc3 Add thing\n"))
d = client.get("/api/update/check").get_json()
assert d["up_to_date"] is False
assert d["current"] == "aaaaaaa"
assert d["latest"] == "bbbbbbb"
assert len(d["commits"]) == 2
def test_check_fetch_failure(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_git", _fake_git(fetch_rc=1))
d = client.get("/api/update/check").get_json()
assert d["supported"] is True
assert "fetch failed" in d["error"]
# ── /api/update/apply ─────────────────────────────────────────────────────────
def test_apply_up_to_date_is_noop(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_git", _fake_git())
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
r = client.post("/api/update/apply")
assert r.status_code == 200
d = r.get_json()
assert d["ok"] is True and d["updated"] is False
def test_apply_refused_while_scan_running(client, monkeypatch):
import routes.updates as upd
from routes import state
monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2"))
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
assert state._scan_lock.acquire(blocking=False)
try:
r = client.post("/api/update/apply")
finally:
state._scan_lock.release()
assert r.status_code == 409
assert r.get_json()["code"] == "scan_running"
def test_apply_happy_path(client, monkeypatch):
import routes.updates as upd
fake = _fake_git(remote="bbbbbbb2", commits="bbbbbbb2 Fix\n")
monkeypatch.setattr(upd, "_git", fake)
restarts = []
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: restarts.append(1))
r = client.post("/api/update/apply")
assert r.status_code == 200
d = r.get_json()
assert d["ok"] and d["updated"] and d["restarting"]
assert d["from"] == "aaaaaaa" and d["to"] == "bbbbbbb"
assert restarts == [1]
assert ("merge", "--ff-only", "origin/main") in fake.calls
# tree was clean — no stash
assert not any(c[0] == "stash" for c in fake.calls)
def test_apply_stashes_dirty_tree(client, monkeypatch):
import routes.updates as upd
fake = _fake_git(remote="bbbbbbb2", dirty=True)
monkeypatch.setattr(upd, "_git", fake)
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
r = client.post("/api/update/apply")
assert r.status_code == 200
assert any(c[0] == "stash" for c in fake.calls)
def test_apply_merge_failure(client, monkeypatch):
import routes.updates as upd
monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2", merge_rc=1))
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
r = client.post("/api/update/apply")
assert r.status_code == 409
d = r.get_json()
assert d["code"] == "merge_failed"
assert "fast-forward" in d["error"]
def test_apply_installs_requirements_when_changed(client, monkeypatch):
import routes.updates as upd
fake = _fake_git(remote="bbbbbbb2", reqs_changed=True)
monkeypatch.setattr(upd, "_git", fake)
monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
pip_calls = []
monkeypatch.setattr(upd.subprocess, "run",
lambda cmd, **kw: pip_calls.append(cmd) or _cp())
r = client.post("/api/update/apply")
assert r.status_code == 200
assert len(pip_calls) == 1
assert "pip" in pip_calls[0] and "-r" in pip_calls[0]
# ── Restart fd hygiene ────────────────────────────────────────────────────────
def test_mark_fds_cloexec_unmarks_inheritable_socket():
"""Werkzeug sets the listening socket inheritable; the restart must undo
that or the socket leaks through execv and squats on the port."""
import socket
import routes.updates as upd
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
s.set_inheritable(True)
assert s.get_inheritable() is True
upd._mark_fds_cloexec()
assert s.get_inheritable() is False
finally:
s.close()
# ── /api/update/settings ──────────────────────────────────────────────────────
def test_settings_roundtrip(client, monkeypatch):
import routes.updates as upd
store = {"auto_update": False}
monkeypatch.setattr(upd, "get_update_config", lambda: dict(store))
monkeypatch.setattr(upd, "save_update_config",
lambda v: store.__setitem__("auto_update", bool(v)))
d = client.get("/api/update/settings").get_json()
assert d == {"supported": True, "auto_update": False}
r = client.post("/api/update/settings", json={"auto_update": True})
assert r.get_json() == {"ok": True}
assert store["auto_update"] is True
d = client.get("/api/update/settings").get_json()
assert d["auto_update"] is True
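
Reviewer note: `test_mark_fds_cloexec_unmarks_inheritable_socket` relies on `upd._mark_fds_cloexec()` clearing the inheritable flag on open descriptors before the process re-executes itself. The real helper is not shown in this diff. The sketch below is one possible approach, assuming it is enough to walk a fixed descriptor range and skip numbers that are not open; the function name and the range bounds are illustrative.

```python
import os
import socket


def mark_fds_cloexec(first_fd: int = 3, max_fd: int = 1024) -> None:
    """Mark every open descriptor in [first_fd, max_fd) as non-inheritable."""
    for fd in range(first_fd, max_fd):
        try:
            os.set_inheritable(fd, False)
        except OSError:
            continue  # this number is not an open descriptor


# Usage: a socket made inheritable on purpose is reset by the sweep.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.set_inheritable(True)
mark_fds_cloexec()
still_inheritable = sock.get_inheritable()
sock.close()
```

On Unix, non-inheritable descriptors are closed when a new program is executed, so a listening socket treated this way does not survive into the restarted process.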

update_gdpr.sh Executable file

@ -0,0 +1,83 @@
#!/usr/bin/env bash
# GDPRScanner — self-update script.
#
# Pulls the latest release from origin, reinstalls dependencies if they
# changed, and restarts the systemd service if one is installed.
# Safe to run from cron: exits quietly when already up to date, and
# auto-stashes local hotfixes instead of aborting the merge.
#
# Usage:
# ./update_gdpr.sh # update if origin has new commits
# ./update_gdpr.sh --check # report status only, change nothing
#
# Environment:
# GDPR_BRANCH branch to track (default: main)
# GDPR_SERVICE systemd unit to restart (default: gdprscanner, if it exists)
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BRANCH="${GDPR_BRANCH:-main}"
SERVICE="${GDPR_SERVICE:-gdprscanner}"
log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"; }
cd "$SCRIPT_DIR"
if [ ! -d .git ]; then
  log "ERROR: $SCRIPT_DIR is not a git checkout — cannot self-update."
  exit 1
fi

git fetch origin "$BRANCH" --quiet
LOCAL="$(git rev-parse HEAD)"
REMOTE="$(git rev-parse "origin/$BRANCH")"

if [ "$LOCAL" = "$REMOTE" ]; then
  log "Already up to date ($(git describe --always HEAD))."
  exit 0
fi

log "Update available: $(git rev-parse --short HEAD) -> $(git rev-parse --short "$REMOTE")"
git log --oneline "HEAD..origin/$BRANCH" | sed 's/^/ /'

if [ "${1:-}" = "--check" ]; then
  exit 0
fi

# Local edits (e.g. a hotfix applied directly on the server) would make the
# merge abort. Stash them so the update proceeds; the stash is kept so
# nothing is lost.
if ! git diff-index --quiet HEAD --; then
  log "Local changes detected — stashing:"
  git diff --stat HEAD | sed 's/^/ /'
  git stash push --quiet -m "update_gdpr.sh auto-stash $(date '+%Y-%m-%d %H:%M:%S')"
  log "Recover later with: git stash show -p / git stash pop"
fi

REQS_CHANGED=false
if ! git diff --quiet "HEAD..origin/$BRANCH" -- requirements.txt; then
  REQS_CHANGED=true
fi

# Fast-forward only: the server checkout must never diverge from origin.
git merge --ff-only --quiet "origin/$BRANCH"
log "Updated to $(git rev-parse --short HEAD)."

if [ "$REQS_CHANGED" = true ]; then
  log "requirements.txt changed — updating dependencies..."
  "$SCRIPT_DIR/venv/bin/pip" install --quiet -r requirements.txt
  log "Dependencies updated."
fi

if command -v systemctl >/dev/null 2>&1 \
   && systemctl list-unit-files --type=service 2>/dev/null | grep -q "^$SERVICE\.service"; then
  log "Restarting $SERVICE.service..."
  systemctl restart "$SERVICE"
  log "Service restarted."
else
  log "No systemd unit '$SERVICE' found — restart GDPRScanner manually."
fi
log "Done."
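
Reviewer note: the script's `--check` path (fetch, compare `HEAD` with `origin/$BRANCH`, list the pending commits) is the same comparison the `/api/update/check` tests exercise. As a rough illustration only, the sketch below restates that read-only check in Python with the git runner injected so it can be replaced by a fake. The names `run_git` and `update_status` are hypothetical and are not the functions in `routes/updates.py`.

```python
import subprocess


def run_git(*args: str) -> str:
    """Run git with the given arguments and return its stdout."""
    done = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return done.stdout


def update_status(git=run_git, branch: str = "main") -> dict:
    """Fetch, then compare HEAD with origin/<branch>; changes nothing locally."""
    git("fetch", "origin", branch, "--quiet")
    local = git("rev-parse", "HEAD").strip()
    remote = git("rev-parse", f"origin/{branch}").strip()
    if local == remote:
        return {"up_to_date": True, "commits": []}
    log = git("log", "--oneline", f"HEAD..origin/{branch}")
    return {"up_to_date": False, "commits": log.splitlines()}


def fake_git(*args: str) -> str:
    """Illustrative fake: HEAD is 'aaa', origin/main is 'bbb', one commit pending."""
    if args[0] == "fetch":
        return ""
    if args == ("rev-parse", "HEAD"):
        return "aaa\n"
    if args[0] == "rev-parse":
        return "bbb\n"
    return "bbb Fix thing\n"


status = update_status(git=fake_git)
```

Injecting the runner keeps the comparison logic testable without a repository or network access, which is the same idea `_fake_git` applies in the test module.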