Compare commits

No commits in common. "main" and "v1.6.14" have entirely different histories.

88 changed files with 2184 additions and 9711 deletions

@@ -1,21 +1,10 @@
name: Build — Windows, Linux & macOS
name: Build — Windows & Linux
# Trigger on every push to main, on version tags, or manually
on:
push:
branches: [main]
tags: ['v*']
paths-ignore:
- '**.md'
- 'docs/**'
- 'tests/**'
- 'pytest.ini'
- 'run_tests.sh'
- 'build_gdpr.sh'
- 'start_gdpr.sh'
- 'install_macos.sh'
- 'install_windows.ps1'
- '.github/ISSUE_TEMPLATE/**'
workflow_dispatch:
# Only run one build at a time per branch to avoid race conditions
@@ -33,10 +22,10 @@ jobs:
include:
- os: windows-latest
name: windows
artifact_glob: "dist/*.exe"
- os: ubuntu-22.04
name: linux
- os: macos-15
name: macos
artifact_glob: "dist/GDPRScanner"
runs-on: ${{ matrix.os }}
name: GDPRScanner / ${{ matrix.name }}
@@ -69,11 +58,6 @@ jobs:
Xvfb :99 -screen 0 1024x768x24 &
echo "DISPLAY=:99" >> $GITHUB_ENV
- name: Install macOS system dependencies
if: runner.os == 'macOS'
run: |
brew install tesseract tesseract-lang poppler
- name: Install Python dependencies
run: |
python -m pip install --upgrade pip
@@ -94,27 +78,14 @@ jobs:
cd dist
zip -r "GDPRScanner_linux_x86_64.zip" "GDPRScanner"
- name: Package Windows binary
if: runner.os == 'Windows'
shell: pwsh
run: |
Compress-Archive -Path dist\GDPRScanner -DestinationPath dist\GDPRScanner_windows_x64.zip
- name: Package macOS binary
if: runner.os == 'macOS'
run: |
cd dist
zip -r "GDPRScanner_macos_arm64.zip" "GDPRScanner.app"
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: M365Scanner-${{ matrix.name }}
retention-days: 30
path: |
dist/*.exe
dist/GDPRScanner_linux_x86_64.zip
dist/GDPRScanner_windows_x64.zip
dist/GDPRScanner_macos_arm64.zip
# ── Release ───────────────────────────────────────────────────────────────
# • version tag (v*) → proper versioned release with generated notes

@@ -9,380 +9,18 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
## [Unreleased]
---
## [1.7.9] — 2026-06-22
### Added
- **"Always send via SMTP" option for email reports** — new toggle in **Settings → E-mailrapport**. When the scanner is signed in to Microsoft 365 it normally sends email through Microsoft Graph; Graph reports "accepted" the instant a message is queued, which hides the case where Exchange Online later silently drops it (e.g. a recipient on a Google-hosted subdomain of your Microsoft 365 domain — the message is treated as internal, finds no mailbox, and is discarded, with no delivery and no bounce). Enabling this option makes the manual report, the test email, and the after-scan auto-email all go straight through your configured SMTP server (e.g. Google Workspace `smtp.gmail.com` / `smtp-relay.gmail.com`), bypassing the Graph routing entirely.
### Changed
- **The results grid now shows every open item by default, not just the last scan** — when you open the app (or refresh after a scheduled or manual scan), the grid loads *all* flagged items that still need action — i.e. those with no disposition — across every scan, instead of only the most recent scan session. Items you have already tagged (kept, redacted, deleted, false positive, …) drop out of the view. Re-scans are de-duplicated so each item appears once, showing its most recent state. The session picker still loads any individual past scan, and the history banner button (formerly "Latest scan") is now **"Open items"** and returns to this default view.
### Fixed
- **Interrupted scans no longer lose their results** — a scan only became visible once it was *finalised*, but the Microsoft 365 and Google scan engines skipped finalisation when a scan was stopped, and any scan cut short by a server restart, crash, or out-of-memory kill never finalised at all. Its already-found items were then stranded in the database and invisible in the grid (this is what caused "scan finished but no results shown", especially after the in-app self-update restarts). Unfinished scans are now finalised automatically on startup (nothing is scanning at boot, so any unfinished scan is known to be dead), and a manually stopped Microsoft 365 scan finalises immediately so its partial results stay visible.
- **User and group badges were missing on result cards loaded from the database** — the reviewer's display name was shown live during a scan but never saved, so cards loaded from a past scan (now the default view) lost both the person badge and the Elev/Ansat group badge. The display name is now stored with each item, and the group badge is shown from the saved role even for older items that predate this fix (where a name can't be recovered, the group badge and a resolved e-mail still appear).
- **Email reports sent via SMTP failed with "authentication failed"** — the **Settings → E-mailrapport** tab saved the SMTP username under the wrong field name, so the username never reached the mail server and sign-in was skipped — the server then rejected the unauthenticated message, which surfaced as a misleading authentication error even with a correct password or app password. The setting is now saved correctly, and configurations saved before the fix are migrated automatically.
---
## [1.7.8] — 2026-06-16
### Fixed
- **Blank results grid after a browser refresh (especially after a server restart)** — restoring the last scan session on page load was one-shot: `_sseWatchdog()` called `loadHistorySession(null)` a single time, guarded by `_initialStatusChecked`. If that attempt was blocked — a completed scan's replayed `scan_phase` event leaves a `_*ScanRunning` flag set, and the `loadHistorySession` guard then bails — nothing retried, because `sse_replay_done` (the other retry path) only fires when the SSE replay buffer is non-empty, and the buffer is empty after a server restart (so refreshing after the in-app self-update reliably showed an empty grid even though the results were in the database). The watchdog now re-attempts the restore on every 4-second poll while nothing is shown and no scan is running, clearing stale running flags first (both scan locks are confirmed free at that point). Additionally, `/api/scan/status` now reports `google_running` separately from `running` (which only ever reflected the M365 + file lock), so a refresh during a live Google scan is detected instead of being treated as idle.
---
## [1.7.7] — 2026-06-15
### Changed
- **Share modal no longer leaves a stale link in the create box** — after clicking "Create", the generated-link preview row ("Copy link:") stayed visible at the top of the modal even though the new link was already listed under Active links with its own Copy button — so it looked like the form hadn't cleared. The redundant preview row is removed; creating a link now resets the form and briefly highlights the new entry in the Active links list, where it can be copied. (The 1.7.4 fix cleared the input fields but not this preview row.)
### Added
- **Reverse-proxy / HTTPS setup guide** — new `docs/setup/ZORAXY_SETUP.md` walks through putting the scanner behind Zoraxy with a Let's Encrypt certificate on a LAN-only deployment: DNS A-record to a private IP, ACME via DNS-01 challenge (HTTP-01 cannot reach a LAN-only host), proxy rule to `127.0.0.1:5100`, binding the app to loopback with `--host 127.0.0.1`, and scanner-specific verification (SSE streaming, HTTPS share links, self-update). Linked from the README (new "HTTPS / reverse proxy" section) and SECURITY.md.
### Fixed
- **SECURITY.md corrections** — the web UI binds to `0.0.0.0` by default, not `127.0.0.1` as claimed; the MSAL token cache path was still the pre-1.x `~/.gdpr_scanner_config.json` (actual: `~/.gdprscanner/token.json`).
---
## [1.7.6] — 2026-06-11
### Fixed
- **Update restart leaked the listening socket and hopped to port 5101** — Werkzeug marks its server socket inheritable (`srv.socket.set_inheritable(True)`, unconditionally, for its debug reloader), so the in-app update's `os.execv` restart carried the old listening socket into the new process as a zombie listener: same PID listening on both 5100 (never accepted — clients hang) and 5101 (the actual server). The 1.7.3 `SO_REUSEADDR`/grace-period fix couldn't help because the port genuinely was occupied — by the restarting process itself. `_restart_self()` now marks every fd above stderr close-on-exec before the exec (`_mark_fds_cloexec()`, enumerating `/proc/self/fd` on Linux), so the old socket dies with the exec and the new server rebinds 5100 immediately.
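
The close-on-exec step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `mark_fds_cloexec` is a hypothetical stand-in for `_mark_fds_cloexec()`, and it assumes Linux, where `/proc/self/fd` lists the open descriptors.

```python
import os
import socket


def mark_fds_cloexec(min_fd: int = 3) -> list[int]:
    """Mark every open fd >= min_fd close-on-exec and return the fds touched.

    Hypothetical stand-in for the changelog's `_mark_fds_cloexec()`.
    Linux-only: relies on /proc/self/fd to enumerate open descriptors.
    """
    touched = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd < min_fd:
            continue  # leave stdin, stdout and stderr inheritable
        try:
            os.set_inheritable(fd, False)  # sets FD_CLOEXEC on the descriptor
        except OSError:
            continue  # fd vanished since listing (e.g. the listdir handle itself)
        touched.append(fd)
    return touched


if __name__ == "__main__":
    srv = socket.socket()
    srv.set_inheritable(True)  # mimics what Werkzeug does to its server socket
    mark_fds_cloexec()
    print(srv.get_inheritable())
    srv.close()
```

Calling such a helper immediately before `os.execv` means the kernel closes the old listening socket during the exec, so the replacement process starts with the port unoccupied.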
---
## [1.7.5] — 2026-06-11
### Fixed
- **Stale UI after updating the server** — Flask served `/static/` files with no `Cache-Control` header, so browsers cached JS/CSS heuristically (often for days). After a server update — including the new in-app self-update, whose post-install reload hit the cache — the backend was new but the frontend stayed old, and fixes appeared "not to work" until a hard refresh. `SEND_FILE_MAX_AGE_DEFAULT = 0` now makes every static file revalidate via ETag: unchanged files answer with a cheap 304, changed files are re-fetched immediately on the next normal page load.
---
## [1.7.4] — 2026-06-10
### Fixed
- **Share modal kept stale input after creating a link** — clicking "Create" only cleared the label field; scope type, user email, date range, and expiry kept their values, so the next link silently inherited the previous link's scope settings. The form-reset logic from `openShareModal()` is now a shared `_resetShareForm()` helper called after every successful create (the generated link row stays visible for copying).
---
## [1.7.3] — 2026-06-10
### Fixed
- **App restart no longer hops to a new port** — the in-app update restart (and any quick stop/start) left connections from the previous instance in TIME_WAIT, and the startup port probe did a plain `bind()` that treats TIME_WAIT as occupied — so the restarted app silently came up on 5101 and the browser's reload poll never found it. The probe now sets `SO_REUSEADDR` (matching how Werkzeug actually binds, so an actively listening port is still detected as occupied), and the requested port gets a 10-second grace period before the auto-increment fallback kicks in, covering the brief window where the old process hasn't fully released the socket.
- **Share links now respect a reverse proxy** — `_getShareBaseUrl()` rewrote every copied share link to `http://<LAN-IP>:5100` (via `/api/local_ip`), which would bypass TLS when the scanner sits behind a reverse proxy (Zoraxy, Caddy, nginx, …): a DPO opening the link would silently fall back to plain HTTP. The LAN-IP rewrite now only applies in the case it was built for — browsing the app at `localhost` over HTTP, where `window.location.origin` would produce links unusable from other machines. Any HTTPS or non-localhost origin is used as-is.
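
The port-probe behaviour from the first entry above might look something like this. It is a sketch under assumptions: `port_is_free` and `pick_port` are illustrative names, the polling interval and fallback range are invented, and the "listening port is still detected as occupied" behaviour is what Linux does with `SO_REUSEADDR` (other platforms differ).

```python
import socket
import time


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if `port` can be bound with SO_REUSEADDR set.

    With SO_REUSEADDR, leftover TIME_WAIT connections do not block the bind,
    while (on Linux) a socket that is actively listening still does.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def pick_port(requested: int, host: str = "0.0.0.0",
              grace_seconds: float = 10.0, poll: float = 0.25,
              max_tries: int = 20) -> int:
    """Wait up to `grace_seconds` for the requested port, then auto-increment."""
    deadline = time.monotonic() + grace_seconds
    while True:
        if port_is_free(requested, host):
            return requested
        if time.monotonic() >= deadline:
            break
        time.sleep(poll)
    # Fallback: first free port above the requested one (assumed behaviour).
    for candidate in range(requested + 1, requested + 1 + max_tries):
        if port_is_free(candidate, host):
            return candidate
    raise OSError(f"no free port in {requested}..{requested + max_tries}")
```

The grace period only delays startup when the requested port is genuinely busy; a free port returns on the first probe.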
---
## [1.7.2] — 2026-06-10
### Fixed
- **Copy buttons did nothing over plain HTTP** — the share modal's "Copy" buttons (new link + active links) and the log panel's copy button called `navigator.clipboard.writeText()` directly. The Clipboard API only exists in secure contexts (HTTPS or localhost), so when the scanner is reached at `http://<LAN-IP>:5100` the call threw synchronously and the intended `execCommand` fallback never ran — the button silently did nothing. `_copyText()` in `viewer.js` now feature-detects the API, falls back to `document.execCommand('copy')`, and as a last resort shows the link in a `prompt()` for manual copying; `log.js` reuses the same helper via `window._copyText`. `_getShareBaseUrl()` now caches the LAN-IP lookup so the token-list Copy buttons copy synchronously within the click gesture (required for `execCommand`).
---
## [1.7.1] — 2026-06-10
### Added
- **Software update from the GUI** — a new **Settings → General → Software update** group lets the operator check for and install updates without touching the server shell. "Check for updates" fetches origin and shows either "You are running the latest version" or the list of pending commits; "Install update" fast-forwards the git checkout to `origin/<branch>`, reinstalls dependencies only if `requirements.txt` changed, writes an `app_update` audit-log entry, and restarts the app in place by re-exec'ing the process (`os.execv` — same PID, so it works both under systemd and when launched via `start_gdpr.sh`). The page polls until the server is back and reloads itself. Local server-side edits are auto-stashed (kept, never discarded) before the merge. Updating is refused with a clear message while any scan is running. An **"Install updates automatically"** toggle (stored in `config.json` under `auto_update`) enables a background thread that checks once a day and installs unattended, skipping (and retrying hourly) while a scan runs. The group is only shown when the app runs from a git checkout — the frozen desktop build hides it. New blueprint `routes/updates.py` with `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`; 11 new tests in `tests/test_updates.py` with fully mocked git.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the GUI update: fetch + fast-forward-only merge with auto-stash of local hotfixes, dependency reinstall only when `requirements.txt` changed, and a `systemctl restart` if a `gdprscanner.service` unit exists (override with `GDPR_SERVICE`). `./update_gdpr.sh --check` reports pending commits without changing anything; safe to run from cron (quiet no-op when already up to date).
### Fixed
- **Delta token status hid the source count** — the "Tokens saved" line under the Δ Delta scan toggle always showed the bare translation ("Tokens gemt") because the source count only existed in the JS fallback string, which is ignored whenever the lang key exists. The translations now carry a `{n}` placeholder ("Tokens gemt for {n} kilde(r)") substituted in `checkDeltaStatus()`, and the row gained a "?" hint bubble explaining what the saved change-tokens do and that "Clear tokens" forces the next scan to be a full scan.
- **Stale data-file paths in docs and UI text** — README, SECURITY.md, MAINTAINER.md, the `--headless` argparse help (`--settings`, `--reset-db`, epilog), the DB-import replace warning/confirm strings (all three languages), and two code comments still referenced the pre-1.x flat dotfile layout (`~/.gdpr_scanner_delta.json`, `~/.gdpr_scanner_smtp.json`, `~/.gdpr_scanner_machine_id`, `~/.gdpr_scanner.db`). All now point to the actual locations under `~/.gdprscanner/` (`delta.json`, `smtp.json`, `machine_id`, `scanner.db`). The legacy-migration rename tables in `gdpr_scanner.py` intentionally keep the old names.
---
## [1.7.0] — 2026-06-10
### Added
- **PDF redaction for local files** — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (`page.apply_redactions()`), which removes the underlying text data from the PDF stream — not just paints over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.
- **Google Drive file redaction** — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via `files().update()`. Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the `drive` scope (not `drive.readonly`) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`.
- **SFTP file redaction** — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with `sftp.open(path, "wb")`. Source config is matched from `_load_file_sources()` by host + username; credentials are resolved from the keychain via `_resolve_sftp_credentials`. Requires the item to be in the current session's `state.flagged_items` (SFTP host info is not stored in the DB). New method: `SFTPScanner.write_file(remote_path, content)`.
- **SMB file redaction** — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from `full_path` (`//host/share/…`). File is downloaded and re-uploaded using smbprotocol with `CreateDisposition.FILE_SUPERSEDE` so the file is atomically replaced. New function: `file_scanner.write_smb_file(path, content, username, password, domain)`.
- **AI-enhanced NER via Claude** — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in **Settings → AI / NER**: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, `document_scanner.py` calls the Claude API (`claude-haiku-4-5-20251001`) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the `anthropic` package is not installed. API key stored in `config.json` under `claude_api_key`; toggle stored under `claude_ner`. Routes: `GET/POST /api/settings/claude`, `POST /api/settings/claude/test`.
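
A bounded per-document cache like the one mentioned in the AI-enhanced NER entry could be sketched as below. The changelog only states the 2 000-entry bound; the content-hash key, the least-recently-used eviction, and all names here (`BoundedCache`, `cached_entities`, `extract`) are assumptions for illustration.

```python
from collections import OrderedDict
import hashlib


class BoundedCache:
    """In-memory cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries: int = 2000) -> None:
        self._max = max_entries
        self._data = OrderedDict()

    @staticmethod
    def key_for(content: bytes) -> str:
        # Assumed keying scheme: hash of the document bytes.
        return hashlib.sha256(content).hexdigest()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def put(self, key: str, entities: list) -> None:
        self._data[key] = entities
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry


def cached_entities(cache: BoundedCache, content: bytes, extract) -> list:
    """Return cached entities for `content`, calling `extract` only on a miss."""
    key = cache.key_for(content)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = extract(content)  # e.g. the paid API call, or the spaCy fallback
    cache.put(key, result)
    return result
```

Because the key depends only on the document bytes, rescanning an unchanged file is a cache hit and triggers no second extraction call while the process stays up.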
### Changed
- **Redacted and deleted cards stay in the grid until the next scan** — previously redacting (✏) or deleting (🗑) a card — or running a bulk delete — removed the affected cards from the grid and from `S.flaggedData`/`S.filteredData` immediately. Now each item is kept and marked: the card is greyed (`card-resolved` styling), shows a `✏ Redacted` (green) or `🗑 Deleted` (red) badge, and its action buttons are hidden so it can't be re-processed. The operator can see what was handled during the session; the grid is rebuilt on the next scan run, which clears the markers. Implemented with `_redacted` / `_deleted` flags in `results.js` (`appendCard`, `redactItem`, `deleteItem`, `executeBulkDelete`, `deleteSubjectItems`); handled items are also excluded from the bulk-delete match set. `POST /api/delete_bulk` now returns `deleted_ids` so the grid marks exactly the items the server actually deleted (partial failures stay active). Also fixes a latent bug in the data-subject delete flow where `renderGrid()` was called with no argument and threw, falsely reporting "Delete failed" after a successful erasure.
### Fixed
- **Selected card scrolled out of view when opening the preview** — opening the preview panel narrows `.grid-area`, which reflows the `auto-fill` grid to fewer columns and moves every card to a new row. The single-frame `scrollIntoView` ran while the browser's scroll-anchoring re-adjusted `scrollTop` mid-reflow, fighting the scroll so the clicked card ended up off-screen. Fixed by disabling scroll anchoring on `.grid-area` (`overflow-anchor: none`) and deferring the scroll by two animation frames so it runs against the settled layout; the card is now centred (`block: 'center'`) instead of `'nearest'` so it stays clearly visible.
- **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay.
- **Settings modal too narrow for seven tabs** — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.
- **Card action buttons invisible in grid view** — `.card` was missing `position: relative`, so the `position:absolute` delete (🗑), redact (✏), and bulk-select checkbox elements anchored to the viewport instead of the card and were then clipped away by the card's `overflow:hidden`. They only appeared in list view, where those elements are `position:static` and flow inline. Added `position: relative` to `.card` so all three position correctly within each card. Also gave `.card-redact-btn` the same `0.35` baseline opacity as the delete button (it was `opacity:0` at rest) so it's discoverable without hovering.
### Security
- **Stored XSS in the results grid** — scan-derived strings (file name, account/display name, folder, source label, modified date, image `alt`) were interpolated straight into `innerHTML` and `title=` attributes across the card, list, preview, data-subject lookup, and related-documents views. Because these values come from scanned content (e.g. a OneDrive file deliberately named with markup), a crafted filename could execute script in a reviewer's session — including a shared read-only viewer/DPO session. A new `esc()` helper in `static/js/results.js` (escapes `& < > " '`) is now applied to every untrusted field before embedding. The related-documents `onclick` JSON is also escaped with `.replace(/"/g,'&quot;')` to match the delete/redact button pattern, closing an attribute-injection hole where a filename containing `"` could break out of the handler.
- **Reflected XSS in `/api/thumb`** — the `?name=` query parameter was embedded unescaped into the placeholder SVG served as `image/svg+xml`, so opening a crafted `/api/thumb?name=<script>…` URL directly executed script in the app origin. `cpr_detector._placeholder_svg` now HTML-escapes both the type label and the filename before embedding them in the SVG.
- **Claude API key now encrypted at rest** — the Anthropic API key was stored in plaintext in `config.json` while the SMTP password was already Fernet-encrypted. `save_claude_config()` now encrypts the key with the same machine-keyed Fernet (`_encrypt_password`); a new `get_claude_api_key()` decrypts it for use. Legacy plaintext keys are still read transparently and re-encrypted on the next save. Readers in `document_scanner.py` and `routes/app_routes.py` updated accordingly.
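
For the `/api/thumb` entry above, escaping before embedding might look roughly like this. The function name, SVG layout, and dimensions are illustrative assumptions, not the actual `cpr_detector._placeholder_svg`; only the idea of escaping both the type label and the filename comes from the changelog.

```python
from html import escape


def placeholder_svg(type_label: str, filename: str) -> str:
    """Build a placeholder SVG with both untrusted strings escaped.

    Hypothetical sketch; the real placeholder markup is not shown in the source.
    """
    safe_label = escape(type_label, quote=True)  # escapes & < > " '
    safe_name = escape(filename, quote=True)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="160" height="120">'
        f"<title>{safe_name}</title>"
        '<rect width="100%" height="100%" fill="#eeeeee"/>'
        f'<text x="80" y="64" text-anchor="middle">{safe_label}</text>'
        "</svg>"
    )
```

With this in place, a crafted `?name=<script>…` value ends up as inert text inside the `<title>` element rather than as markup parsed in the app origin.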
---
## [1.6.28] — 2026-05-28
### Added
- **Date-range scoping for viewer tokens** — tokens can now carry optional `valid_from` and `valid_to` scope fields (YYYY-MM-DD). When set, `GET /api/db/flagged` filters items whose `modified` date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces `valid_from ≤ valid_to`. All three scope dimensions (role, user, date-range) are independent and combinable.
- **CPR-only mode** — a new `cpr_only` scan option (sidebar toggle `#optCprOnly`, profile editor `#peOptCprOnly`) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. Those non-CPR detections are still flagged and shown on cards when `cpr_only=false` (the default). The check is applied at four points across the three engines: the file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
- **OCR language override** — a new `ocr_lang` scan option (sidebar select `#optOcrLang`, profile editor `#peOptOcrLang`) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng` (default), `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. The setting flows from the UI through the profile, into all three scan engines (M365 `_scan_bytes_timeout`, M365 attachments `_scan_bytes`, M365 files `_scan_bytes`, Google `_scan_bytes` for both Gmail and Drive). The `lang` parameter is threaded through `cpr_detector._scan_bytes`, `document_scanner.scan_pdf` / `scan_image`, and the spawned PDF-OCR subprocess worker. The OCR cache key already included `lang`, so per-language results are cached independently.
- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
- **`DELETE /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.
- **Scheduled report-only email job** — scheduled jobs can now be configured as "report only" (toggle `#schedReportOnly`). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via `get_session_items()`. M365 authentication is not required for report-only jobs — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting `report_only=True` in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).
- **Compliance audit log** — every significant admin action is now written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via `GET /api/audit_log` (returns newest-first, max 1000 entries; filterable by `?action=`). Visible in the Settings modal under a new **Audit Log** tab; the table refreshes whenever the tab is opened. The `log_audit_event()` module-level helper in `gdpr_db.py` silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.
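
The atomic rewrite described in the file-redaction entry above (temp file in the same directory, then move over the original) can be illustrated for plain-text files as follows. This is a simplified sketch: `CPR_RE` is an assumed pattern (the real detector is stricter), one `█` per matched character is an assumption, and `redact_text_file` is a hypothetical name rather than the project's `document_scanner` function.

```python
import os
import re
import shutil
import tempfile

# Assumed pattern for illustration: six digits, optional hyphen, four digits.
CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def redact_text_file(path: str, encoding: str = "utf-8") -> int:
    """Replace CPR-like numbers with block characters; return the match count."""
    with open(path, encoding=encoding, newline="") as fh:
        original = fh.read()
    redacted, count = CPR_RE.subn(lambda m: "█" * len(m.group()), original)
    if count == 0:
        return 0  # nothing to redact, leave the file untouched

    directory = os.path.dirname(os.path.abspath(path))
    # Same directory means same filesystem, so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".redact-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as tmp:
            tmp.write(redacted)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes hit the disk first
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)  # atomic swap over the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return count
```

A crash before `os.replace` leaves the original file intact plus a stray temp file; a crash after it leaves the fully redacted file. At no point is a half-written file visible under the original name.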
### Fixed
- **Stop button had no effect on Google Workspace scans** — `POST /api/scan/stop` only set `state._scan_abort` (the M365/file abort event) and never touched `state._google_scan_abort`. Separately, `_check_abort()` inside `_run_google_scan` was checking `gdpr_scanner._scan_abort` (the M365 event) instead of the module-level `_scan_abort` alias that points to `state._google_scan_abort`. Both bugs combined meant neither the Stop button nor `POST /api/google/scan/cancel` had any effect on a running Google scan. Fixed by having `scan_stop()` set both events and having `_check_abort()` use the correct module-level alias.
- **Settings tab labels wrapping to two lines** — adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry `white-space:nowrap`; `.settings-tabs` retains `flex-wrap:wrap` as a safety net on very small screens.
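
The Stop-button fix above comes down to two abort events that must both be set, and a scan loop that must watch its own event. A stripped-down sketch, assuming `threading.Event` objects and using simplified hypothetical names (`ScanState`, `run_google_scan`) in place of the real module state:

```python
import threading


class ScanState:
    """Minimal stand-in for the shared state module (names are illustrative)."""

    def __init__(self) -> None:
        self.scan_abort = threading.Event()         # M365 / file scans
        self.google_scan_abort = threading.Event()  # Google Workspace scans


def scan_stop(state: ScanState) -> None:
    """Stop handler: signal every engine, not just M365/file."""
    state.scan_abort.set()
    state.google_scan_abort.set()


def run_google_scan(state: ScanState, items) -> list:
    """Process items until the Google abort event is set."""
    abort = state.google_scan_abort  # the bug was checking the M365 event here
    scanned = []
    for item in items:
        if abort.is_set():
            break
        scanned.append(item)
    return scanned
```

Setting both events from one handler is harmless for an engine that is not running, provided each engine clears its own event when a new scan starts, which the 1.6.27 test suite entry describes for the Google routes.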
---
## [1.6.27] — 2026-05-27
### Added
- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.
- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).
- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.
---
## [1.6.26] — 2026-04-29
### Fixed
- **Previous scan results visible when a new scan starts** — two async functions (`loadHistorySession` and `loadLastScanSummary`) could resolve after `startScan` had already cleared the grid. `loadHistorySession` would re-populate the grid with old history items; `loadLastScanSummary` would re-show the last-scan summary card. Both functions now bail early after each `await` if any of the three scan-running flags (`S._m365ScanRunning`, `S._googleScanRunning`, `S._fileScanRunning`) is set — those flags are written synchronously by `startScan` before any awaits, so the check is race-free.
- **Selected card scrolls out of view when preview panel opens** — clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. `openPreview()` now schedules a `requestAnimationFrame` after removing `.hidden` from the panel so the card is scrolled back into view (`scrollIntoView block: nearest`) once the layout has settled.
- **Gmail and Google Drive preview crashed with a 404 Graph API error** — `_source_type` was never set on Google items in `routes/google_scan.py`, so Gmail and Google Drive cards carried an empty `source_type`. The preview route in `routes/database.py` only checked for `"local"`, `"smb"`, and `"email"` before falling through to the M365 else-branch, which tried to call `https://graph.microsoft.com/.../drive/items/gmail:{id}/preview` — always a 404. Fixed by tagging Gmail items as `_source_type = "gmail"` and Google Drive items as `"gdrive"` at scan time. The preview route now handles both: Google Drive files get an embeddable `https://drive.google.com/file/d/{id}/preview` iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The `state.connector` (M365 auth) guard was also moved inside the `email` and M365 `else` branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.
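
The preview dispatch described in the last entry above can be pictured as a branch on `source_type`, with the M365 auth check applied only where Graph is actually needed. The sketch below is an assumption-laden outline: `preview_plan`, its return dictionaries, and the bare `item_id` argument are invented for illustration; only the source-type values, the Drive preview URL shape, and the 401 behaviour come from the changelog.

```python
def preview_plan(source_type: str, item_id: str, m365_connected: bool) -> dict:
    """Decide how to preview an item (hypothetical helper, not the real route)."""
    if source_type in ("local", "smb"):
        return {"kind": "file"}
    if source_type == "gdrive":
        # Google Drive files can be embedded in an iframe.
        return {
            "kind": "iframe",
            "url": f"https://drive.google.com/file/d/{item_id}/preview",
        }
    if source_type == "gmail":
        # Gmail cannot be embedded: show an info card with an outbound link.
        return {"kind": "info_card"}
    # Only the branches below need Microsoft Graph, so the auth guard lives here.
    if not m365_connected:
        return {"kind": "error", "status": 401}
    if source_type == "email":
        return {"kind": "graph_email"}
    return {"kind": "graph_drive_preview"}
```

Placing the Google branches before the auth guard is what lets a Google-only deployment open Drive and Gmail previews without an M365 sign-in.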
---
## [1.6.25] — 2026-04-25
### Added
- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
- **CPR cross-referencing (related documents)** — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects `?ref=N`). Powered by a self-join on the existing `cpr_index` table — no new data collection needed. New `GDPRDb.get_related_items(item_id, ref_scan_id)` method and `GET /api/db/related/<item_id>?ref=N` endpoint in `routes/database.py`. Frontend: `#previewRelated` div in the preview panel, `_loadRelated(f)` in `results.js`, `window._openRelated(id, itemData)` helper (looks up live `S.flaggedData` first, falls back to API response for history items).
- **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
- **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
- **`POST /api/file_sources/upload_key`** — new endpoint that validates and stores an SSH private key file, returning a `key_path` for use in the source definition.
- **SFTP entry in export SOURCE_MAP** — Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (`EDE9F7`), consistent with the existing per-source tab and summary table logic.
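
The CPR cross-referencing entry above describes a self-join on `cpr_index`. A minimal sketch of that idea follows, assuming a hypothetical two-column layout (`item_id`, `cpr_hash`); the real schema and the `ref_scan_id` session scoping of `GDPRDb.get_related_items` are not shown in this changelog and are left out here.

```python
import sqlite3


def get_related_items(conn, item_id):
    """Return (other_item_id, shared_cpr_count) rows, most shared first.

    Assumed schema (hypothetical): cpr_index(item_id TEXT, cpr_hash TEXT).
    """
    sql = """
        SELECT b.item_id, COUNT(DISTINCT b.cpr_hash) AS shared
        FROM cpr_index AS a
        JOIN cpr_index AS b
          ON a.cpr_hash = b.cpr_hash AND b.item_id != a.item_id
        WHERE a.item_id = ?
        GROUP BY b.item_id
        ORDER BY shared DESC, b.item_id
    """
    return conn.execute(sql, (item_id,)).fetchall()


def demo():
    # In-memory database with made-up identifiers, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT)")
    conn.executemany(
        "INSERT INTO cpr_index VALUES (?, ?)",
        [
            ("doc1", "h1"), ("doc1", "h2"),
            ("doc2", "h1"), ("doc2", "h2"),
            ("doc3", "h2"),
            ("doc4", "h9"),
        ],
    )
    return get_related_items(conn, "doc1")
```

In this toy data set `doc2` shares two values with `doc1` and `doc3` shares one, so `doc2` is listed first and `doc4` does not appear.
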
### Fixed
- **File source form placeholders untranslated** — all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new `data-i18n-placeholder` keys added to `en.json`, `da.json`, and `de.json`; all 12 affected `<input>` elements now carry `data-i18n-placeholder` attributes.
- **"Name" and "Auth" labels untranslated in SFTP form** — the source-name label and the Auth toggle label in the add-source panel had no `data-i18n` attributes. Added keys `m365_fsrc_name` (DA: "Navn") and `m365_fsrc_sftp_auth` (same across languages). The name label used an inner `<span data-i18n>` to preserve the required-field `*` indicator, which would have been clobbered by a `data-i18n` on the outer `<label>` element. The same clobber bug was fixed for the `m365_fsrc_label` usage in the edit form.
- **Password field placeholder showed "Stored in OS keychain" in English** — added translation key `m365_fsrc_pw_keychain_placeholder` (DA: "Gemt i OS-nøglering") and applied `data-i18n-placeholder` to the three password inputs across both forms (SMB add, SFTP add, SMB edit).
---
## [1.6.24] — 2026-04-25
### Fixed
- **Scheduler UI showed untranslated English strings** — frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in `scheduler.js` instead of using `t()`. All six call sites in `schedLoad()`, `schedRenderJobs()`, and `schedLoadHistory()` now call `t()` with the appropriate key. Three new translation keys added to `en.json`, `da.json`, and `de.json`: `m365_sched_no_jobs`, `m365_sched_running`, `m365_sched_disabled`.
---
## [1.6.23] — 2026-04-21
### Added
- **Video file metadata scanning** — `.mp4`, `.mov`, `.m4v`, `.avi`, `.mkv`, `.wmv`, `.flv`, `.webm` files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime `©xyz` atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the `gps_location` special category, exactly like a geotagged photo. AVI metadata (RIFF INFO `INAM`/`IART`/`ICMT`) is parsed without any external library. Requires `mutagen>=1.47` (added to `requirements.txt`).
- **Audio file metadata scanning** — `.mp3`, `.flac`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.wav`, `.opus`, `.aiff` files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses `mutagen.File(easy=True)` which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with `exif_pii`. Fixed a silent bug in `_extract_audio_metadata` where `mutagen.File(io.BytesIO(content), filename)` was passing the BytesIO as the `filename` positional argument; corrected to `mutagen.File(fileobj=..., filename=...)`.
- **Audio and video test fixtures** — `tests/fixtures/local_files/generate_fixtures.py` now generates 6 new fixtures: `14_audio_artist_pii.mp3`, `15_audio_artist_pii.flac` (artist name → flag), `16_audio_no_pii.mp3`, `17_audio_no_pii.flac` (no tags → no flag), `18_video_gps.mp4` (GPS + artist → flag), `19_video_no_pii.mp4` (no tags → no flag). Total fixtures: 19 (14 flagged, 5 negative).
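
The video entry above mentions GPS coordinates stored as an ISO 6709 string in the QuickTime `©xyz` atom. The sketch below shows one way such a string could be parsed; it only handles the decimal-degrees form (`±DD.DDDD±DDD.DDDD` with an optional altitude and trailing `/`), and `parse_iso6709` is a hypothetical helper, not the scanner's actual function.

```python
import re

# Decimal-degrees form only: latitude has two integer digits, longitude three.
_ISO6709 = re.compile(
    r"^(?P<lat>[+-]\d{2}(?:\.\d+)?)"
    r"(?P<lon>[+-]\d{3}(?:\.\d+)?)"
    r"(?P<alt>[+-]\d+(?:\.\d+)?)?/?$"
)


def parse_iso6709(value):
    """Return (lat, lon, alt) as floats, alt may be None; None if unparseable."""
    match = _ISO6709.match(value.strip())
    if match is None:
        return None
    lat = float(match.group("lat"))
    lon = float(match.group("lon"))
    alt = float(match.group("alt")) if match.group("alt") else None
    return lat, lon, alt
```

For example, `parse_iso6709("+55.6761+012.5683/")` yields a latitude of `55.6761` and a longitude of `12.5683` with no altitude, while a string that does not match the pattern yields `None`.
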
### Fixed
- **Audio and video files not appearing in local/SMB file scan** — `file_scanner.py` maintained its own hardcoded `DEFAULT_EXTENSIONS` set that was never updated when video and audio extensions were added to `cpr_detector.SUPPORTED_EXTS`. Fixed by importing `SUPPORTED_EXTS` from `cpr_detector` directly; `DEFAULT_EXTENSIONS` is now an alias for it. `cpr_detector.SUPPORTED_EXTS` is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
- **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
---
## [1.6.22] — 2026-04-21
### Added
- **Auto-email after manual scan** — a new **Email report after manual scan** toggle in **Settings → Email report** sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as `auto_email_manual` in `smtp.json`. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected).
- **Route integration test suite** — 44 new tests in `tests/test_route_integration.py` covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on `GET /api/db/flagged`, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on `run_scan()` exception, and `GET /api/db/sessions` shape and ordering. Total test count: 172.
### Fixed
- **Role scope filter silently returned nothing** — `GET /api/db/flagged` filtered rows by `row.get("role")` but the column returned from the DB is `user_role`. Role-scoped viewer tokens (`{"role": "student"}` or `{"role": "staff"}`) therefore excluded every item and returned an empty list. Fixed in `routes/database.py`.
- **Historical session query included newer scans** — `gdpr_db.get_session_items(ref_scan_id=N)` used a lower-bounded window (`started_at >= ref.started_at - 300`) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. Fixed by adding an upper bound (`started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`).
- **Scan button stuck disabled after file scan** — `run_file_scan` broadcast a `scan_start` SSE event, which the `scan_start` handler in `_attachSchedulerListeners` intercepted and set `S._m365ScanRunning = true`. When `file_scan_done` fired it checked `!S._m365ScanRunning` before re-enabling the button — finding it still `true`, the button stayed disabled permanently. No `scan_done` (M365) ever arrives to clear the flag. Fixed by removing the `scan_start` broadcast from `run_file_scan`; the `scan_phase "Files — …"` event immediately following already sets `_fileScanRunning` correctly via the phase-source detection in `_attachScanListeners`.
- **`TypeError: unhashable type: 'dict'` during file and M365 scans** — `_distinct_cprs = list(dict.fromkeys(cprs))` in both scan paths treated `cprs` as a list of strings, but `extract_matches` returns a list of dicts (`{"formatted": "…", "page": …, …}`). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both `run_file_scan` (line 251) and `run_scan` (line 1100) by keying on `c["formatted"]`: `list(dict.fromkeys(c["formatted"] for c in cprs))`.
- **Profile applied early lost user selection and source checkboxes** — two startup race conditions: (1) Profiles with `user_ids = "all"` applied before the M365 user list had loaded ran `.forEach()` on an empty array (no-op); when `loadUsers()` completed it defaulted all users to `selected = false` with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a `_pendingProfileAllUsers` deferred flag mirroring the existing `_pendingProfileUserIds` mechanism — `loadUsers()` applies it after populating `S._allUsers`. (2) If the profile was selected in the narrow window before `_loadFileSources()` returned and rendered the sources panel, `_applyProfile()` iterated zero checkboxes and the source selection was silently discarded; a subsequent `renderSourcesPanel()` call then re-rendered all sources as checked (their default). Fixed by calling `renderSourcesPanel()` in `_applyProfile()` when no source checkboxes are present in the DOM yet — same guard already used in `loadUsers()`.
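
The `unhashable type: 'dict'` fix above can be illustrated in isolation. This is a minimal sketch using made-up match dicts in the shape the entry describes (`{"formatted": …, "page": …}`); `distinct_cprs` is a hypothetical helper name and the CPR values are placeholders.

```python
def distinct_cprs(matches):
    """Deduplicate match dicts by their "formatted" value, keeping first-seen order."""
    # dict.fromkeys() needs hashable keys, so key on the string, not the dict.
    return list(dict.fromkeys(m["formatted"] for m in matches))


# Placeholder data for illustration only.
sample_matches = [
    {"formatted": "010190-1234", "page": 1},
    {"formatted": "020280-5678", "page": 1},
    {"formatted": "010190-1234", "page": 2},
]
```

Calling `dict.fromkeys(sample_matches)` directly raises `TypeError` because dicts are not hashable, which is the crash described in the entry; `distinct_cprs(sample_matches)` returns the two distinct strings in order of first appearance.
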
---
## [1.6.21] — 2026-04-20
### Added
- **Local-file scan test fixtures** — `tests/fixtures/local_files/` contains 13 ready-made files (`.txt`, `.csv`, `.docx`, `.xlsx`) covering every detection scenario: CPR with explicit label, mod-11-valid CPR without label, post-2007 CPR with/without context keyword, protected number (day+40), multiple CPRs in one file, mixed PII (CPR + email + Art. 9 health data), and three true-negative cases (clean content, invoice false-positive, post-2007 serial number without context). All CPR numbers are mathematically valid; false-positive fixtures are verified to produce zero hits. Run `generate_fixtures.py` to regenerate the binary files.
- **Interface PIN** — optional session-level authentication gate for the main scanner interface. Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; anyone reaching `http://host:5100` is redirected to `/login` and must enter the PIN before accessing scan controls, settings, or results. Viewer tokens and the `/view` route are completely unaffected — reviewers continue to use their own auth chain. The PIN is stored as a salted SHA-256 hash in `config.json`. Brute-force protection: 5 failed attempts per IP locks out for 5 minutes. A `POST /api/interface/logout` endpoint clears the session. PIN management via `GET/POST/DELETE /api/interface/pin`.
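
The Interface PIN entry above mentions a salted SHA-256 hash and a lockout after 5 failed attempts per IP for 5 minutes. A rough sketch of those two pieces follows; the function names, the salt format, and the in-memory `LockoutTracker` are assumptions for illustration and not the application's code.

```python
import hashlib
import hmac
import secrets
import time

MAX_ATTEMPTS = 5          # failed attempts before lockout (from the entry above)
LOCKOUT_SECONDS = 5 * 60  # lockout duration (from the entry above)


def hash_pin(pin, salt=None):
    """Return (salt, hex digest) for a PIN; a fresh random salt is made if none is given."""
    salt = salt or secrets.token_hex(16)
    digest = hashlib.sha256((salt + pin).encode("utf-8")).hexdigest()
    return salt, digest


def verify_pin(pin, salt, expected_digest):
    """Constant-time comparison of the recomputed digest with the stored one."""
    _, digest = hash_pin(pin, salt)
    return hmac.compare_digest(digest, expected_digest)


class LockoutTracker:
    """Per-IP failure counter (hypothetical, in-memory only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._state = {}  # ip -> (failure_count, locked_until)

    def is_locked(self, ip):
        count, locked_until = self._state.get(ip, (0, 0.0))
        return count >= MAX_ATTEMPTS and self._clock() < locked_until

    def record_failure(self, ip):
        count, locked_until = self._state.get(ip, (0, 0.0))
        if count >= MAX_ATTEMPTS and self._clock() >= locked_until:
            count = 0  # previous lockout has expired, start counting again
        count += 1
        locked_until = self._clock() + LOCKOUT_SECONDS if count >= MAX_ATTEMPTS else 0.0
        self._state[ip] = (count, locked_until)

    def record_success(self, ip):
        self._state.pop(ip, None)
```

Injecting the clock keeps the lockout logic testable without sleeping; a production version would also need to persist or expire the per-IP state.
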
### Fixed
- **"Vælg" (select mode) button did nothing** — `toggleSelectMode`, `toggleCardSelect`, `selectAllVisible`, and `applyBulkDisposition` were defined inside an ES module but never assigned to `window`, so all `onclick` attributes calling them silently failed. Added the four missing `window.*` exports at the bottom of `results.js`.
- **Progress counter frozen at M365 total during Google/file scan** — the `scan_progress` handler in `scan.js` only updated `progressStats` and `progressEta` for `source === "m365"`. When M365 finished first, the counter stayed at its final value (e.g. "15083 / 15083 ETA 0s") for the entire duration of the Google and file scans. Fixed in two places: `scan_done` now clears the stats/ETA elements immediately when another scan is still running; `scan_progress` for Google/file sources now shows a running `"X scanned"` count (using the `scanned` field those engines already send) and clears ETA, but only while M365 is not running — M365 stats continue to dominate during concurrent scans.
- **PDF OCR kills process on large files** — `document_scanner` previously called `convert_from_path()` once for the entire PDF before the processing loop, allocating all page images in memory simultaneously. A 50-page A4 PDF at 300 DPI required ~1.3 GB in a single allocation, triggering the OS OOM killer. Fixed by rendering one page at a time with `convert_from_path(first_page=N, last_page=N)` inside the loop across `scan_pdf`, `redact_fitz_pdf`, and `redact_pdf`. Peak OCR memory is now bounded to roughly one page (~26 MB at 300 DPI) regardless of document length.
- **No bulk disposition tagging** — each result card had to be opened individually to set a disposition. Added a Select mode (filter bar "Vælg" button) that reveals per-card checkboxes. Selecting one or more items shows a bulk tag bar at the bottom of the grid with a disposition dropdown and Apply button. Calls `POST /api/db/disposition/bulk`; updates all selected items in-memory and clears the selection. "Select all visible" / "Deselect all" toggle available in the bar. Hidden in viewer mode.
- **No disposition progress summary** — added a thin stats bar between the filter bar and the grid showing total · unreviewed · retain · delete · % reviewed. Updates after every single or bulk disposition save and after each grid render. Unreviewed count is highlighted in red until everything is tagged; turns green at 100%.
- **Google Drive always did a full scan** — Drive scanning in `routes/google_scan.py` used `conn.iter_drive_files()` on every run, re-downloading every file regardless of what changed. Added Google Drive delta scan using the Drive Changes API. When `delta` is enabled in scan options, the first run records a Changes API start page token per user (`gdrive:{email}` key in `delta.json`). Subsequent runs call `conn.get_drive_changes(user_email, token)` and only process files that have been added or modified since the last scan. Invalid or expired tokens fall back to a full scan automatically. Token save loads the current `delta.json` fresh before writing to avoid racing with concurrent M365 token saves. `google_scan_done` SSE event now includes `delta` and `delta_sources` fields.
- **No memory guard before OCR page renders** — added `_ocr_mem_ok()` check (`psutil.virtual_memory().available >= 500 MB`) before each page render in all three OCR paths. Pages that would exceed the threshold are skipped and recorded as `"skipped"` in `page_methods` with a printed warning rather than crashing the scan.
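
The memory figures quoted in the PDF OCR entry above (~26 MB per page, ~1.3 GB for 50 pages) can be reproduced with a short calculation, assuming uncompressed 8-bit RGB page images at A4 size. The helper name is illustrative.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # A4 width and height in millimetres


def page_bytes(dpi, channels=3):
    """Bytes needed for one uncompressed A4 page image at the given DPI."""
    width_px = round(A4_MM[0] / MM_PER_INCH * dpi)
    height_px = round(A4_MM[1] / MM_PER_INCH * dpi)
    return width_px * height_px * channels


per_page = page_bytes(300)   # one page at a time: about 26 MB
all_pages = 50 * per_page    # all 50 pages at once: about 1.3 GB
```

At 300 DPI an A4 page is 2480 × 3508 pixels, so one RGB page takes 26,099,520 bytes and fifty pages take 1,304,976,000 bytes, matching the numbers in the entry.
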
---
## [1.6.20] — 2026-04-18
### Fixed
- **Graph `sendMail` reported as failure despite email being delivered** — `_post()` in `m365_connector.py` called `r.json()` unconditionally after `raise_for_status()`. The Graph `sendMail` endpoint returns HTTP 202 with an empty body on success, causing `json.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. This was caught by the `smtp_test` exception handler and surfaced as an error even though the email had been sent. Fixed by returning `r.json() if r.content else {}` so any Graph endpoint that responds with no body (sendMail, delete operations, etc.) is handled correctly.
- **Graph error hidden when SMTP host not configured** — when Graph failed and no SMTP host was saved, `smtp_test` returned the generic "No SMTP host configured" message, swallowing the actual Graph error. The `if not host` branch now surfaces the Graph exception text alongside the Mail.Send permission guidance so the real cause is visible.
- **Gmail vs Google Workspace SMTP error messages** — the auth failure handler now detects whether the username is a personal Gmail address (`@gmail.com`) or a Google Workspace custom-domain account, and shows a different message for each. Personal Gmail: existing App Password troubleshooting steps. Google Workspace: explains that SMTP access is controlled by the Workspace admin console (2-Step Verification policy, SMTP relay service), not the user's personal security settings.
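
The `sendMail` fix above comes down to not parsing an empty body. The sketch below reproduces the failure and the guard with a stand-in response object; `FakeResponse` and `parse_body` are hypothetical names used so the example runs without a network call or the `requests` package.

```python
import json


class FakeResponse:
    """Stand-in exposing only the two attributes the guard relies on."""

    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content

    def json(self):
        # Parses the raw body; raises JSONDecodeError when the body is empty.
        return json.loads(self.content)


def parse_body(response):
    """Return the decoded JSON body, or {} when the response has no body."""
    return response.json() if response.content else {}
```

With this guard, a 202 response with an empty body maps to `{}` instead of raising, while responses that do carry JSON are decoded as before.
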
---
## [1.6.19] — 2026-04-18
### Fixed
- **Gmail SMTP error message misleading when App Password already in use** — the auth failure handler in both `smtp_test` and `send_report` unconditionally told the user to "create an App Password", even when they were already using one. Gmail returns the same `535` / `Username and Password not accepted` error for a wrong app password, a revoked app password, spaces left in the 16-character code, or a wrong username — none of which are helped by the old message. The Gmail branch now lists the three most common causes (spaces in the code, revoked password, wrong username) and still links to the App Password page to generate a new one. The Microsoft personal account branch is unchanged.
---
## [1.6.18] — 2026-04-18
### Fixed
- **Art.30 and Excel exports missing GWS and local/SMB sources** — two silent failures caused Google Workspace and file-scan results to be absent from all exports after a page reload.
  - `routes/google_scan.py`: called `_db.end_scan()` (method does not exist on `GDPRDb` — the correct name is `finish_scan`). The resulting `AttributeError` was swallowed by the bare `except Exception: pass` guard, so `finished_at` was never written on GWS scan records. Since `get_session_items()` requires `finished_at IS NOT NULL`, every GWS scan was permanently invisible to both export functions.
  - `routes/google_scan.py`: emitted `"scan_done"` at completion instead of `"google_scan_done"`, causing the M365 done handler to fire for Google scans and breaking the SSE teardown logic.
  - `scan_engine.py` (`run_file_scan`): called `_db.begin_scan(sources=…, user_count=0, options=source)` with keyword arguments, but `begin_scan(self, options: dict)` only accepts a single positional dict. The `TypeError` was caught silently, leaving `_db_scan_id = None`; all subsequent `save_item` calls were skipped, so local and SMB items were never written to the database.
---
## [1.6.17] — 2026-04-18
### Added
- **Scan history browser** — results from any past scan session can now be reviewed without running a new scan. On page load, when no scan is running, the last completed session is automatically loaded into the results grid. A **History** banner appears above the filter bar showing the session date, scanned sources, and item count. A **Sessions** button in the banner opens a dropdown listing all past sessions newest-first, each showing date, time, source labels, item count, and Delta / Latest badges. Clicking a session loads its items. A **Latest scan** button (shown only when browsing a past session) jumps back to the most recent session. Starting a new scan exits history mode and takes over the grid with live SSE results. Session cache is invalidated on each scan completion so the picker always reflects the true state of the database.
  - `gdpr_db.py` — new `get_sessions(limit, window_seconds)` groups all completed scans by the 300-second concurrent-scan window and returns session summaries newest-first. `get_session_items()` gains an optional `ref_scan_id` parameter to anchor the session window to any past scan.
  - `routes/database.py` — new `GET /api/db/sessions`; `GET /api/db/flagged` now accepts `?ref=<scan_id>` to serve items for a specific historical session.
  - `static/js/history.js` (new) — `loadHistorySession(refScanId)`, `openHistoryPicker()`, `closeHistoryPicker()`, `exitHistoryMode()`, `invalidateHistoryCache()` all exposed on `window`.
  - `state.js` — `_historyRefScanId: null` tracks which session is currently displayed (`null` = live/SSE).
  - `results.js` — initial status check calls `loadHistorySession(null)` instead of `loadLastScanSummary()`.
  - `scan.js` — `startScan()` calls `exitHistoryMode()`; all three `*_done` handlers call `invalidateHistoryCache()`.
- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.
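
The user-scoped token entry above describes server-side filtering with `WHERE account_id IN (list)`. A minimal sketch of that filter with parameterised placeholders follows; the `id` column and the `flagged_for_scope` helper are assumptions, and only `flagged_items`, `account_id`, and the scope shape come from the changelog.

```python
import sqlite3


def flagged_for_scope(conn, scope):
    """Return (id, account_id) rows, restricted to scope["user"] when present."""
    sql = "SELECT id, account_id FROM flagged_items"
    params = []
    users = (scope or {}).get("user")
    if users:
        # One "?" per address so values are bound, never interpolated.
        placeholders = ", ".join("?" for _ in users)
        sql += f" WHERE account_id IN ({placeholders})"
        params = list(users)
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()


def demo_connection():
    # Illustrative data: two accounts for one person, another user, one file-scan row.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flagged_items (id INTEGER, account_id TEXT)")
    conn.executemany(
        "INSERT INTO flagged_items VALUES (?, ?)",
        [(1, "alice@m365.dk"), (2, "alice@gws.dk"), (3, "bob@m365.dk"), (4, "")],
    )
    return conn
```

A scope of `{"user": ["alice@m365.dk", "alice@gws.dk"]}` returns only rows 1 and 2; the file-scan row with an empty `account_id` is excluded, consistent with the behaviour described in the entry.
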
---
## [1.6.16] — 2026-04-18
### Added
- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.
---
## [1.6.15] — 2026-04-12
### Added
- **Role-scoped viewer tokens** — viewer token links can now be restricted to a single role so the recipient can only see student or staff items. A new **Role scope** dropdown (All roles / Ansatte / Elever) in the Share modal is selected when creating a token. The scope is stored as `"scope": {"role": "student"|"staff"}` in `viewer_tokens.json`. Enforcement is two-layered: `GET /api/db/flagged` filters items server-side using `session["viewer_scope"].role` set at token validation time; the `#filterRole` dropdown in the viewer is pre-set and hidden so the constraint cannot be bypassed client-side. Tokens without a scope field (existing tokens, PIN sessions) remain unrestricted. Role badge (Ansatte / Elever) shown on each scoped token row in the Active links list.
- **Role filter in results + role-scoped exports** — a new **Role** dropdown in the filter bar (All roles / Ansatte / Elever) narrows the results grid to staff or student items. Clicking **Excel** or **Art.30** while a role is selected exports only that group — the `?role=student|staff` param is forwarded to both export endpoints. `_build_excel_bytes()` and `_build_article30_docx()` now accept a `role` param; all internal sheets (GPS, External transfers, Art.30 staff/student tables) respect the filter. Filenames get an `_elever` or `_ansatte` suffix.
- **Scan filter options for student environments** — two new profile options reduce noise when scanning student accounts:
  - **Ignore GPS in images** (`skip_gps_images`) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
  - **Min. CPR count per file** (`min_cpr_count`, default 1) — a file is only flagged if it contains at least this many *distinct* CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
  - Both options are saved in profiles and editable in the Profile Manager editor.
- **GitHub Actions CI/CD — macOS build** — `.github/workflows/build.yml` now also builds a macOS `.app` bundle (`macos-15`, Apple Silicon ARM64) on every push to `main` and on `v*` tags. Released as `GDPRScanner_macos_arm64.zip`. (Originally `macos-13` / Intel, changed when GitHub retired that runner.)
### Fixed
- **OneDrive 404 errors during delta scans** — `GET /users/{id}/drive/root/delta` returns 404 for users with no OneDrive licence, a disabled service plan, a drive that was never provisioned (account never signed in), or a suspended account. Previously these 404s fell through to `requests.raise_for_status()` and were caught by the generic `except Exception` handler in `_scan_user_onedrive`, broadcasting a red `scan_error` card. Full scans never showed the error because `_iter_drive_folder_for` has a bare `except Exception: return`. Fixed by adding `M365DriveNotFound(M365Error)` to `m365_connector.py`, raising it from `_get()` on HTTP 404, and handling it explicitly in `_scan_user_onedrive` with a `scan_phase` broadcast ("OneDrive (user): not provisioned — skipped") before the generic exception handler.
- **CI — Windows artifact never uploaded** — PyInstaller `--onedir` puts the exe inside `dist/GDPRScanner/`, not at `dist/*.exe`. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips `dist\GDPRScanner\` into `GDPRScanner_windows_x64.zip` (mirroring the existing Linux step).
- **GitHub Actions CI/CD** — automated build workflow (`.github/workflows/build.yml`) builds Windows `.exe` and Linux binary on every push to `main`. Creates a GitHub Release with artifacts when a `v*` tag is pushed.
- **`EFFORT_ESTIMATE.md`** — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
- **Settings → Security tab** — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's **Configure** button navigates directly to the Security tab.
- **Viewer mode layout** — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The `🔍 GDPRScanner` brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
### Fixed
- **Share modal — Revoke / Copy buttons broken** — `JSON.stringify(token)` produced a double-quoted string that terminated the surrounding `onclick="…"` HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format.
- **Viewer PIN — Clear PIN rejected with "current PIN is incorrect"** — clicking **Clear PIN** without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty.
- **Share modal — all UI strings now translated** — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (`share_*`, `viewer_pin_*`) in `en.json`, `da.json`, and `de.json`.
- **Excel / ART.30 export — Gmail and Google Drive missing from summary** — `by_source` was built from flagged items only, so sources that produced zero hits were silently skipped. Both the Excel Summary sheet and the ART.30 "Breakdown by source" table now include every source that was actually scanned, showing `0` items and `0` CPR hits where nothing was found. New `GDPRDb.get_session_sources()` method reads the `sources` JSON column from all scans in the current session window to determine which sources ran.
- **Scan never finishes when M365 + Google run concurrently** — `scan_done` (M365 finished) was closing the SSE connection immediately via `S.es.close()`, even when `S._googleScanRunning` or `S._fileScanRunning` was still true. The `google_scan_done` / `file_scan_done` events therefore never arrived, leaving the progress bar stuck at 100% indefinitely. SSE teardown is now deferred until the last concurrent scan completes: `scan_done` only closes the connection if neither Google nor File is still running; `google_scan_done` and `file_scan_done` close it when they are the final scan to finish.
---

View File

@ -16,27 +16,11 @@ python -m pytest tests/ -q
**Split modules:** `scan_engine.py` (M365 + file scan), `sse.py` (SSE broadcast), `checkpoint.py`, `app_config.py` (all persistence), `cpr_detector.py`
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` and update the token. Invalid/expired tokens fall back to full scan automatically.
**Google connector write-back** — `google_connector.py` exposes `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both connectors for in-place Drive redaction. These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; the rest of the pipeline is source-agnostic. Auth: `"password"` via OS keychain; `"key"` from `~/.gdprscanner/sftp_keys/<uuid>`. `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Single-file I/O: `_ssh_connect()`, `read_file(remote_path)`, `write_file(remote_path, content)` — do not duplicate SSH setup outside these methods.
**Shared content processing** — all three scan engines funnel downloaded bytes through `cpr_detector._scan_bytes(content, filename)`. `scan_engine.py` uses `_scan_bytes_timeout` for PDFs (subprocess + hard timeout). Do not duplicate file-type handling in per-source code.
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS`. Do not maintain a separate extension list anywhere else.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines no-op stubs at module level (avoids circular import). `gdpr_scanner.py` overwrites them at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. Do not import them directly in those modules.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for SSE constraints, export, preview, scheduler, NER, audit log, viewer, software update, and other route-specific rules.
**Self-update (server only)** — `routes/updates.py` powers **Settings → General → Software update**: git fetch → ff-only merge → conditional `pip install` → `os.execv` restart (same PID; marks fds close-on-exec first so Werkzeug's inheritable listening socket doesn't leak and squat the port). Only enabled for git checkouts (`_supported()` is false for frozen desktop builds). `update_gdpr.sh` is the CLI/cron equivalent. Refused while a scan runs; optional daily auto-update thread (`config.json["auto_update"]`). Restart keeps port 5100 (the port probe uses `SO_REUSEADDR` + a 10s grace). See `routes/CLAUDE.md` → "Software update".
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
**Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. Use `_cp_path(prefix)` — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json` (also holds `claude_api_key`/`claude_ner` and the `auto_update` flag), `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`. Static files are served with `SEND_FILE_MAX_AGE_DEFAULT=0` (ETag revalidation) so the UI is fresh after a self-update — do not re-add long static caching.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
## Non-obvious files
@ -46,70 +30,57 @@ python -m pytest tests/ -q
| `routes/state.py` | Shared mutable state + scan locks (not a typical Flask state file) |
| `routes/google_scan.py` | Google scan execution lives here, not in `google_connector.py` |
| `routes/viewer.py` | Viewer token + PIN API; also owns brute-force rate-limit state |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI. Also defines `window._copyText` (HTTP-safe clipboard helper reused by `log.js`) |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI |
| `lang/da.json` | Primary language — source of truth is `en.json` |
| `build_gdpr.py` | Desktop app builder; contains embedded `LAUNCHER_CODE` for PyInstaller |
| `routes/updates.py` | Self-update routes + `os.execv` restart with fd-cleanup; git-checkout only |
| `update_gdpr.sh` | CLI/cron self-update (fetch, ff-merge, deps, service restart) |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy (LAN-only, Let's Encrypt DNS-01) |
## Tests
215 tests in `tests/`. No integration tests for live M365/Google connections.
128 tests in `tests/`. No integration tests for Flask routes or live M365/Google connections.
**`tests/test_updates.py`** — 12 tests for the software-update routes (`routes/updates.py`). All git interaction goes through a mocked `_git()`; `_schedule_restart` is patched so no test re-execs the process, and `gdpr_db.log_audit_event` is patched so no test writes the real database. Includes `_mark_fds_cloexec` (the socket-leak guard for the restart).
## Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
Read-only access for DPOs and reviewers. Key invariants:
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename. Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database.
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- **`window.VIEWER_MODE`** — injected by Jinja2 in `index.html`. `auth.js` reads it at startup; adds `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks — except `delBtn` in the card builder which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- **`viewer_tokens.json` format** — stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. The old bare-list format is migrated transparently on first write. Do not write the file as a bare list.
- **`app.secret_key`** — derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- **`GET /api/db/flagged`** — returns `get_session_items()` (last completed scan session, joined with dispositions). Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- **Rate-limit state** (`_pin_attempts` dict in `routes/viewer.py`) — in-memory only, resets on server restart. Intentional — a restart clears lockouts without a persistent store.
- **Token onclick attributes** — Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- **`stClearViewerPin` guard** — validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- **Share link base URL** — `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin` — that produces `127.0.0.1` links useless to remote viewers.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1` — do not change those to `0.0.0.0`.
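The UDP probe mentioned for `/api/local_ip` can be sketched roughly as follows. This is a minimal illustration rather than the project's actual route code: the function name `guess_lan_ip` and the loopback fallback are assumptions made for the example.

```python
import socket


def guess_lan_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the local address the OS would use to reach probe_host.

    Hypothetical helper: the name and the loopback fallback are assumptions.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only makes the
        # kernel pick the outgoing interface for that destination.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route available (offline machine, sandbox): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


lan_ip = guess_lan_ip()
```

Because no datagram is actually sent, the probe is cheap and does not depend on the remote host answering; it only needs a routing table entry for the destination.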
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files (14 flagged, 5 true negatives). `generate_fixtures.py` regenerates the binary files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block.
## Sources panel resize — static/js/log.js + sources.js
**`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` concatenates all run texts with no separators. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries. Do not remove this trailing space — the detection will silently regress.
## Scan filter options — scan_engine.py
All options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan).
- **`skip_gps_images` (bool, default `false`)** — images whose only PII is GPS coordinates are not flagged. GPS data still stored in `exif` field if flagged by another signal.
- **`min_cpr_count` (int, default `1`)** — minimum distinct CPR numbers before flagging. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — do not revert to `dict.fromkeys(cprs)` (raises `TypeError: unhashable type: 'dict'`). Files with faces or EXIF PII are still flagged regardless.
- **`cpr_only` (bool, default `false`)** — skip items whose only hits are email addresses, phone numbers, faces, or EXIF/GPS metadata.
- **`ocr_lang` (str, default `"dan+eng"`)** — Tesseract language packs. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner` and the PDF-OCR subprocess worker. Cache key already includes `lang`.
- **File scan** reads options from `source` dict keys directly. **M365 scan** reads from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic.
- **UI:** sidebar `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All saved/loaded by `profiles.js`.
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height is controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered** — `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
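The `min_cpr_count` deduplication rule from the scan filter options can be illustrated with a small sketch. Only the `c["formatted"]` key and the `dict.fromkeys` idiom come from the notes; the helper names, the extra `page` key, and the sample values are made up for the example.

```python
def distinct_cprs(cprs):
    """Order-preserving dedup on the hashable "formatted" string of each hit."""
    return list(dict.fromkeys(c["formatted"] for c in cprs))


def meets_min_cpr_count(cprs, min_cpr_count=1):
    """Hypothetical helper: True when enough distinct CPRs were found."""
    return len(distinct_cprs(cprs)) >= min_cpr_count


# Hit dicts as an assumption: only the "formatted" key is taken from the notes.
hits = [
    {"formatted": "010190-1234", "page": 1},
    {"formatted": "020280-5678", "page": 1},
    {"formatted": "010190-1234", "page": 2},  # repeat of the first number
]

unique = distinct_cprs(hits)

# Passing the dicts themselves fails, because dicts cannot be dict keys.
try:
    dict.fromkeys(hits)
    raw_dedup_error = None
except TypeError as exc:
    raw_dedup_error = exc
```

Keying on the formatted string keeps first-seen order, which a `set` would not guarantee.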
## Memory management — scan_engine.py
- **Email body stripped at collection time** — `_scan_user_email` stores body as `msg["_precomputed_body"]`, deletes `msg["body"]` and `msg["bodyPreview"]`. Processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to `$select` without also stripping it.
- **`body_excerpt`** — 500-char plain-text preview stored per flagged email; flows into `flagged_items`, checkpoint JSON, and DB. Do not remove before broadcasting — needed for preview on checkpoint resume.
- **`work_items` → `deque` before processing** — drained via `popleft()` so each item's memory is released immediately. Do not convert back to a list.
- **`del content` / `del body_text`** — raw bytes and body text deleted immediately after use. Both hit and no-hit paths have explicit deletes.
- **PDF OCR rendered page-by-page** — `convert_from_path(first_page=N, last_page=N)` inside the loop; only one page image in memory at a time. Do NOT revert to a bulk call — triggers OOM on large PDFs.
- **OCR memory guard** — `_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; skips if < 300 MB free.
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
## Scan history browser — gdpr_db.py
- **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
- **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
- **`account_name` (display name) is persisted** (migration 11) so DB-loaded cards show the user badge. Legacy rows predating it have `account_name=''` — the frontend `_accountPill` resolves a fallback and still shows the group badge from `user_role`. `save_item` must keep writing `card["account_name"]` (both M365 and Google cards carry it).
- **Scans must be finalised or their items are invisible** — `get_session_items`, `get_open_items`, and `latest_scan_id` all filter on `finished_at IS NOT NULL`. The file scan finalises in a `finally`; M365 (`run_scan`) and Google (`_run_google_scan`) `return` early on abort, so each now calls `finish_scan` before that abort-return. A process kill (deploy/OOM/crash) mid-scan still strands a scan → **`finalize_orphan_scans()`** runs once at server startup (`gdpr_scanner.py` `__main__`, before the scheduler) and finalises every `finished_at IS NULL` scan (safe because nothing is scanning at boot). Do not add a scan-results query that ignores `finished_at` instead of fixing finalisation.
- **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
- **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
- See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
- **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
- **PDF OCR images freed page-by-page** — in `document_scanner.scan_pdf`, `images[page_num-1] = None` immediately after OCR. Do not cache or accumulate page images.
- **Memory guard** — `psutil.virtual_memory().available` checked before each M365 file download; scan skips the file if < 300 MB free.
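The `deque` draining rule above can be sketched like this. It is an illustrative reduction, not the code in `scan_engine.py`: the `drain` function, its `handle` callback, and the choice to clear the caller's list are assumptions made for the example.

```python
from collections import deque


def drain(work_items, handle):
    """Process items one at a time, dropping each reference as soon as it is handled.

    Hypothetical helper: clearing the input list is an assumption so that the
    deque holds the only remaining references to the queued items.
    """
    queue = deque(work_items)
    work_items.clear()
    processed = 0
    while queue:
        item = queue.popleft()  # the deque no longer references this item
        handle(item)
        del item                # last local reference gone before the next pop
        processed += 1
    return processed


seen = []
items = [{"id": 1, "body": "x" * 10}, {"id": 2, "body": "y" * 10}]
count = drain(items, lambda it: seen.append(it["id"]))
```

Iterating a list with `enumerate()` would keep every item alive until the loop ends, which is the behaviour the rule is guarding against.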
## Global gotchas
- **Pattern matching in Python** — when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Use `in` check first and print if not found.
- **`__getattr__` on modules** — only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- **`JSON.stringify` inside `onclick="…"` attributes** — produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler. When the object is embedded as an `onclick` payload, also `.replace(/"/g,'&quot;')` it (matches the delete/redact button pattern) so a `"` in a filename can't break out.
- **Escape scan-derived strings before `innerHTML`** — file names, account/display names, folders, and source labels come from scanned content and may contain markup. Pass them through `esc()` (in `results.js`) before embedding in `innerHTML` or `title=`/`alt=` attributes. Server-side SVG/HTML built from request params (e.g. `_placeholder_svg` for `/api/thumb`) must use `_html_esc`. Skipping either re-introduces stored/reflected XSS.
- **Secrets at rest use the machine-keyed Fernet** — the SMTP password and Claude API key are encrypted via `app_config._encrypt_password` / `_decrypt_password`. New secret-bearing config fields must follow the same pattern; read them through a decrypting accessor (e.g. `get_claude_api_key()`), never `_load_config().get(...)` directly.
- **`JSON.stringify` inside `onclick="…"` attributes** — produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler.
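The module `__getattr__` gotcha above can be reproduced in isolation. The sketch below builds a throwaway module in memory; the module name `demo` and the `scan_bytes` attribute are placeholders for illustration, not the project's real names.

```python
import types

# Source of a throwaway module that defines a module-level __getattr__ (PEP 562).
SOURCE = '''
def __getattr__(name):
    if name == "scan_bytes":
        return lambda data: len(data)
    raise AttributeError(name)

def call_by_bare_name():
    # Bare-name lookup checks module globals and builtins only;
    # it never consults the module-level __getattr__.
    return scan_bytes(b"abc")
'''

demo = types.ModuleType("demo")
exec(SOURCE, demo.__dict__)

# Attribute access from outside the module goes through __getattr__.
outside = demo.scan_bytes(b"abc")

# The same name used bare inside the module's own function is not found.
try:
    demo.call_by_bare_name()
    inside = "resolved"
except NameError:
    inside = "NameError"
```

This is why lazily injected names work as `module.name` from another module but fail when referenced bare inside the defining module's functions.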
## Directory-scoped rules
- `routes/CLAUDE.md` — SSE constraints, M365 exceptions, export, preview, audit log, email, scheduler, Claude NER, viewer route, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar, SSE teardown, history browser, CPR cross-referencing, sources panel resize, viewer JS, JS gotchas
- `routes/CLAUDE.md` — SSE constraints, scan_progress source field, file_sources, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar phase parsing, JS gotchas
- `templates/CLAUDE.md` — CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md` — i18n conventions

View File

@ -1,16 +1,15 @@
# Contributing to GDPR Scanner
Thank you for considering a contribution. This project helps organisations find
and manage personal data across Microsoft 365 (Exchange, OneDrive, SharePoint,
Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems.
Contributions that improve compliance coverage, reliability, and usability are
very welcome.
and manage personal data in Microsoft 365 tenants. Contributions that improve
compliance coverage, reliability, and usability are very welcome.
---
## Before You Start
- Check the [open issues](../../issues) to see if your idea is already tracked
- Check the [open issues](../../issues) and [SUGGESTIONS.md](SUGGESTIONS.md) to
see if your idea is already tracked
- For large features, open an issue first to discuss the approach — this avoids
wasted effort if the direction doesn't fit
- Security vulnerabilities: see [SECURITY.md](SECURITY.md) — do not file public issues
@ -32,16 +31,16 @@ pip install -r requirements.txt
# Danish NER model (optional — needed for name/address detection)
python -m spacy download da_core_news_lg
# Start the scanner (serves on http://0.0.0.0:5100)
python gdpr_scanner.py
# Run the Document Scanner
python server.py
# Run the test suite
python -m pytest tests/ -q
# Run the GDPRScanner
python gdpr_scanner.py
```
To test against a real M365 tenant you will need a Microsoft Azure app
registration with the permissions described in the README. A free developer
tenant is available via the [Microsoft 365 Developer Program](https://developer.microsoft.com/microsoft-365/dev-program).
You will need a Microsoft Azure app registration with the permissions described
in the README to test GDPRScanner against a real tenant. A developer tenant
is available for free via the [Microsoft 365 Developer Program](https://developer.microsoft.com/microsoft-365/dev-program).
---
@ -49,7 +48,8 @@ tenant is available via the [Microsoft 365 Developer Program](https://developer.
- Bug fixes
- Improved CPR false-positive reduction
- New language files (see `lang/en.json` for the key list)
- New language files (see `lang/en.lang` for the key list)
- Items from [SUGGESTIONS.md](SUGGESTIONS.md) — check the status column first
- Performance improvements for large tenants
- Docker / deployment improvements
- Documentation fixes
@ -65,7 +65,7 @@ tenant is available via the [Microsoft 365 Developer Program](https://developer.
- All personal data (CPR numbers) must be SHA-256 hashed before storage — never store or log raw CPR values
- Wrap Graph API calls in try/except and handle `M365PermissionError` gracefully
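As a rough illustration of the hashing rule above, a CPR value can be reduced to a SHA-256 hex digest before it reaches storage or logs. The function name `hash_cpr` and the digit-only normalisation are assumptions for this sketch; the project's own helper may normalise differently.

```python
import hashlib


def hash_cpr(cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number.

    Assumption: separators are stripped first so "DDMMYY-XXXX" and
    "DDMMYYXXXX" map to the same digest.
    """
    digits = "".join(ch for ch in cpr if ch.isdigit())
    return hashlib.sha256(digits.encode("utf-8")).hexdigest()


# Made-up value used only to exercise the function.
digest = hash_cpr("010190-1234")
```

Only `digest` would be written to the database or logs; the raw input string is discarded by the caller.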
**JavaScript (`static/js/*.js` — ES modules)**
**JavaScript (embedded in the Flask templates)**
- `const` / `let` — no `var`
- `async/await` over `.then()` chains
- All user-visible strings must have a `data-i18n` key so translations work
@ -78,9 +78,9 @@ tenant is available via the [Microsoft 365 Developer Program](https://developer.
## Adding a Language
1. Copy `lang/en.json` to `lang/xx.json` (ISO 639-1 code)
1. Copy `lang/en.lang` to `lang/xx.lang` (ISO 639-1 code)
2. Translate all values — keys must stay identical
3. Test by writing `xx` to `~/.gdprscanner/lang` and restarting
3. Test by setting `~/.m365_scanner_lang` to `xx` and restarting
---
@ -88,12 +88,10 @@ tenant is available via the [Microsoft 365 Developer Program](https://developer.
1. Fork the repository and create a branch: `git checkout -b feature/my-feature`
2. Make your changes and test them
3. Run the test suite: `python -m pytest tests/ -q`
4. Run a syntax check on the modules you touched, e.g.:
`python -m py_compile gdpr_scanner.py scan_engine.py app_config.py gdpr_db.py`
5. Update `README.md` if your change adds or changes user-visible behaviour
6. Open a pull request with a clear description of what it does and why
7. Link to the relevant issue if applicable
3. Run a syntax check: `python -m py_compile gdpr_scanner.py m365_connector.py gdpr_db.py`
4. Update `README.md` if your change adds or changes user-visible behaviour
5. Open a pull request with a clear description of what it does and why
6. Link to the relevant issue or SUGGESTIONS.md item if applicable
We aim to review pull requests within one week.

View File

@ -7,7 +7,7 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Web server
| Module | Purpose |
|---|---|
| `flask` | Web server and API routing for the GDPRScanner UI |
| `flask` | Web server and API routing for both the GDPRScanner UI |
### Microsoft 365 authentication and API
| Module | Purpose |
@ -15,64 +15,39 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
| `msal` | Microsoft Authentication Library — handles OAuth2 device code flow (delegated) and client credentials (application) for Microsoft Graph API access |
| `requests` | HTTP client used for all Microsoft Graph API calls |
### Google Workspace scanning
| Module | Purpose |
|---|---|
| `google-auth` | Service account authentication and domain-wide delegation for Google APIs |
| `google-auth-httplib2` | HTTP transport adapter for `google-auth` |
| `google-api-python-client` | Gmail API, Google Drive API, and Admin Directory API client |
### SMB / file system scanning
| Module | Purpose |
|---|---|
| `smbprotocol` | SMB2/3 network share scanning without requiring a mounted drive — used for Windows file server sources |
| `keyring` | OS keychain credential storage for SMB passwords |
| `python-dotenv` | `.env` file fallback for headless SMB credentials when no keychain is available |
### PDF handling
| Module | Purpose |
|---|---|
| `pdfplumber` | Text extraction from PDFs with a selectable text layer — fast and accurate for native PDFs |
| `pdf2image` | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| `pytesseract` | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
| `pypdf` | PDF metadata reading and low-level page manipulation |
| `reportlab` | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
| `pymupdf` (fitz) | Physically removes the text layer from PDFs — preferred GDPR-compliant redaction method |
| `pdf2image` *(optional)* | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| `pytesseract` *(optional)* | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
| `pypdf` *(optional)* | PDF metadata reading and low-level page manipulation — used in the `document_scanner.py` redaction path |
| `reportlab` *(optional)* | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
> Optional packages are not in `requirements.txt`. Install them manually if you need OCR or the standalone `document_scanner.py` CLI.
### Document formats
| Module | Purpose |
|---|---|
| `python-docx` | Read and write `.docx` Word documents; also used to generate the Article 30 Register of Processing Activities report |
| `openpyxl` | Read and write `.xlsx` Excel files — used for the scan result export workbook |
| `img2pdf` | Converts images to PDF for archiving redacted output |
### Image processing and face detection
| Module | Purpose |
|---|---|
| `opencv-python` (cv2) | Face detection in images via Haar cascade classifiers; also used for face blurring during anonymisation |
| `numpy` | Array operations required internally by OpenCV |
| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, EXIF metadata extraction |
| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, image resizing |
### NLP / Named Entity Recognition
| Module | Purpose |
|---|---|
| `spacy` | NLP engine for Danish Named Entity Recognition — detects person names, addresses, and organisations in text. Requires the `da_core_news_lg` model (~500 MB) |
### Encryption
### Archive scanning
| Module | Purpose |
|---|---|
| `cryptography` | Fernet symmetric encryption — encrypts SMTP passwords at rest in `~/.gdprscanner/smtp.json`; the Fernet key is derived from `~/.gdprscanner/machine_id` |
### Scheduling
| Module | Purpose |
|---|---|
| `APScheduler` | In-process background scheduler — drives the scheduled scan feature (`schedule.json`). Uses `BackgroundScheduler` with `CronTrigger` |
### System monitoring
| Module | Purpose |
|---|---|
| `psutil` | Available-memory probe in `scan_engine.py` — skips file downloads when free RAM drops below 300 MB to prevent OOM crashes on large tenants |
| `py7zr` | 7-Zip archive support — allows the scanner to inspect `.7z` compressed files |
### Desktop app packaging
| Module | Purpose |
@ -89,17 +64,16 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Data storage
| Module | Purpose |
|---|---|
| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdprscanner/scanner.db` |
| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdpr_scanner.db` |
| `json` | Config files, checkpoint files, language files, API request/response serialisation |
| `zipfile` | Database export/import archive creation and reading; also used in the PyInstaller build process |
| `csv` | CSV file scanning support |
| `csv` | CSV file scanning support in the Document Scanner |
### Security and hashing
| Module | Purpose |
|---|---|
| `hashlib` | SHA-256 hashing of CPR numbers before storage — raw CPR values are never written to the database |
| `secrets` | Cryptographically secure random values — used for viewer token generation and auth state parameters |
| `uuid` | UUID generation for viewer tokens and scan session identifiers |
| `secrets` | Cryptographically secure random values (used in auth state parameters) |
### File system and paths
| Module | Purpose |
@ -111,9 +85,8 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
### Networking and email
| Module | Purpose |
|---|---|
| `smtplib` | SMTP email delivery for the scheduled report feature — supports STARTTLS and SMTPS/SSL |
| `smtplib` | SMTP email delivery for the headless report feature — supports STARTTLS and SMTPS/SSL |
| `email` | Email message construction (MIME) for the SMTP report feature |
| `socket` | UDP probe to determine the machine's LAN IP address — used to build routable share links for viewer tokens |
### Text and pattern matching
| Module | Purpose |
@ -126,13 +99,12 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
| `threading` | Background scan thread so the Flask web UI stays responsive during long scans |
| `queue` | Server-Sent Events message queue — passes scan results from the background thread to the browser |
| `concurrent.futures` | `ProcessPoolExecutor` for parallel OCR processing of multi-page PDFs |
| `gc` | Explicit garbage collection after large scan batches to release memory promptly |
### I/O and streams
| Module | Purpose |
|---|---|
| `io` | In-memory byte streams for generating Excel and Word documents without writing to disk |
| `struct` | Binary data unpacking used in some PDF processing paths |
| `struct` | Binary data unpacking (used in some PDF processing paths) |
### Date and time
| Module | Purpose |
@ -145,15 +117,15 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
|---|---|
| `platform` | Detects the operating system for macOS/Windows-specific code paths |
| `subprocess` | Launches Tesseract and Poppler as external processes for OCR and PDF rendering |
| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db`, etc. |
| `sys` | Python runtime access — `sys.exit()`, `sys.path`, `sys.version` |
| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db` etc. |
| `sys` | Python runtime access — sys.exit(), sys.path, sys.version |
| `os` | Environment variables and low-level file operations |
| `logging` | Application-level logging — routes warnings and errors to stderr and rotating file handlers |
### Encoding and serialisation
| Module | Purpose |
|---|---|
| `base64` | Encodes thumbnail images as base64 strings for embedding in JSON API responses |
| `struct` | Binary format parsing used in some document processing paths |
---


@ -1,10 +0,0 @@
{
"folders": [
{
"path": "."
},
{
"path": "."
}
]
}


@ -102,7 +102,7 @@ tests/ pytest test suite — 112 tests, all should pass.
**Settings stats show 0 (Scanned / Flagged / Scans)**
`routes/database.py` → `db_stats()` — queries `flagged_items` and `scans` directly
→ Stats populate from existing DB on app start — no re-scan needed
→ If still 0 after a completed scan: check `~/.gdprscanner/scanner.db` exists and is not empty
→ If still 0 after a completed scan: check `~/.gdpr_scanner.db` exists and is not empty
**File scan results not persisting to DB**
`scan_engine.py` → `run_file_scan()` — must call `_db.begin_scan()` not `start_scan()`


@ -1,67 +0,0 @@
# Open Source Landscape — GDPR / PII Document Scanners
An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
---
## Summary
No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
---
## Existing open source tools
### [Microsoft Presidio](https://github.com/microsoft/presidio)
A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
### [Octopii](https://github.com/redhuntlabs/Octopii)
Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
### "GDPR scanners" on GitHub
Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
### CPR libraries
Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
---
## Commercial products that do cover it
| Product | M365 | GWS | CPR | Article 30 | Open source |
|---|---|---|---|---|---|
| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
---
## Capability comparison
| Capability | GDPRScanner | Presidio | Octopii | Commercial |
|---|---|---|---|---|
| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ | ✅ | ✅ | ❌ |
---
## What makes GDPRScanner unique
The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
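
For readers unfamiliar with the modulus-11 check mentioned above, here is a minimal sketch of the classic rule: multiply the ten digits by the weights 4, 3, 2, 7, 6, 5, 4, 3, 2, 1 and require the sum to be divisible by 11. The function names are hypothetical and this is not the project's `is_valid_cpr`; it deliberately ignores special cases such as numbers that are issued without a valid check digit.

```python
import re
from datetime import date

_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def _digits(cpr: str) -> str:
    """Strip the hyphen and any other non-digit characters."""
    return re.sub(r"[^0-9]", "", cpr)


def passes_mod11(cpr: str) -> bool:
    """True if the weighted digit sum is divisible by 11."""
    digits = _digits(cpr)
    if len(digits) != 10:
        return False
    return sum(w * int(d) for w, d in zip(_WEIGHTS, digits)) % 11 == 0


def has_plausible_date(cpr: str) -> bool:
    """True if the first four digits form a real day/month combination."""
    digits = _digits(cpr)
    if len(digits) != 10:
        return False
    day, month = int(digits[0:2]), int(digits[2:4])
    try:
        date(2000, month, day)  # leap year, so 29 February is accepted
    except ValueError:
        return False
    return True
```

Combining both checks is what separates a CPR candidate from an arbitrary ten-digit number such as an invoice or serial number.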

README.md

@ -1,13 +1,8 @@
# GDPRScanner
Scans Microsoft 365, Google Workspace, local/network file systems, and SFTP servers
for Danish CPR numbers and personal data (PII). Produces GDPR compliance reports and
supports Article 30 record-keeping obligations.
---
> **Work in progress — not ready for production use.**
> This project is under active development and has not been formally tested or audited for production deployment. It is shared publicly for transparency and collaboration. Use at your own risk.
Scans Microsoft 365, Google Workspace, and local/network file systems for Danish
CPR numbers and personal data (PII). Produces GDPR compliance reports and supports
Article 30 record-keeping obligations.
---
@ -32,7 +27,7 @@ an IDE with intelligent completion. The result is the author's work.
- **Folder path in results** — each email result shows its full folder path (e.g. `Inbox / Ansøgninger pædagog SFO`) in the card and in Excel export
- **Delete items** — flagged results can be deleted directly from the UI, individually or in bulk
- **CPR false-positive reduction** — strict CPR validation
- **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, SMB/network shares, and SFTP. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
- **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, and SMB/network shares. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
- **Progressive streaming** — results stream card-by-card via Server-Sent Events as the scan runs
- **Token auto-refresh** — expired tokens are detected and silently refreshed mid-scan without interrupting the UI
- **Incremental / resumable scans** — interrupted scans save a checkpoint; the next run resumes from where it stopped rather than starting over
@ -46,13 +41,10 @@ an IDE with intelligent completion. The result is the author's work.
- **Account name on cards** — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable
- **Retention policy enforcement** — flag items older than a configurable retention period with an Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via `--retention-years`
- **Data subject lookup** — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext
- **CPR cross-referencing** — clicking any flagged card with CPR hits shows a "Related documents" section listing other items from the same scan session that share at least one CPR number, ordered by number of shared CPRs. Clicking any entry opens it in the preview panel. Works in live mode and history mode. Powered by a SQL self-join on the `cpr_index` table — no new data collection required
- **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel; **bulk disposition tagging** lets you select multiple cards with checkboxes and apply a disposition to all of them at once. A stats bar above the grid shows total · unreviewed · retain · delete counts and the percentage reviewed
- **Interface PIN** — optional session-level PIN that gates the main scanner interface (`/`). Set a 4-8 digit PIN in **Settings → Security → Interface PIN**; unauthenticated visitors are redirected to `/login`. The `/view` viewer route and all viewer API endpoints are exempt — reviewers are unaffected. Salted SHA-256 hash; brute-force protection (5 attempts / 5 min per IP)
- **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings. Tokens can be **role-scoped** (Ansatte / Elever) so a recipient only sees items for their group, or **user-scoped** so an individual employee only sees their own flagged files (supports dual M365 + Google Workspace identity)
- **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel
- **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings
- **Article 30 report** — one-click export of a structured Word document (`.docx`) satisfying the GDPR Article 30 register of processing activities obligation
- **SQLite results database** — scan results, CPR index, PII breakdown, disposition decisions, and scan history are persisted to `~/.gdprscanner/scanner.db` alongside the JSON cache, enabling cross-scan queries and trend tracking
- **Software updates from the UI** — check for and install new versions from **Settings → General → Software update**, or enable automatic daily updates; the app restarts itself in place (see [Software updates](#software-updates) below)
- **Built-in user manual** — click the **?** button in the top bar to open the manual in a dedicated window. Available in Danish and English. Printable via the browser's print function. Served from `MANUAL-DA.md` / `MANUAL-EN.md` at `/manual?lang=da|en` — always in sync with the installed version, no internet required. In the packaged desktop app the manual opens as a native pywebview window; in the browser it opens as a popup.
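
The CPR cross-referencing feature in the list above is described as a SQL self-join on the `cpr_index` table. A minimal sketch of such a query follows; the column names (`item_id`, `cpr_hash`, `scan_id`) and the helper names are assumptions, since the real schema is not shown in this diff.

```python
import sqlite3

# Join cpr_index to itself: "mine" holds the clicked item's hashes,
# "other" holds every different item in the same scan sharing a hash.
RELATED_SQL = """
SELECT other.item_id, COUNT(DISTINCT other.cpr_hash) AS shared
FROM cpr_index AS mine
JOIN cpr_index AS other
  ON other.cpr_hash = mine.cpr_hash
 AND other.scan_id  = mine.scan_id
 AND other.item_id <> mine.item_id
WHERE mine.item_id = ? AND mine.scan_id = ?
GROUP BY other.item_id
ORDER BY shared DESC, other.item_id
"""


def create_demo_index(conn: sqlite3.Connection) -> None:
    """Create a hypothetical cpr_index table with a few demo rows."""
    conn.execute("CREATE TABLE cpr_index (item_id TEXT, cpr_hash TEXT, scan_id INTEGER)")
    rows = [
        ("A", "h1", 1), ("A", "h2", 1),
        ("B", "h1", 1), ("B", "h2", 1),
        ("C", "h1", 1),
        ("D", "h3", 1),
        ("E", "h1", 2),  # same hash, but a different scan session
    ]
    conn.executemany("INSERT INTO cpr_index VALUES (?, ?, ?)", rows)


def related_items(conn: sqlite3.Connection, item_id: str, scan_id: int):
    """Return (item_id, shared_count) rows, most shared CPR hashes first."""
    return conn.execute(RELATED_SQL, (item_id, scan_id)).fetchall()
```

Because the join runs on hashes that are already stored, no plaintext CPR is needed to find related documents.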
---
@ -81,7 +73,7 @@ The sidebar sources panel lists all configured scan sources. Click **Sources** t
**Google Workspace tab** — Two authentication modes: **Workspace** (service account with domain-wide delegation — scans all users) and **Personal account** (OAuth 2.0 device-code flow — scans the signed-in account only). Once connected, per-source toggles control whether Gmail and/or Google Drive appear in the sidebar panel and are included in scans. See [GOOGLE_SETUP.md](docs/setup/GOOGLE_SETUP.md) for setup instructions.
**File sources tab** — Add local folder paths, SMB/CIFS network shares, or SFTP servers. A pill selector (Local / Network / SFTP) switches the form fields. SFTP sources require host, port, username, remote path, and auth type (password or private key). SSH private keys are uploaded via the UI, validated with paramiko, and stored in `~/.gdprscanner/sftp_keys/` with `600` permissions; passwords and passphrases are stored in the OS keychain. Each saved source appears as a checkbox in the sidebar panel. Use the **Edit** button on each row to update credentials or rename a source without deleting it.
**File sources tab** — Add local folder paths or SMB/CIFS network shares with a name, path, and optional SMB credentials. Each saved source appears as a checkbox in the sidebar panel (local, SMB/network). Use the **Edit** button on each row to update credentials or rename a source without deleting it.
**Skipped automatically:** `.recycle`, `.sync`, `.btsync`, `.trash`, `.git`, `node_modules`, `System Volume Information`, and other system/sync folders. Hidden directories (`.` prefix) are skipped too.
@ -131,10 +123,9 @@ A date-from picker limits the scan to items modified after the selected date. Qu
| Scan attachments | On | Scan PDF/Word/Excel attachments inside emails |
| Max attachment size | **20 MB** | Skip attachments larger than this threshold |
| Max emails per user | **2000** | Cap per mailbox to avoid very long scans |
| **Δ Delta scan** | Off | Fetch only changed items since the last scan (see [Delta scan](#delta-scan) below) |
| **Δ Delta scan** | Off | Fetch only changed items since the last scan — hover the **?** for details (see [Delta scan](#delta-scan) below) |
| **Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
| **Ignore GPS in images** | Off | Skip images whose only PII signal is an embedded GPS coordinate. Useful for student scans where smartphones embed location in every camera photo. GPS is still shown in the detail card if the image is flagged for another reason (faces, EXIF author). |
| **Min. CPR count per file** | **1** | Only flag a file if it contains at least this many *distinct* CPR numbers. Set to 2 to suppress false positives in student scans (e.g. a student's own consent form with a single CPR) while still reporting class lists and grade sheets with multiple CPRs. |
| **Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
| **Retention policy** | Off | Flag items older than N years — hover the **?** for details (see [Retention policy](#retention-policy-enforcement)) |
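
The retention option in the table above flags items older than N years, and the README distinguishes rolling from fiscal-year-aligned cutoffs. The sketch below shows one plausible reading of the two modes. The function names and the exact fiscal-year rule (retention counted from 31 December of the item's year) are assumptions, not the project's confirmed behaviour.

```python
from datetime import date


def rolling_cutoff(today: date, years: int) -> date:
    """Cutoff exactly `years` years before today."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def fiscal_year_cutoff(today: date, years: int) -> date:
    """Cutoff aligned to 31 December.

    Assumption: an item from fiscal year Y must be kept until
    31 December of Y + years, so it is overdue once that date has passed.
    """
    return date(today.year - years - 1, 12, 31)


def is_overdue(modified: date, cutoff: date) -> bool:
    """An item is overdue when it was last modified on or before the cutoff."""
    return modified <= cutoff
```

With a 5-year retention on 1 March 2025, the rolling mode uses 1 March 2020 as the cutoff, while the fiscal-year mode uses 31 December 2019.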
#### Results grid
@ -153,32 +144,14 @@ Each flagged item appears as a card showing:
- **Ext.** / **** badge — external email recipient or externally shared file (Art. 44-46 transfer risk)
- **delete button** — appears on hover (grid view) or always visible (list view)
**Disposition stats bar** — always visible above the results grid when items are loaded. Shows: Total · Unreviewed · Retain · Delete · percentage reviewed. Updates live after every disposition save.
**Select mode** — click **Vælg** in the filter bar to enter bulk-selection mode. Per-card checkboxes appear; a bulk tag bar at the bottom of the grid shows the count of selected items, a **Select all visible** button, a disposition dropdown, and an **Apply** button. Click **Done** to exit select mode.
**Filter bar** — always visible above both the results grid and the preview panel. Narrow results by source, disposition, transfer risk, risk level, and role:
**Filter bar** — always visible above both the results grid and the preview panel. Narrow results by source, disposition, transfer risk, and risk level:
| Filter | Options |
|---|---|
| Source | All / Email / OneDrive / SharePoint / Teams |
| Disposition | All / Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted |
| Transfer risk | All / External recipient / External share / Shared |
| Risk level | All risk levels / Art. 9 special category / Photos / biometric |
| **Role** | **All roles / Ansatte (staff) / Elever (students)** |
The Role filter also scopes exports — selecting **Elever** before clicking **Excel** or **Art.30** produces a report containing only student items. The exported filename gets an `_elever` or `_ansatte` suffix so recipients can distinguish the files.
#### Scan history browser
Review results from any past scan session without running a new scan. A **Sessions** button appears in the banner above the results grid once a scan has completed.
- Click **Sessions** to open the session picker — lists all past scans with date, sources, and item count. Each entry shows a **Δ** badge for delta scans and a **Latest** badge for the most recent session.
- Click any session row to load its results into the grid. A history banner replaces the progress bar, showing the session date, sources scanned, and item count.
- **Latest scan** button in the banner jumps back to the most recent session.
- Starting a new scan automatically exits history mode and switches to live SSE results.
- All filters, dispositions, and exports work normally while browsing history — the Role filter and viewer-scope enforcement still apply.
- Viewer tokens work with history mode: `GET /api/db/flagged?ref=N` applies scope filtering the same way as the live endpoint.
| Risk level | All risk levels / Art. 9 special category / Photos / biometric |
#### Delete items
@ -209,11 +182,6 @@ The **⬇ Excel** button exports all current results to a `.xlsx` file (`m365_sc
| OneDrive | Flagged OneDrive files |
| SharePoint | Flagged SharePoint files |
| Teams | Flagged Teams files |
| Gmail | Flagged Gmail messages |
| Google Drive | Flagged Google Drive files |
| Local | Flagged local-folder files |
| Network | Flagged SMB/NAS files |
| SFTP | Flagged SFTP server files |
In macOS app builds, the export opens a native Save dialog instead of a browser download.
@ -228,7 +196,7 @@ Configure email delivery in **Settings → Email report**. Click **Save** to sto
| SMTP host | e.g. `smtp.office365.com`, `smtp.gmail.com` |
| Port | `587` for STARTTLS (default), `465` for SMTPS/SSL |
| Username | SMTP login — usually your sender email address |
| Password | Saved to `~/.gdprscanner/smtp.json` (permissions 600). Encrypted at rest using Fernet — key in `~/.gdprscanner/machine_id` (chmod 0o600, never share) |
| Password | Saved to `~/.gdpr_scanner_smtp.json` (permissions 600). Encrypted at rest using Fernet — key in `~/.gdpr_scanner_machine_id` (chmod 0o600, never share) |
| Graph API | When connected to M365, email is sent via `/me/sendMail` (delegated) or `/users/{sender}/sendMail` (app mode) — no SMTP password needed. Requires `Mail.Send` Graph permission with admin consent. |
| From address | Sender address (defaults to username if blank) |
| STARTTLS | Enable STARTTLS on port 587 (recommended) |
@ -268,13 +236,13 @@ The checkpoint is keyed by a hash of the scan configuration (sources + users + d
### Delta scan
Delta scan uses the Microsoft Graph `/delta` API (M365) and the Google Drive **Changes API** (Google Workspace) to fetch only items that have **changed since the last scan**, dramatically reducing API quota usage and scan time on large tenants.
Delta scan uses the Microsoft Graph `/delta` API to fetch only items that have **changed since the last scan**, dramatically reducing Graph API quota usage and scan time on large tenants.
#### How it works
1. Run one **full scan** first (Delta checkbox off) — this establishes baseline delta tokens
2. Tick **Δ Delta scan** and run again — only items added, modified, or deleted since the previous scan are fetched and CPR-scanned
3. Delta tokens are saved automatically to `~/.gdprscanner/delta.json` after each successful scan
3. Delta tokens are saved automatically to `~/.gdpr_scanner_delta.json` after each successful scan
4. To force a full rescan, click **Clear tokens** under the checkbox (or delete the file)
Delta tokens are stored **per-source**:
@ -285,12 +253,9 @@ Delta tokens are stored **per-source**:
| `sharepoint:{drive_id}` | One SharePoint document library |
| `teams:{drive_id}` | One Teams channel file store |
| `email:{user_id}:{folder_id}` | One mail folder for one user |
| `gdrive:{email}` | One Google Workspace user's Google Drive |
If a token expires (Graph returns HTTP 410 Gone), that source falls back to a full collection automatically and a fresh token is saved. Other sources are unaffected.
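
The per-source token store and the fallback on an expired token can be sketched as follows. Everything here is illustrative: `TokenExpired`, `collect`, and the `fetch(token)` callback are hypothetical stand-ins for the real connector, which would raise or signal when the server answers HTTP 410 Gone.

```python
import json
from pathlib import Path


class TokenExpired(Exception):
    """Raised by the (hypothetical) fetcher when the server answers 410 Gone."""


def load_tokens(path: Path) -> dict:
    """Read the token file; a missing or corrupt file means 'no tokens yet'."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_tokens(path: Path, tokens: dict) -> None:
    """Persist all per-source tokens as one JSON object."""
    path.write_text(json.dumps(tokens, indent=2), encoding="utf-8")


def collect(source_key: str, tokens: dict, fetch):
    """Fetch changes for one source, falling back to a full collection.

    `fetch(token)` returns `(items, new_token)`; `token=None` requests a
    full collection. Only this source's token is touched on fallback.
    """
    try:
        items, new_token = fetch(tokens.get(source_key))
    except TokenExpired:
        items, new_token = fetch(None)  # stale token: start over for this source
    tokens[source_key] = new_token
    return items
```

Keying the dictionary by strings such as `sharepoint:{drive_id}` is what lets one source fall back to a full collection while the others keep their incremental state.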
If a user's OneDrive returns HTTP 404 during a delta scan (no licence assigned, service plan disabled, or drive never provisioned because the account has never signed in), the user is silently skipped with a grey log entry — no red error card is shown. Full scans already skipped these users silently; delta scans now behave the same way.
Deleted items returned by delta (items with a `deleted` or `@removed` marker) are skipped during CPR scanning.
After each delta scan, the log panel shows:
@ -340,7 +305,7 @@ Scan results are persisted to `~/.gdprscanner/scanner.db` (SQLite) automatically
| `dispositions` | Compliance officer decisions per item |
| `scan_history` | Aggregated stats per scan for trend tracking |
**API endpoints:** `GET /api/db/stats`, `GET /api/db/trend`, `GET /api/db/scans`, `POST /api/db/subject`, `GET /api/db/overdue`, `POST /api/db/disposition`, `GET /api/db/disposition/<id>`, `GET /api/db/sessions`, `GET /api/db/flagged`
**API endpoints:** `GET /api/db/stats`, `GET /api/db/trend`, `GET /api/db/scans`, `POST /api/db/subject`, `GET /api/db/overdue`, `POST /api/db/disposition`, `GET /api/db/disposition/<id>`
If `gdpr_db.py` is not present, the scanner falls back to JSON-only mode silently.
@ -374,12 +339,6 @@ Every flagged item can be tagged with a compliance decision from the preview pan
Dispositions are saved to the `dispositions` table in the SQLite database and included in the Article 30 report.
#### Bulk disposition tagging
Click **Vælg** in the filter bar to enter select mode. Per-card checkboxes appear. Select individual cards or use **Select all visible** to select every card matching the current filters. Choose a disposition from the bulk tag bar at the bottom of the grid and click **Apply** — the selected items are updated in a single request to `POST /api/db/disposition/bulk`. Click **Done** to exit select mode.
A **disposition stats bar** above the results grid shows totals at a glance and updates after every save.
---
### Retention policy enforcement
@ -499,49 +458,6 @@ python gdpr_scanner.py --import-db ~/compliance/gdpr_export_2026.zip --import-mo
---
### Software updates
When the app runs from a git checkout (the normal server install), it can update itself. The **Settings → General → Software update** group offers:
- **Check for updates** — fetches the upstream repository and shows either "You are running the latest version" or the list of pending commits
- **Install update** — fast-forwards the checkout, reinstalls dependencies if `requirements.txt` changed, and restarts the app in place; the browser waits for the server to come back and reloads automatically
- **Install updates automatically** — optional toggle; a background thread checks once a day and installs unattended
Safety guarantees:
- Updating is **refused while any scan is running** — manual attempts get a clear message, and the auto-updater simply retries on its next hourly tick, so a scheduled scan is never killed mid-run
- Local edits on the server are **auto-stashed** (kept, never discarded) before the merge; the merge is fast-forward-only, so a diverged checkout stops the update instead of creating a merge mess
- Every applied update is recorded in the **compliance audit log** (`app_update`, old → new commit)
- The restart re-execs the process with the same PID, so it works identically under systemd and when launched via `start_gdpr.sh`
The Settings group is hidden in the packaged desktop app (no git checkout to update) — desktop users update by installing a new build.
**CLI / cron equivalent** — `update_gdpr.sh` performs the same update from a shell:
```bash
./update_gdpr.sh # update if upstream has new commits, restart service
./update_gdpr.sh --check # report pending commits, change nothing
```
It restarts a `gdprscanner.service` systemd unit if one exists (override the name with `GDPR_SERVICE=…`) and is quiet when already up to date, so it is safe to run from cron:
```bash
# /etc/cron.d/gdprscanner-update — nightly at 04:00
0 4 * * * root /opt/gdprscanner/update_gdpr.sh >> /var/log/gdpr_update.log 2>&1
```
API endpoints: `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`.
---
### HTTPS / reverse proxy
The scanner itself serves plain HTTP. For encrypted transport on a LAN — recommended, since scan results contain CPR numbers — put it behind a TLS-terminating reverse proxy and bind the app to loopback (`--host 127.0.0.1`) so the proxy is the only way in. Share links automatically follow the HTTPS hostname, and the browser Clipboard API (Copy buttons) works natively in a secure context.
See [ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md) for a complete walkthrough: Zoraxy, Let's Encrypt via DNS-01 challenge (required when the hostname resolves to a private IP), proxy rule, and the scanner-specific verification steps.
---
### Article 30 report
The **Art.30** button in the filter bar generates a GDPR **Article 30 Register of Processing Activities** as a Word document (`.docx`).
@ -566,32 +482,16 @@ The document is dated and can be stored as evidence of ongoing compliance activi
---
### Building the desktop app
### Building the M365 app
`build_gdpr.py` packages `gdpr_scanner.py` + `m365_connector.py` + `lang/` into a standalone native app using PyInstaller + pywebview.
`build_gdpr.py` packages `gdpr_scanner.py` + `m365_connector.py` + `lang/` into a standalone native app — same PyInstaller / pywebview approach as `build.py`.
```bash
python build_gdpr.py # build for the current platform
python build_gdpr.py --icons-only # regenerate icon_gdpr.icns / icon_gdpr.ico
python build_gdpr.py --icons-only # regenerate icon_m365.icns / icon_m365.ico
```
| Platform | Output | Native window |
|---|---|---|
| macOS | `dist/GDPRScanner.app` | WKWebView |
| Windows | `dist/GDPRScanner/GDPRScanner.exe` | WebView2 (Edge) |
| Linux | `dist/GDPRScanner/GDPRScanner` | GTK WebKit |
> **Cross-compilation is not supported** — build on the target platform, or use the pre-built binaries from the [GitHub Releases](../../releases) page.
**GitHub Actions** builds all three platforms automatically on every push to `main` and on `v*` tags. Pre-built zips are attached to each release:
| File | Platform |
|---|---|
| `GDPRScanner_windows_x64.zip` | Windows 10/11 x64 |
| `GDPRScanner_linux_x86_64.zip` | Ubuntu 22.04+ / Debian |
| `GDPRScanner_macos_x86_64.zip` | macOS 12+ Intel / Apple Silicon (Rosetta) |
> **macOS Gatekeeper:** the app is unsigned. On first launch right-click → **Open** to bypass the security warning.
> **Note:** Same cross-compilation restriction applies — must build on the target platform.
---
@ -644,58 +544,26 @@ python gdpr_scanner.py # GDPRScanner on port 5100 (auto-increments if in use)
### Test suite
GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, the SQLite database, and security-sensitive Flask routes.
GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, and the SQLite database.
```bash
pip install pytest
pytest tests/
```
**212 tests across 8 modules — all expected to pass.**
**112 tests across 4 modules — all expected to pass.**
| Module | Tests | Covers |
|---|---|---|
| `tests/test_document_scanner.py` | 37 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary crash safety |
| `tests/test_document_scanner.py` | 36 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary crash safety |
| `tests/test_app_config.py` | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption |
| `tests/test_checkpoint.py` | 18 | Checkpoint key stability, save/load/clear, wrong-key isolation, delta token round-trip |
| `tests/test_db.py` | 23 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle |
| `tests/test_routes.py` | 16 | Core route behaviour — scan status/start/stop, DB stats, dispositions, Excel and Article 30 export |
| `tests/test_route_integration.py` | 54 | Viewer token CRUD, role/user scope enforcement, bulk disposition isolation, viewer PIN, interface PIN gate, scan lock release on failure, session history ordering, profile routes CRUD and rename |
| `tests/test_google_scan.py` | 19 | Google scan routes (users/start/cancel) and `_run_google_scan` engine with mocked connector, checkpoints, and DB |
| `tests/test_updates.py` | 11 | Software-update routes — check/apply with mocked git, scan-running refusal, dirty-tree auto-stash, requirements reinstall, settings round-trip |
| `tests/test_db.py` | 24 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle |
Each unit-test module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.
Each new module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.
The test suite should be run before every release and after any change to `document_scanner.py`, `cpr_detector.py`, or `gdpr_db.py`. CPR detection is the legal core of the tool — a false negative means a real GDPR violation goes undetected.
#### Local-file scan fixtures
`tests/fixtures/local_files/` provides 19 files for end-to-end testing of the file scanner via the UI or `file_scanner.py`. Drop the folder as a local source and run a scan — all 13 PII-bearing files should be flagged and all 6 negative-case files should produce zero hits.
| File | Format | Expected | Scenario |
|---|---|---|---|
| `01_cpr_with_context_label.txt` | TXT | Flag | CPR with explicit `CPR-nummer:` label |
| `02_cpr_mod11_valid_bare.txt` | TXT | Flag | mod-11valid CPR without any context keyword |
| `03_cpr_post2007_with_context.txt` | TXT | Flag | Post-2007 birth (fails mod-11), detected via `Personnummer:` keyword |
| `04_multiple_cprs.txt` | TXT | Flag | 3 distinct CPR numbers in one staff-records file |
| `05_student_register.csv` | CSV | Flag | 8 students incl. one protected-address (day+40) CPR |
| `06_employee_list.csv` | CSV | Flag | 5 employees with CPRs |
| `07_protected_number.txt` | TXT | Flag | Protected CPR (`410172-1200`, day+40 encoding) |
| `08_mixed_pii.txt` | TXT | Flag | CPR + email + phone + GDPR Art. 9 health category |
| `09_cpr_in_docx.docx` | DOCX | Flag | 2 CPRs in a Word document (paragraph format) |
| `10_clean_no_pii.txt` | TXT | **No flag** | Meeting minutes — no personal data |
| `11_false_positive_invoice.txt` | TXT | **No flag** | Invoice: CPR-shaped numbers suppressed by `faktura`/`varenr` context |
| `12_post2007_no_context.txt` | TXT | **No flag** | Equipment serial that looks like a post-2007 CPR but has no context keyword |
| `13_cpr_in_xlsx.xlsx` | XLSX | Flag | Excel workbook with two sheets: students + employees |
| `14_audio_artist_pii.mp3` | MP3 | Flag | ID3 artist/title tags with a personal name → `exif_pii` |
| `15_audio_artist_pii.flac` | FLAC | Flag | Vorbis comment artist/title tags with a personal name → `exif_pii` |
| `16_audio_no_pii.mp3` | MP3 | **No flag** | Empty ID3 header — no metadata tags |
| `17_audio_no_pii.flac` | FLAC | **No flag** | FLAC with no Vorbis comment block |
| `18_video_gps.mp4` | MP4 | Flag | QuickTime GPS coordinates (Copenhagen) + artist tag → `gps_location` + `exif_pii` |
| `19_video_no_pii.mp4` | MP4 | **No flag** | Minimal MP4 container with no metadata |
All CPR numbers are mathematically valid (verified against `is_valid_cpr`). Run `generate_fixtures.py` inside the venv to regenerate all binary files after any changes. Requires `python-docx`, `openpyxl`, and `mutagen` (all included in `requirements.txt`).
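The mod-11 check and the day+40 encoding mentioned in the fixture table can be sketched as follows. This is a minimal illustration using the standard CPR weights, not the project's `is_valid_cpr`; the names `mod11_ok` and `decode_day` are hypothetical.

```python
# Hypothetical helpers for illustration; the project's own validator is is_valid_cpr.
MOD11_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def mod11_ok(cpr: str) -> bool:
    """Return True when the ten digits pass the weighted mod-11 check."""
    digits = [int(c) for c in cpr if c.isdigit()]
    if len(digits) != 10:
        return False
    return sum(d * w for d, w in zip(digits, MOD11_WEIGHTS)) % 11 == 0


def decode_day(cpr: str) -> int:
    """Undo the day+40 encoding described for fixture 07 (assumed rule)."""
    day = int(cpr[:2])
    return day - 40 if day > 40 else day
```

For the protected number from fixture 07, `410172-1200`, the weighted sum is 88, which is divisible by 11, and the encoded day 41 decodes to 1.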
### Roadmap
See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implementation status.
@ -707,22 +575,21 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| File | Description |
|---|---|
| `gdpr_scanner.py` | Flask entry point — scan orchestration, SSE route (`/api/scan/stream`), root route |
| `scan_engine.py` | M365 and local/SMB/SFTP scan logic — `run_scan()`, `run_file_scan()` |
| `scan_engine.py` | M365 and local/SMB scan logic — `run_scan()`, `run_file_scan()` |
| `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
| `sse.py` | SSE broadcast queue and `_current_scan_id` |
| `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
| `cpr_detector.py` | CPR pattern matching and validation. Defines `SUPPORTED_EXTS` — the single source of truth for which file extensions are scanned across all sources (M365, Google Drive, local/SMB). Also contains `VIDEO_EXTS` and `AUDIO_EXTS` subsets and the metadata extractors `_extract_video_metadata` / `_extract_audio_metadata`. |
| `cpr_detector.py` | CPR pattern matching and validation |
| `document_scanner.py` | Core scanning, redaction, OCR, NER, and PII detection engine |
| `gdpr_db.py` | SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history |
| `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
| `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. |
| `sftp_connector.py` | SFTP file iterator — `SFTPScanner.iter_files()` yields the same `(path, bytes, metadata)` tuple as `FileScanner`. Uses paramiko (`AutoAddPolicy`); supports password auth and private-key auth (RSA / Ed25519 / ECDSA / DSS). Passwords and key passphrases are stored in the OS keychain; key files live in `~/.gdprscanner/sftp_keys/`. Gracefully degrades when paramiko is not installed (`SFTP_OK` flag). |
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. |
| `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
| `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
| `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
| `static/js/state.js` | Shared mutable state module (`export const S`) — imported by all 12 feature modules |
| `static/js/*.js` | 12 ES modules: `ui`, `log`, `users`, `auth`, `profiles`, `scan`, `results`, `sources`, `scheduler`, `connector`, `viewer`, `history` |
| `static/js/state.js` | Shared mutable state module (`export const S`) — imported by all 11 feature modules |
| `static/js/*.js` | 11 ES modules: `ui`, `log`, `users`, `auth`, `profiles`, `scan`, `results`, `sources`, `scheduler`, `connector`, `viewer` |
| `static/app.js` | Archived JS monolith — no longer loaded |
| `routes/__init__.py` | Blueprint package marker |
| `routes/state.py` | Shared mutable state (`connector`, `flagged_items`, `LANG`, scan locks) — imported by all blueprints |
@ -737,15 +604,12 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
| `routes/email.py` | `/api/smtp/*` and `/api/send_report` |
| `routes/database.py` | `/api/db/*`, `/api/admin/*`, `/api/preview`, `/api/thumb` |
| `routes/export.py` | `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk` |
| `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management, role-scoped and user-scoped tokens |
| `routes/viewer.py` | `/view`, `/api/viewer/tokens`, `/api/viewer/pin` — read-only viewer mode: token + PIN auth, share-link management |
| `routes/app_routes.py` | `/api/about`, `/api/langs`, `/api/lang`, `/manual` |
| `routes/updates.py` | `/api/update/*` — software update check/apply, auto-update background thread |
| `update_gdpr.sh` | CLI/cron self-update script — fetch, fast-forward merge, dependency reinstall, service restart |
| `docs/manuals/MANUAL-EN.md` | End-user manual in English (15 sections) — served at `/manual?lang=en` |
| `docs/manuals/MANUAL-DA.md` | End-user manual in Danish (15 sections) — served at `/manual?lang=da` |
| `docs/setup/M365_SETUP.md` | Step-by-step Microsoft 365 setup guide |
| `docs/setup/GOOGLE_SETUP.md` | Step-by-step Google Workspace setup guide |
| `docs/setup/ZORAXY_SETUP.md` | HTTPS via Zoraxy reverse proxy — LAN-only deployment with Let's Encrypt DNS-01 |
| `build_gdpr.py` | PyInstaller build script — generates `m365_launcher.py`, packages desktop app |
| `lang/en.json` | English translations (source of truth) |
| `lang/da.json` | Danish translations (primary language) |


@ -54,10 +54,10 @@ Out of scope:
## Data Handling Notes for Security Researchers
- CPR numbers are stored in the SQLite database as **SHA-256 hashes only** — never in plaintext
- SMTP passwords are stored in `~/.gdprscanner/smtp.json` with chmod 600
- Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdprscanner/token.json`
- Scan results are stored locally in `~/.gdprscanner/scanner.db` — never transmitted externally
- The web UI binds to `0.0.0.0` by default so reviewers on the LAN can reach it — it is not designed to be exposed to the internet. For encrypted transport, put it behind a TLS-terminating reverse proxy and bind the app to loopback with `--host 127.0.0.1` — see [docs/setup/ZORAXY_SETUP.md](docs/setup/ZORAXY_SETUP.md)
- SMTP passwords are stored in `~/.gdpr_scanner_smtp.json` with chmod 600
- Microsoft OAuth tokens are stored in the MSAL token cache in `~/.gdpr_scanner_config.json`
- Scan results are stored locally in `~/.gdpr_scanner.db` — never transmitted externally
- The web UI binds to `127.0.0.1` by default — it is not designed to be exposed to the internet
---

File diff suppressed because it is too large Load Diff

161
TODO.md

@ -1,35 +1,11 @@
# TODO — Pending features and sustainability
Quick overview of what's still to be done.
Quick overview of what's still to be done. Full details in [SUGGESTIONS.md](SUGGESTIONS.md).
---
## Recently completed
### Bulk disposition tagging + disposition stats ✅
Select mode (filter bar "Vælg" button) reveals per-card checkboxes. Bulk tag bar appears at bottom of grid when items are selected; a single disposition dropdown + Apply sends `POST /api/db/disposition/bulk`. Stats bar shows total · unreviewed · retain · delete · % reviewed and updates after every save.
---
### Google Drive delta scan ✅
Drive scanning now uses the Google Drive Changes API when `delta` is enabled in scan options. First run records a start page token per user (`gdrive:{email}` in `delta.json`). Subsequent runs fetch only changed/new files. Invalid tokens fall back to a full scan automatically. Token save is load-then-merge to avoid overwriting concurrent M365 delta token writes.
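A load-then-merge save might look like the sketch below. The function name `save_delta_token` and the file layout (one flat JSON object keyed by `gdrive:{email}` or an M365 key) are assumptions for illustration, not the project's actual code.

```python
import json
from pathlib import Path


def save_delta_token(path: Path, key: str, token: str) -> dict:
    """Re-read the token file, update one key, and write the result back."""
    try:
        tokens = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        tokens = {}
    if not isinstance(tokens, dict):
        tokens = {}
    tokens[key] = token  # keys written by other engines are preserved
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(tokens), encoding="utf-8")
    tmp.replace(path)  # atomic rename on the same filesystem
    return tokens
```

Re-reading just before writing keeps tokens that another engine saved earlier. It narrows the window for lost updates but does not remove it; two truly simultaneous writers would still need a lock.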
---
### Auto-email after scheduled scan ✅ (already existed)
The scheduler already has an "Email report automatically" checkbox (`auto_email` flag in job config). `_send_email_report()` in `scan_scheduler.py` handles it after each scheduled scan completes — tries Microsoft Graph first, falls back to SMTP. Enable it in the scheduler settings panel.
---
### PDF OCR OOM kills on large documents ✅
`document_scanner` called `convert_from_path()` for the whole PDF before the processing loop, allocating all page images at once. A 50-page A4 at 300 DPI required ~1.3 GB in a single shot — enough to trigger the OS OOM killer.
Fixed in `scan_pdf`, `redact_fitz_pdf`, and `redact_pdf`:
- Replaced bulk pre-render with `convert_from_path(first_page=N, last_page=N)` inside the loop — one page in memory at a time
- Added `_ocr_mem_ok()` guard (checks `psutil.virtual_memory().available >= 500 MB`) before each render; pages that fail the check are skipped and recorded as `"skipped"` in `page_methods` with a printed warning
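
The ~1.3 GB figure and the per-page loop can be checked with a short sketch. It assumes uncompressed RGB at 3 bytes per pixel and an A4 page of about 8.27 × 11.69 inches; `scan_pages`, `render_page`, and `mem_ok` are hypothetical stand-ins for the real rendering call and memory guard.

```python
def page_bytes(width_in: float, height_in: float, dpi: int, channels: int = 3) -> int:
    """Approximate size of one rendered page held as raw pixels."""
    return round(width_in * dpi) * round(height_in * dpi) * channels


A4_INCHES = (8.27, 11.69)
PER_PAGE = page_bytes(*A4_INCHES, dpi=300)   # roughly 26 MB
FIFTY_PAGES = 50 * PER_PAGE                  # roughly 1.3 GB if rendered at once


def scan_pages(n_pages, render_page, mem_ok):
    """Render one page per iteration; skip pages when memory is low."""
    methods = []
    for n in range(1, n_pages + 1):
        if not mem_ok():
            methods.append("skipped")
            continue
        image = render_page(n)   # only this page's image is alive
        methods.append("ocr")
        del image                # release before the next render
    return methods
```

With one page in memory at a time, peak usage stays near the size of a single page rather than growing with the page count.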
---
### Memory exhaustion during large M365 scans ✅
Six root causes fixed in `scan_engine.py` and `document_scanner.py`:
- Email body HTML stripped at collection time (`body` key deleted from each message dict before it enters `work_items`; plain text stored as `_precomputed_body` instead)
@ -65,141 +41,6 @@ Full spec in SUGGESTIONS.md §29.
A shareable URL (token-protected) or numeric PIN that gives a DPO, school principal, or compliance coordinator read-only access to the results grid — with disposition tagging but without scan controls, credentials, or delete access. Full spec in SUGGESTIONS.md §33.
**Size:** Medium · **Priority:** Medium
### OneDrive 404 errors — investigate and handle appropriately ✅
404 on `drive/root/delta` during delta scans was being broadcast as a red `scan_error`. Root cause: `_get()` hit `raise_for_status()` for 404s, which fell through to the generic `except Exception` handler in `_scan_user_onedrive`. The full-scan path silently swallowed the same 404 via `except Exception: return` in `_iter_drive_folder_for`.
Fixed by adding `M365DriveNotFound(M365Error)` exception, raising it from `_get()` on 404, and catching it explicitly in `_scan_user_onedrive` with a lower-severity `scan_phase` broadcast ("OneDrive (user): not provisioned — skipped") instead of a red error card.
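
The pattern described above, a dedicated exception subclass caught before the generic handler, can be sketched like this. Only the class names come from the text; `fetch`, `scan_user_onedrive`, and the `broadcast` callback are simplified stand-ins, not the connector's real signatures.

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365DriveNotFound(M365Error):
    """Raised when the drive endpoint answers 404."""


def fetch(status: int) -> dict:
    # Stand-in for an HTTP GET: map status codes to exceptions.
    if status == 404:
        raise M365DriveNotFound("drive not found")
    if status >= 400:
        raise M365Error(f"HTTP {status}")
    return {"value": []}


def scan_user_onedrive(status: int, broadcast) -> None:
    try:
        fetch(status)
    except M365DriveNotFound:
        # Specific handler first: a missing drive is informational only.
        broadcast("scan_phase", "OneDrive (user): not provisioned, skipped")
    except Exception as exc:
        broadcast("scan_error", str(exc))
```

Because the specific `except` clause comes first, a 404 never reaches the generic handler, while other failures still surface as errors.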
---
### #34 — User-scoped viewer tokens ✅
Viewer token scope extended to `{"user": ["m365@…", "gws@…"], "display_name": "Alice Smith"}`, filtering `flagged_items` by `account_id IN (list)`. Lets a single employee see only their own flagged files across both M365 and Google Workspace.
**Implemented:**
1. Scope format — `user` is a list of email strings (one per platform); `display_name` stored for UI display. Legacy single-string format coerced to list automatically.
2. Token creation UI — scope-type selector (`All` / `Role` / `User`) reveals either the role select or a searchable name autocomplete. Autocomplete filters `S._allUsers` by display name or email; rows show name + both emails for dual-platform users. Selected user's full name fills the input; both emails stored in the scope.
3. `GET /api/db/flagged` — filters `WHERE account_id IN (scope.user set)`, covering items from both platforms.
4. Viewer header — `#viewerIdentityBadge` shows `scope.display_name` (full name); `#filterRole` hidden.
5. `POST /api/viewer/tokens` — validates all entries in `scope.user` contain `@`; rejects combined `role`+`user` scope.
6. Token list — shows display name badge; falls back to emails joined with `, `.
**Size:** Small · **Priority:** Medium
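
A rough sketch of the scope handling listed above: legacy coercion, validation, and filtering by `account_id`. The function names are hypothetical, and the real route applies the filter in SQL rather than in Python.

```python
def normalise_scope(scope: dict) -> dict:
    """Coerce the legacy single-string user scope to a list."""
    user = scope.get("user")
    if isinstance(user, str):
        return {**scope, "user": [user]}
    return scope


def validate_scope(scope: dict) -> None:
    """Reject combined role+user scopes and entries without '@'."""
    if scope.get("role") and scope.get("user"):
        raise ValueError("role and user scope cannot be combined")
    for entry in scope.get("user") or []:
        if "@" not in entry:
            raise ValueError(f"not an email address: {entry!r}")


def filter_items(items: list, scope: dict) -> list:
    """Keep only items owned by one of the scoped accounts."""
    users = set(scope.get("user") or [])
    if not users:
        return list(items)  # unrestricted token
    return [i for i in items if i.get("account_id") in users]
```

Storing one email per platform in the list is what lets a single token cover both the M365 and the Google Workspace account of the same person.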
---
### Scan history browser ✅
Review results from any past scan session without running a new scan.
**Implemented:**
1. `gdpr_db.py`: `get_sessions(limit=50, window_seconds=300)` groups `scans` rows into 300 s windows (same logic as `get_session_items`) and returns a newest-first list with `ref_scan_id` (highest scan_id in group), timestamps, sources set, flagged count, total scanned, and a delta flag.
2. `gdpr_db.py`: `get_session_items(ref_scan_id=N)` anchors the 300 s window to that scan's `started_at` instead of the latest scan when `ref_scan_id` is given.
3. `GET /api/db/sessions` (new endpoint in `routes/database.py`) — returns the sessions list; viewer-mode sessions share the same `GET /api/db/flagged?ref=N` endpoint with scope enforcement intact.
4. `static/js/history.js` (new module) — `loadHistorySession(refScanId)`, `openHistoryPicker()`, `closeHistoryPicker()`, `exitHistoryMode()`, `invalidateHistoryCache()` all exposed on `window.*`. Session cache (`_sessions`) invalidated by all `*_done` SSE handlers so the picker stays fresh after a new scan.
5. History banner (`#historyBanner`) — shows session date/time, sources, item count; "Sessions" button opens picker dropdown; "Latest scan" button appears only when not already viewing the latest.
6. Auto-load on page load — `results.js` calls `window.loadHistorySession?.(null)` when the SSE watchdog detects `!status.running`; `null` resolves to the latest completed session.
7. Live→history transition: clicking a session in the picker sets `S._historyRefScanId` and shows the banner. History→live transition: `startScan()` calls `window.exitHistoryMode?.()`.
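
The windowed grouping behind the session list could work roughly as sketched below. The exact rule is not spelled out above, so this assumes a scan joins a session when it started within `window_seconds` of that session's newest scan; `group_sessions` and the row keys are illustrative only.

```python
def group_sessions(scans: list, window_seconds: int = 300) -> list:
    """Group scan rows into sessions, newest session first."""
    sessions = []
    for scan in sorted(scans, key=lambda s: s["started_at"], reverse=True):
        current = sessions[-1] if sessions else None
        if current and current["newest_start"] - scan["started_at"] <= window_seconds:
            current["scan_ids"].append(scan["scan_id"])
            current["sources"].add(scan["source"])
        else:
            sessions.append({
                "newest_start": scan["started_at"],
                "scan_ids": [scan["scan_id"]],
                "sources": {scan["source"]},
            })
    for session in sessions:
        # The highest scan_id in the group serves as the session reference.
        session["ref_scan_id"] = max(session["scan_ids"])
    return sessions
```

Under this rule, an M365 scan and a Google scan started a minute apart end up in the same session, which matches how a combined run appears as one entry in the picker.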
---
### Gmail SMTP error message when App Password already in use ✅
The `535` auth error from Gmail fires for wrong app password, revoked app password, spaces in the 16-char code, and wrong username — all indistinguishable at the SMTP level. The old message unconditionally told users to "create an App Password", which is unhelpful when they already have one. Both the `smtp_test` and `send_report` error handlers now emit a Gmail-specific message that lists the three common causes and links to the App Password page for regeneration.
---
### Interface PIN ✅
Optional session-level authentication gate for the main scanner interface. Set in **Settings → Security → Interface PIN**. When set, any request to the main UI or API redirects to `/login` until the correct PIN is entered. `/view` and all viewer auth routes are exempt. Salted SHA-256 hash stored in `config.json`. Rate-limited: 5 failures per IP per 5 minutes.
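
The rate limit of 5 failures per IP per 5 minutes could be implemented along these lines. This is a sliding-window sketch; whether the application uses a sliding or a fixed window is not stated, and `FailureLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque


class FailureLimiter:
    """Track failed PIN attempts per IP inside a sliding time window."""

    def __init__(self, max_failures=5, window_seconds=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self._failures = defaultdict(deque)

    def _prune(self, ip: str) -> deque:
        # Drop timestamps that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        attempts = self._failures[ip]
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def record_failure(self, ip: str) -> None:
        self._prune(ip).append(self.clock())

    def is_blocked(self, ip: str) -> bool:
        return len(self._prune(ip)) >= self.max_failures
```

The injectable `clock` keeps the limiter testable without sleeping; in production the default monotonic clock avoids surprises from wall-clock adjustments.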
---
### OCR language override ✅
Tesseract language pack(s) used for scanned PDFs and images are now configurable per profile. Option `ocr_lang` (default `dan+eng`). Presets: `dan+eng`, `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. Threaded through `_scan_bytes`/`_scan_bytes_timeout` → `document_scanner.scan_pdf`/`scan_image` and the spawned PDF-OCR subprocess. OCR result cache keys include `lang` so per-language results are cached independently. Sidebar select `#optOcrLang`; profile editor `#peOptOcrLang`.
---
### CPR-only mode ✅
New scan option `cpr_only` (default `false`). When enabled, items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped — only items with at least one qualifying CPR number are flagged. Implemented as a compact short-circuit at each engine's flagging gate. Sidebar toggle `#optCprOnly`; profile editor `#peOptCprOnly`.
Also added `min_cpr_count` (default `1`) — minimum number of **distinct** CPR numbers required before a file is flagged. Files with faces or EXIF PII are still flagged regardless of this threshold.
---
### Skip GPS images ✅
Scan option `skip_gps_images` (default `false`). When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still stored in the card `exif` field if the item is flagged by another signal. Sidebar toggle `#optSkipGps`; profile editor `#peOptSkipGps`.
---
### CPR cross-referencing (related documents) ✅
The preview panel now shows a "Related documents" section listing other items in the same scan session that share ≥1 CPR number. Clicking any related item opens its preview. Implemented as a query-time self-join on the existing `cpr_index` table — no new data collection needed. `GET /api/db/related/<item_id>?ref=N` returns rows ordered by shared CPR count descending.
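
A query-time self-join of this kind might look like the following sketch. The two-column schema (`item_id`, `cpr_hash`) is an assumption about `cpr_index`, and the session scoping implied by `?ref=N` is left out for brevity.

```python
import sqlite3

# Assumed schema: cpr_index(item_id TEXT, cpr_hash TEXT)
RELATED_SQL = """
SELECT b.item_id, COUNT(DISTINCT b.cpr_hash) AS shared
FROM cpr_index AS a
JOIN cpr_index AS b
  ON a.cpr_hash = b.cpr_hash AND b.item_id != a.item_id
WHERE a.item_id = ?
GROUP BY b.item_id
ORDER BY shared DESC, b.item_id
"""


def related_items(conn: sqlite3.Connection, item_id: str) -> list:
    """Return (item_id, shared_cpr_count) for items sharing a CPR hash."""
    return conn.execute(RELATED_SQL, (item_id,)).fetchall()
```

Because the join runs over hashes that are already indexed for data-subject lookup, the feature needs no extra data collection, as noted above.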
---
### Email preview on checkpoint resume ✅
A 500-character plain-text body excerpt (`body_excerpt`) is now stored per flagged email at broadcast time and persisted in the DB. When the preview modal opens for an email item, this excerpt is shown immediately without requiring a live Graph/Gmail connection. Enables email preview to work correctly after a server restart and checkpoint resume.
---
### Built-in file redaction ✅
Local files (`.docx`, `.xlsx`, `.csv`, `.txt`) can be redacted in-place: CPR numbers are replaced by `██████-████` / `█` blocks, the card is removed from the grid, and a `"redacted"` disposition is logged. The ✂ button appears on redactable local file cards (hidden in viewer mode and for resolved items). File is written to a temp path in the same directory before `shutil.move` to avoid cross-device rename failures.
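
For plain-text files, the write-to-temp-then-move step can be sketched as below. The regular expression is a simplified CPR shape (six digits, optional hyphen, four digits) rather than the project's detector, and `redact_text` / `redact_file` are hypothetical names.

```python
import os
import re
import shutil
import tempfile

CPR_SHAPE = re.compile(r"\b\d{6}-?\d{4}\b")  # simplified pattern, not the real detector
MASK = "██████-████"


def redact_text(text: str) -> str:
    return CPR_SHAPE.sub(MASK, text)


def redact_file(path: str) -> int:
    """Redact a UTF-8 text file in place; return the number of replacements."""
    with open(path, encoding="utf-8") as fh:
        redacted, count = CPR_SHAPE.subn(MASK, fh.read())
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the final move stays on one device.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(redacted)
        shutil.move(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return count
```

Writing the full redacted content before the move means an interrupted run leaves either the original file or the finished redaction, not a half-written file.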
---
### Date-range scoping for viewer tokens ✅
Viewer tokens can now carry `valid_from` and/or `valid_to` fields (YYYY-MM-DD). `GET /api/db/flagged` filters out items whose `modified` date falls outside the range. All three scope dimensions (role, user, date-range) are independent and combinable. The share modal exposes `#shareValidFrom` / `#shareValidTo` date inputs. Token list shows a green date-range badge when a range is present.
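
The date-range check reduces to a small predicate. The sketch below assumes `modified` is an ISO-8601 string whose first ten characters are the date, and that both bounds are inclusive; neither detail is confirmed above, and `within_validity` is a hypothetical name.

```python
from datetime import date


def within_validity(modified: str, valid_from: str = None, valid_to: str = None) -> bool:
    """Return True when the item's modified date lies inside the token range."""
    day = date.fromisoformat(modified[:10])
    if valid_from and day < date.fromisoformat(valid_from):
        return False
    if valid_to and day > date.fromisoformat(valid_to):
        return False
    return True
```

Either bound may be omitted, which gives the open-ended ranges that "`valid_from` and/or `valid_to`" implies.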
---
### Re-scan diff ✅
When viewing a history session, items present in the immediately preceding session but absent from the current one are shown below a `.resolved-divider` separator with a green ✓ Resolved badge (opacity dimmed). These resolved items are grid-only — they are not added to `S.flaggedData` and cannot be bulk-selected or exported. The history banner shows a resolved count when applicable.
---
### Tests for Google Workspace scan engine ✅
19 tests added in `tests/test_google_scan.py` covering: `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`, and `_run_google_scan` engine internals. Uses synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
---
### Compliance audit log ✅
Every significant admin action is written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, action key, human-readable detail, and client IP. `GET /api/audit_log` returns newest-first (max 1000; filterable by `?action=`). Visible in Settings → **Audit Log** tab; refreshes when the tab is opened. `log_audit_event()` helper in `gdpr_db.py` silently no-ops if the DB is unavailable.
---
### Scheduled report-only email job ✅
Scheduler jobs can now be configured as "report only" (toggle `#schedReportOnly`). The job skips the scan entirely and emails the latest results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from DB via `get_session_items()`. M365 auth is not required — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Enabling report-only automatically checks "Email report automatically" and dims the Profile field (unused for report-only runs).
---
### SFTP as a 4th file connector ✅
Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()` and everything downstream (SSE, DB, export, scheduling) is unchanged. Auth supports password and SSH private key (+ optional passphrase). Key files stored in `~/.gdprscanner/sftp_keys/`. SFTP sources appear in the file sources panel with a 🔒 icon, are profile-aware, and are included in scheduled scans automatically.
**Files changed:** `sftp_connector.py` (new), `scan_engine.py`, `routes/sources.py`, `app_config.py`, `static/js/sources.js`, `templates/index.html`, `lang/en|da|de.json`, `routes/export.py`, `requirements.txt`
---
### Checkpoint / resume for Google and File scans ✅
Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
---
### Extended document anonymisation (redaction beyond local DOCX/XLSX/CSV/TXT)
Currently the ✂ redact button only works for local files with extensions `.docx`, `.xlsx`, `.csv`, `.txt`. Several valuable cases are not yet covered:
**1. PDF redaction for local files** ✅ — `redact_pdf_secure` (PyMuPDF physical redaction) wired to `_REDACT_EXTS` and the ✂ button. Falls back to reportlab overlay if PyMuPDF is absent.
**2. OneDrive / SharePoint / Teams file redaction** ✅ — `put_drive_item_content()` added to `m365_connector.py`; `redact_item()` in `routes/export.py` extended with a cloud branch: download via Graph, redact to a local temp file, re-upload via PUT. Supports DOCX, XLSX, PDF. ✂ button shown on cloud cards with supported extensions.
**3. Google Drive file redaction** ✅ — `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` added to both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`. `redact_item()` extended with a `gdrive` branch: check MIME type (rejects Google Docs/Sheets), download bytes, redact locally, upload back via `files().update()`. Requires `drive` scope (not `drive.readonly`) on the service-account delegation. ✂ button shown on Drive cards with DOCX/XLSX/PDF extension.
**4. SMB / SFTP file redaction** ✅ — `write_file(remote_path, content)` added to `SFTPScanner`; `write_smb_file(path, content, user, password, domain)` added to `file_scanner.py`. `redact_item()` extended with `sftp` and `smb` branches: download via native protocol, redact locally, write back. Source config matched from `_load_file_sources()`. SFTP requires the item to still be in `state.flagged_items` (in-session only). ✂ button shown on SMB/SFTP cards with DOCX/XLSX/CSV/TXT/PDF extension.
**5. Email body redaction (Exchange / Gmail)** — overwrite the message body via Graph `PATCH /messages/{id}` or Gmail API. High effort and high risk: HTML formatting must be preserved, inline images handled, and a mistake permanently corrupts the email. **Recommendation: skip** — deleting the email is a safer and simpler GDPR response for emails containing CPR numbers.
**Priority order:** PDF (1) first since it reuses existing code. Cloud files (2–4) on demand.
**Size:** Small (PDF) · Medium (cloud/SMB/SFTP) · **Priority:** Medium
---
### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.


@ -1 +1 @@
1.7.9
1.6.14


@ -276,44 +276,6 @@ def _admin_pin_is_set() -> bool:
return bool(_get_admin_pin_hash())
# ── Interface PIN ─────────────────────────────────────────────────────────────
# Salted SHA-256, stored in config.json under "interface_pin".
# When set, the main web interface requires PIN authentication before the
# index page or any /api/* route is accessible (viewer routes are exempt).
_INTERFACE_PIN_KEY = "interface_pin"
def get_interface_pin_hash() -> "dict | None":
"""Return the stored interface PIN hash dict, or None if not set."""
return _load_config().get(_INTERFACE_PIN_KEY)
def set_interface_pin(pin: str) -> None:
import secrets as _sec
if not pin:
raise ValueError("PIN must not be empty")
salt = _sec.token_hex(16)
h = _hashlib.sha256((salt + pin).encode()).hexdigest()
cfg = _load_config()
cfg[_INTERFACE_PIN_KEY] = {"hash": h, "salt": salt}
_save_config(cfg)
def verify_interface_pin(pin: str) -> bool:
"""Return True if *pin* matches the stored hash."""
meta = get_interface_pin_hash()
if not meta:
return False
return _hashlib.sha256((meta["salt"] + pin).encode()).hexdigest() == meta["hash"]
def clear_interface_pin() -> None:
cfg = _load_config()
cfg.pop(_INTERFACE_PIN_KEY, None)
_save_config(cfg)
def _load_config() -> dict:
if _CONFIG_FILE.exists():
try:
@ -329,43 +291,6 @@ def _save_config(cfg: dict):
pass
# ── Claude NER config ─────────────────────────────────────────────────────────
def get_claude_config() -> dict:
cfg = _load_config()
return {
"enabled": bool(cfg.get("claude_ner", False)),
"api_key_set": bool(cfg.get("claude_api_key", "")),
}
def save_claude_config(enabled: bool, api_key: "str | None" = None) -> None:
cfg = _load_config()
cfg["claude_ner"] = bool(enabled)
if api_key is not None:
# Encrypt at rest with the machine-keyed Fernet (same as the SMTP
# password). Falls back to plaintext only if cryptography is missing.
cfg["claude_api_key"] = _encrypt_password(api_key) if api_key else ""
_save_config(cfg)
def get_claude_api_key() -> str:
"""Return the decrypted Claude API key (handles legacy plaintext)."""
return _decrypt_password(_load_config().get("claude_api_key", ""))
# ── Software update config ────────────────────────────────────────────────────
def get_update_config() -> dict:
return {"auto_update": bool(_load_config().get("auto_update", False))}
def save_update_config(auto_update: bool) -> None:
cfg = _load_config()
cfg["auto_update"] = bool(auto_update)
_save_config(cfg)
# ── Profile storage (15a) ─────────────────────────────────────────────────────
_SETTINGS_PATH = _DATA_DIR / "settings.json"
_SRC_TOGGLES_PATH = _DATA_DIR / "src_toggles.json"
@ -581,8 +506,6 @@ def _save_role_overrides(overrides: dict) -> None:
# ── File source settings (#8) ─────────────────────────────────────────────────
_FILE_SOURCES_PATH = _DATA_DIR / "file_sources.json"
_SFTP_KEYS_DIR = _DATA_DIR / "sftp_keys"
_SFTP_KEYS_DIR.mkdir(exist_ok=True)
def _load_file_sources() -> list:
@ -607,32 +530,6 @@ def _save_file_sources(sources: list) -> None:
except Exception as e:
logger.error("[file_sources] write failed: %s", e)
def _resolve_sftp_credentials(source: dict) -> dict:
"""Return a copy of source with password/passphrase resolved from keychain.
Callers (run_file_scan, upload_key endpoint) should use this rather than
reading keychain credentials themselves, so the lookup logic stays in one place.
"""
try:
from sftp_connector import get_sftp_password
except ImportError:
return source
resolved = dict(source)
keychain_key = source.get("keychain_key") or None
host = source.get("sftp_host", "")
user = source.get("sftp_user", "")
if not resolved.get("sftp_password"):
resolved["sftp_password"] = get_sftp_password(host, user, keychain_key)
if not resolved.get("sftp_passphrase"):
# Passphrase stored under a distinct account name
passphrase_key = (keychain_key + ":passphrase") if keychain_key else None
resolved["sftp_passphrase"] = get_sftp_password(host, user, passphrase_key)
return resolved
# ── Viewer tokens ────────────────────────────────────────────────────────────
# Read-only viewer tokens allow sharing scan results with a DPO or compliance
# officer without exposing scan controls or credentials. Each token is a
@ -661,14 +558,12 @@ def _save_viewer_tokens(tokens: list) -> None:
logger.error("[viewer_tokens] write failed: %s", e)
def create_viewer_token(label: str = "", expires_days: int | None = None, scope: dict | None = None) -> dict:
def create_viewer_token(label: str = "", expires_days: int | None = None) -> dict:
"""Generate a new viewer token, persist it, and return the token dict.
Args:
label: Human-readable description (e.g. "DPO review April 2026").
label: Human-readable description (e.g. "DPO review April 2026").
expires_days: Days until expiry. None = no expiry.
scope: Optional access scope, e.g. {"role": "student"} or {"role": "staff"}.
Empty dict / None means unrestricted.
"""
import secrets as _secrets
token = _secrets.token_hex(32) # 64-char URL-safe hex string
@ -676,7 +571,6 @@ def create_viewer_token(label: str = "", expires_days: int | None = None, scope:
entry: dict = {
"token": token,
"label": label or "",
"scope": scope or {},
"created_at": now,
"expires_at": now + expires_days * 86400 if expires_days else None,
"last_used_at": None,
@ -813,7 +707,7 @@ def clear_viewer_pin() -> None:
# ── SMTP password encryption ─────────────────────────────────────────────────
# The SMTP password is encrypted at rest using Fernet symmetric encryption.
# The encryption key is derived from a stable machine-specific UUID stored in
# ~/.gdprscanner/machine_id. This key is only usable on the same machine —
# ~/.gdpr_scanner_machine_id. This key is only usable on the same machine —
# the encrypted password cannot be decrypted if the config file is copied to
# another host.
@ -878,13 +772,6 @@ def _load_smtp_config() -> dict:
cfg = json.loads(_SMTP_CONFIG_PATH.read_text(encoding="utf-8"))
if cfg.get("password"):
cfg["password"] = _decrypt_password(cfg["password"])
# Normalise legacy key names written by an older settings-tab UI
# (`user`/`starttls`) to the canonical keys every reader expects
# (`username`/`use_tls`), so configs saved before the fix still work.
if "username" not in cfg and "user" in cfg:
cfg["username"] = cfg["user"]
if "use_tls" not in cfg and "starttls" in cfg:
cfg["use_tls"] = cfg["starttls"]
return cfg
except Exception:
pass


@ -15,9 +15,7 @@ logger = logging.getLogger(__name__)
_DATA_DIR = Path.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True)
def _cp_path(prefix: str) -> Path:
return _DATA_DIR / f"checkpoint_{prefix}.json"
_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
def _checkpoint_key(options: dict) -> str:
"""Stable hash of the scan options — used to detect when a checkpoint
@ -29,7 +27,7 @@ def _checkpoint_key(options: dict) -> str:
}, sort_keys=True)
return hashlib.sha256(sig.encode()).hexdigest()[:16]
def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
"""Write checkpoint to disk. Called periodically during scanning."""
try:
payload = {
@ -38,31 +36,28 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, p
"flagged": flagged,
"meta": {k: v for k, v in meta.items() if k != "options"},
}
path = _cp_path(prefix)
tmp = path.with_suffix(".tmp")
tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
tmp.replace(path)
tmp.replace(_CHECKPOINT_PATH)
except Exception as e:
logger.error("[checkpoint] save failed: %s", e)
def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
def _load_checkpoint(key: str) -> dict | None:
"""Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
try:
path = _cp_path(prefix)
if not path.exists():
if not _CHECKPOINT_PATH.exists():
return None
payload = json.loads(path.read_text(encoding="utf-8"))
payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
if payload.get("key") != key:
return None
return payload
except Exception:
return None
def _clear_checkpoint(*, prefix: str = "m365") -> None:
def _clear_checkpoint() -> None:
try:
path = _cp_path(prefix)
if path.exists():
path.unlink()
if _CHECKPOINT_PATH.exists():
_CHECKPOINT_PATH.unlink()
except Exception:
pass
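Review note: both versions of `_save_checkpoint` use the same write-to-temp-then-rename pattern, and `_load_checkpoint` rejects a payload whose key does not match. A standalone sketch of that pattern follows. The function names and the `scanned_ids` payload field are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, key: str, scanned_ids: set, flagged: list) -> None:
    """Write the checkpoint to a sibling .tmp file, then rename it into place."""
    payload = {"key": key, "scanned_ids": sorted(scanned_ids), "flagged": flagged}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
    # Renaming replaces the target in one step, so a crash mid-write leaves
    # the previous checkpoint intact instead of a truncated file.
    tmp.replace(path)


def load_checkpoint(path: Path, key: str) -> Optional[dict]:
    """Return the stored payload only if it was written for the same scan key."""
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return payload if payload.get("key") == key else None
```

Passing the path in, as the `prefix` variant in this diff effectively does, lets several scan types keep separate checkpoints without overwriting each other.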

@ -2,17 +2,15 @@
cpr_detector.py: File scanning and CPR/PII detection for GDPRScanner.
Provides:
_scan_bytes(content, filename)         dispatch to correct scanner by file type
_scan_text_direct(text)                scan a plain text string
_extract_exif(content, filename)       extract PII-bearing EXIF tags from images
_extract_video_metadata(content, fn)   extract PII-bearing metadata from video files
_extract_audio_metadata(content, fn)   extract PII-bearing tags from audio files
_detect_photo_faces(content, fn)       count faces in an image (OpenCV)
_get_pii_counts(text)                  NER-based PII type counts
_make_thumb(content, filename)         JPEG thumbnail as base64 string
_placeholder_svg(ext, name)            SVG file-type icon
_scan_bytes(content, filename)         dispatch to correct scanner by file type
_scan_text_direct(text)                scan a plain text string
_extract_exif(content, filename)       extract PII-bearing EXIF tags from images
_detect_photo_faces(content, fn)       count faces in an image (OpenCV)
_get_pii_counts(text)                  NER-based PII type counts
_make_thumb(content, filename)         JPEG thumbnail as base64 string
_placeholder_svg(ext, name)            SVG file-type icon
Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
and _check_special_category are injected at startup by gdpr_scanner.py via
`from cpr_detector import *` AFTER those names are defined. This keeps the
module cleanly importable in isolation for unit tests (#26) while preserving
@ -22,7 +20,6 @@ from __future__ import annotations
import base64
import hashlib
import io
import re
import tempfile
import threading
from pathlib import Path
@ -50,17 +47,11 @@ except ImportError:
PILImage = None # type: ignore[assignment]
PIL_OK = False
VIDEO_EXTS = {
".mp4", ".mov", ".m4v", ".avi", ".mkv", ".wmv", ".flv", ".webm",
}
AUDIO_EXTS = {
".mp3", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".wav", ".opus", ".aiff", ".aif",
}
SUPPORTED_EXTS = {
".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
".txt", ".eml", ".msg",
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
} | VIDEO_EXTS | AUDIO_EXTS
}
PHOTO_EXTS = {
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp", ".heic", ".heif",
}
@ -199,226 +190,49 @@ def _extract_exif(content: bytes, filename: str) -> dict:
return result
def _extract_video_metadata(content: bytes, filename: str) -> dict:
"""Extract PII-bearing metadata from a video file.
Returns the same structure as _extract_exif so callers can treat both
identically:
gps {lat, lon, lat_ref, lon_ref, maps_url} or None
pii_fields {label: value} for title/artist/comment/description
author str or None
datetime str or None
device str or None
has_pii bool
"""Detect faces in an image file using OpenCV Haar cascades.
MP4/MOV/M4V: reads QuickTime/MPEG-4 tags via mutagen (no system deps).
GPS is extracted from the ©xyz QuickTime atom (ISO 6709 string written by
iPhones and Android devices: "+55.6763+012.5681+005.000/").
AVI: parses the RIFF INFO list chunk without any external library.
All other extensions: returns empty result immediately.
Returns the number of faces detected, or 0 if cv2 is unavailable,
the file is not a supported image format, or decoding fails.
Face detection is intentionally strict (minNeighbors=8, min_size=80px) to
reduce false positives on background textures, labels, and artwork.
Haar cascades are tuned for compliance flagging, not exhaustive detection. (#9)
"""
result: dict = {"gps": None, "pii_fields": {}, "author": None,
"datetime": None, "device": None, "has_pii": False}
ext = Path(filename).suffix.lower()
if ext in {".mp4", ".mov", ".m4v"}:
_extract_mp4_tags(content, result)
elif ext == ".avi":
_extract_avi_info(content, result)
return result
def _extract_mp4_tags(content: bytes, result: dict) -> None:
"""Populate result dict from MPEG-4/QuickTime container tags via mutagen."""
if not SCANNER_OK:
return 0
try:
import mutagen.mp4
tags = mutagen.mp4.MP4(io.BytesIO(content)).tags
if not tags:
return
# Text fields that may contain personal data
_tag_label = {
"©nam": "Title",
"©cmt": "Comment",
"©des": "Description",
"desc": "Description",
"©lyr": "Lyrics",
}
for tag, label in _tag_label.items():
val = tags.get(tag)
if val:
text = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
if len(text) >= _EXIF_PII_MIN_LEN:
result["pii_fields"][label] = text
result["has_pii"] = True
# Author — prefer ©ART (artist), fall back to album artist
for tag in ("©ART", "aART"):
val = tags.get(tag)
if val:
author = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
if len(author) >= _EXIF_PII_MIN_LEN:
result["author"] = author
result["pii_fields"]["Artist"] = author
result["has_pii"] = True
break
# Recording date
val = tags.get("©day")
if val:
result["datetime"] = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
# Device (QuickTime-specific tags written by iPhones)
make = tags.get("©mak")
model = tags.get("©mod")
if make or model:
result["device"] = " ".join(
str(v[0] if isinstance(v, list) else v).strip()
for v in (make, model) if v
)
# GPS — QuickTime ©xyz atom: "+55.6763+012.5681+005.000/" (ISO 6709)
import re as _re
for gps_tag in ("©xyz", "com.apple.quicktime.location.ISO6709"):
val = tags.get(gps_tag)
if val:
gps_str = str(val[0] if isinstance(val, list) else val).strip()
m = _re.match(r'([+-]\d+\.?\d*)([+-]\d+\.?\d*)', gps_str)
if m:
lat = round(float(m.group(1)), 7)
lon = round(float(m.group(2)), 7)
result["gps"] = {
"lat": lat,
"lon": lon,
"lat_ref": "N" if lat >= 0 else "S",
"lon_ref": "E" if lon >= 0 else "W",
"maps_url": f"https://www.google.com/maps?q={lat},{lon}",
}
result["has_pii"] = True
break
cv2_mod = getattr(ds, "_get_cv2", None)
if cv2_mod is None:
return 0
cv2, np = ds._get_cv2()
if cv2 is None or np is None:
return 0
except Exception:
pass
return 0
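Review note: the GPS branch of `_extract_mp4_tags` parses an ISO 6709 string such as `+55.6763+012.5681+005.000/`. The same regular expression is isolated below so it can be tested on its own. `parse_iso6709` is a hypothetical helper name, and the optional altitude component is ignored, as in the hunk.

```python
import re
from typing import Optional

# Two signed decimal numbers back to back: latitude, then longitude.
_ISO6709_RE = re.compile(r"([+-]\d+\.?\d*)([+-]\d+\.?\d*)")


def parse_iso6709(value: str) -> Optional[dict]:
    """Return lat/lon and hemisphere letters, or None if the string does not match."""
    m = _ISO6709_RE.match(value.strip())
    if not m:
        return None
    lat = round(float(m.group(1)), 7)
    lon = round(float(m.group(2)), 7)
    return {
        "lat": lat,
        "lon": lon,
        "lat_ref": "N" if lat >= 0 else "S",
        "lon_ref": "E" if lon >= 0 else "W",
    }
```

For example, `parse_iso6709("+55.6763+012.5681+005.000/")` yields latitude 55.6763 (N) and longitude 12.5681 (E).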
def _extract_avi_info(content: bytes, result: dict) -> None:
"""Populate result dict from RIFF INFO list chunk in an AVI file."""
try:
import struct
if len(content) < 12 or content[:4] != b"RIFF":
return
# Walk top-level RIFF chunks looking for the INFO LIST
i = 12
while i + 8 <= len(content):
chunk_id = content[i:i+4]
chunk_size = struct.unpack_from("<I", content, i + 4)[0]
if chunk_id == b"LIST" and content[i+8:i+12] == b"INFO":
_parse_riff_info(content, i + 12, i + 8 + chunk_size, result)
break
i += 8 + chunk_size + (chunk_size & 1) # RIFF chunks are word-aligned
# Decode image bytes → cv2 BGR array
arr = np.frombuffer(content, dtype=np.uint8)
img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
if img is None:
# imdecode failed (e.g. HEIC without codec) — try PIL fallback
if PIL_OK:
try:
from PIL import Image as _PILImg
import io as _io
pil_img = _PILImg.open(_io.BytesIO(content)).convert("RGB")
pil_arr = np.array(pil_img)
img = cv2.cvtColor(pil_arr, cv2.COLOR_RGB2BGR)
except Exception:
return 0
else:
return 0
faces = ds.detect_faces_cv2(img, min_size=80, neighbors=8)
return len(faces)
except Exception:
pass
def _parse_riff_info(content: bytes, start: int, end: int, result: dict) -> None:
import struct
_info_labels = {
b"INAM": "Title",
b"IART": "Artist",
b"ICMT": "Comment",
b"ISBJ": "Subject",
b"ICRD": "Date",
}
i = start
while i + 8 <= end and i + 8 <= len(content):
sub_id = content[i:i+4]
sub_size = struct.unpack_from("<I", content, i + 4)[0]
label = _info_labels.get(sub_id)
if label:
raw = content[i+8 : i+8+sub_size]
val = raw.decode("utf-8", errors="replace").strip("\x00 ")
if val and len(val) >= _EXIF_PII_MIN_LEN:
result["pii_fields"][label] = val
result["has_pii"] = True
if label == "Artist" and not result["author"]:
result["author"] = val
if label == "Date" and not result["datetime"]:
result["datetime"] = val
i += 8 + sub_size + (sub_size & 1)
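Review note: `_extract_avi_info` and `_parse_riff_info` both walk RIFF chunks using the same rule: a 4-byte ID, a little-endian 32-bit size, the payload, and one pad byte when the size is odd. A compact sketch of that walk, with hypothetical helper names and no dependency on the scanner's `result` dict:

```python
import struct


def iter_riff_chunks(data: bytes, start: int, end: int):
    """Yield (chunk_id, payload_offset, size) for each chunk in data[start:end]."""
    i = start
    while i + 8 <= end and i + 8 <= len(data):
        chunk_id = data[i:i + 4]
        size = struct.unpack_from("<I", data, i + 4)[0]
        yield chunk_id, i + 8, size
        i += 8 + size + (size & 1)  # RIFF chunks are word-aligned


def read_riff_info(data: bytes) -> dict:
    """Return {sub-chunk ID: decoded text} from the first LIST/INFO chunk, if any."""
    if len(data) < 12 or data[:4] != b"RIFF":
        return {}
    for chunk_id, off, size in iter_riff_chunks(data, 12, len(data)):
        if chunk_id == b"LIST" and data[off:off + 4] == b"INFO":
            return {
                sub_id: data[sub_off:sub_off + sub_size]
                .decode("utf-8", errors="replace")
                .strip("\x00 ")
                for sub_id, sub_off, sub_size in iter_riff_chunks(data, off + 4, off + size)
            }
    return {}
```

Sharing one iterator between the top-level walk and the INFO sub-chunk walk keeps the alignment rule in a single place.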
def _extract_audio_metadata(content: bytes, filename: str) -> dict:
"""Extract PII-bearing tags from an audio file.
Returns the same structure as _extract_exif / _extract_video_metadata.
No GPS extraction: GPS is not embedded in audio containers in practice.
Uses mutagen.File(easy=True), which normalises tags to lowercase keys for
MP3 (ID3), M4A/AAC (MPEG-4), FLAC, OGG Vorbis, and AIFF. WMA/ASF tags
use mixed-case keys (e.g. "Title", "Author"); these are lowercased during
normalisation so the same extraction logic covers all formats.
"""
result: dict = {"gps": None, "pii_fields": {}, "author": None,
"datetime": None, "device": None, "has_pii": False}
try:
import mutagen
f = mutagen.File(fileobj=io.BytesIO(content), filename=filename, easy=True)
if not f or not f.tags:
return result
# Normalise all tags to {lowercase_key: str_value} regardless of format
def _strval(v):
return str(v[0] if isinstance(v, list) and v else v).strip()
tags: dict[str, str] = {
k.lower(): _strval(v) for k, v in f.tags.items()
}
# Fields that may contain personal names or descriptions
_pii_keys = {
"title": "Title",
"artist": "Artist",
"albumartist": "Album Artist",
"composer": "Composer",
"lyricist": "Lyricist",
"conductor": "Conductor",
"author": "Author",
"copyright": "Copyright",
"comment": "Comment",
"description": "Description",
# WMA/ASF mixed-case keys survive as lowercase after normalisation
"wm/albumartist": "Album Artist",
"wm/composer": "Composer",
"wm/conductor": "Conductor",
"wm/lyrics": "Lyrics",
}
seen: set[str] = set() # avoid duplicate label entries
for key, label in _pii_keys.items():
val = tags.get(key, "")
if val and len(val) >= _EXIF_PII_MIN_LEN and label not in seen:
result["pii_fields"][label] = val
result["has_pii"] = True
seen.add(label)
# Author — most specific personal name field wins
for key in ("artist", "author", "albumartist", "wm/albumartist", "composer"):
val = tags.get(key, "")
if val and len(val) >= _EXIF_PII_MIN_LEN:
result["author"] = val
break
# Recording / release date
for key in ("date", "year", "wm/year"):
val = tags.get(key, "")
if val:
result["datetime"] = val
break
except Exception:
pass
return result
return 0
def _detect_photo_faces(content: bytes, filename: str) -> int:
"""Detect faces in an image file using OpenCV Haar cascades.
@ -463,151 +277,67 @@ def _detect_photo_faces(content: bytes, filename: str) -> int:
return 0
_EMAIL_RE = re.compile(
r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
_PHONE_RE = re.compile(
r'(?:'
r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}' # +45/0045 DDDD DDDD
r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}' # +45/0045 DD DD DD DD
r'|\b[2-9]\d{7}\b' # 8 consecutive digits
r'|\b[2-9]\d{3}[\s\-]\d{4}\b' # DDDD DDDD
r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b' # DD DD DD DD
r')'
)
def _extract_text_from_bytes(content: bytes, filename: str) -> str:
"""Extract plain text from file bytes for email/phone pattern matching.
Returns an empty string for binary media files (photos, video, audio) and
on any parse error; this function must never raise.
"""
ext = Path(filename).suffix.lower()
try:
if ext in {".txt", ".csv", ".eml", ".msg"}:
return content.decode("utf-8", errors="replace")
if ext in {".docx", ".doc"}:
from docx import Document as _Doc
doc = _Doc(io.BytesIO(content))
parts = [p.text for p in doc.paragraphs]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.append(cell.text)
return "\n".join(parts)
if ext in {".xlsx", ".xlsm"}:
import openpyxl as _xl
wb = _xl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
parts = [
str(cell.value)
for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row
if cell.value is not None
]
wb.close()
return " ".join(parts)
if ext == ".pdf":
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as pdf:
parts = [p.extract_text() or "" for p in pdf.pages]
return "\n".join(parts)
except Exception:
pass
if ext not in PHOTO_EXTS | VIDEO_EXTS | AUDIO_EXTS:
try:
return content.decode("utf-8", errors="replace")
except Exception:
pass
return ""
def _find_emails_phones(text: str) -> dict:
"""Extract unique email addresses and Danish phone numbers from text.
Returns {"emails": [{"formatted": str}, ...], "phones": [{"formatted": str}, ...]}.
Phones are normalised to digit-only strings (preserving a leading '+').
"""
if not text:
return {"emails": [], "phones": []}
emails = list(dict.fromkeys(m.group(0).lower() for m in _EMAIL_RE.finditer(text)))
phones = list(dict.fromkeys(
('+' + re.sub(r'[\s\-]', '', m.group(0)[1:]) if m.group(0).lstrip().startswith('+')
else re.sub(r'[\s\-]', '', m.group(0)))
for m in _PHONE_RE.finditer(text)
))
return {
"emails": [{"formatted": e} for e in emails],
"phones": [{"formatted": p} for p in phones],
}
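Review note: the behaviour of `_find_emails_phones` is easier to check in isolation. The sketch below reuses the two patterns from this hunk unchanged and simplifies the normalisation to "strip spaces and hyphens", which keeps a leading `+`. The function returns plain lists rather than the `{"formatted": ...}` wrappers; that simplification is an assumption made for brevity.

```python
import re

EMAIL_RE = re.compile(r"\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b")
PHONE_RE = re.compile(
    r"(?:"
    r"(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}"      # +45/0045 DDDD DDDD
    r"|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}"  # +45/0045 DD DD DD DD
    r"|\b[2-9]\d{7}\b"                                 # 8 consecutive digits
    r"|\b[2-9]\d{3}[\s\-]\d{4}\b"                      # DDDD DDDD
    r"|\b[2-9]\d(?:[\s\-]\d{2}){3}\b"                  # DD DD DD DD
    r")"
)


def find_emails_phones(text: str) -> dict:
    """Return unique lowercased emails and separator-free Danish phone numbers."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    emails = dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(text))
    phones = dict.fromkeys(re.sub(r"[\s\-]", "", m.group(0)) for m in PHONE_RE.finditer(text))
    return {"emails": list(emails), "phones": list(phones)}
```

Note that the 8-digit alternatives will also match other 8-digit numbers that start with 2 to 9, so results are best treated as candidates for review.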
def _scan_bytes(content: bytes, filename: str, poppler_path=None, lang: str = "dan+eng") -> dict:
"""Scan raw bytes for CPRs, emails, and phone numbers. Returns result dict."""
def _scan_bytes(content: bytes, filename: str, poppler_path=None) -> dict:
"""Scan raw bytes for CPRs. Returns scanner result dict."""
if not SCANNER_OK:
return {"cprs": [], "dates": [], "emails": [], "phones": [], "error": "scanner not available"}
return {"cprs": [], "dates": [], "error": "scanner not available"}
ext = Path(filename).suffix.lower()
with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(content)
tmp_path = Path(tmp.name)
result: dict = {"cprs": [], "dates": []}
try:
if ext == ".pdf":
# Check if the PDF has a text layer before running full scan_pdf.
# Image-only PDFs (scanned documents) have no text and would trigger
# Tesseract OCR subprocesses that hang indefinitely on some files.
try:
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as _pdf:
import pdfplumber as _pp, io as _io
with _pp.open(_io.BytesIO(content)) as _pdf:
has_text = any(ds.is_text_page(p) for p in _pdf.pages)
if not has_text:
return {"cprs": [], "dates": [], "emails": [], "phones": []}
return {"cprs": [], "dates": []} # image-only PDF — no CPRs possible
except Exception:
pass # if pdfplumber fails, fall through to full scan_pdf
result = ds.scan_pdf(tmp_path, poppler_path=poppler_path, lang=lang)
return ds.scan_pdf(tmp_path, poppler_path=poppler_path)
elif ext in {".docx", ".doc"}:
result = ds.scan_docx(tmp_path)
return ds.scan_docx(tmp_path)
elif ext in {".xlsx", ".xlsm"}:
result = ds.scan_xlsx(tmp_path)
return ds.scan_xlsx(tmp_path)
elif ext == ".csv":
result = ds.scan_csv(tmp_path)
return ds.scan_csv(tmp_path)
elif ext == ".txt":
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
result = {"cprs": cprs, "dates": dates}
return {"cprs": cprs, "dates": dates}
elif ext in {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}:
result = ds.scan_image(tmp_path, lang=lang)
return ds.scan_image(tmp_path)
else:
# Try plain text
try:
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
result = {"cprs": cprs, "dates": dates}
return {"cprs": cprs, "dates": dates}
except Exception:
pass
return {"cprs": [], "dates": []}
except Exception as e:
result = {"cprs": [], "dates": [], "error": str(e)}
return {"cprs": [], "dates": [], "error": str(e)}
finally:
try:
tmp_path.unlink()
except Exception:
pass
ep = _find_emails_phones(_extract_text_from_bytes(content, filename))
result["emails"] = ep["emails"]
result["phones"] = ep["phones"]
return result
def _worker_scan_pdf(pdf_path_str: str, result_q, lang: str = "dan+eng") -> None:
def _worker_scan_pdf(pdf_path_str: str, result_q) -> None:
"""Worker executed in a spawned subprocess — must be a module-level function."""
try:
import document_scanner as _ds
from pathlib import Path as _Path
result_q.put(_ds.scan_pdf(_Path(pdf_path_str), lang=lang))
result_q.put(_ds.scan_pdf(_Path(pdf_path_str)))
except Exception as e:
result_q.put({"cprs": [], "dates": [], "error": str(e)})
def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang: str = "dan+eng") -> dict:
def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dict:
"""Like _scan_bytes but runs PDF scanning in a spawned subprocess with a hard timeout.
For non-PDF files delegates straight to _scan_bytes. For PDFs it writes the
@ -617,7 +347,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang:
"""
ext = Path(filename).suffix.lower()
if ext != ".pdf":
return _scan_bytes(content, filename, lang=lang)
return _scan_bytes(content, filename)
import multiprocessing
ctx = multiprocessing.get_context("spawn")
@ -630,7 +360,7 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang:
try:
with _pdf_subprocess_sem:
q = ctx.Queue()
p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q, lang))
p = ctx.Process(target=_worker_scan_pdf, args=(tmp_path_str, q))
p.start()
p.join(timeout)
if p.is_alive():
@ -649,22 +379,19 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60, lang:
def _scan_text_direct(text: str) -> dict:
"""Scan a plain text string for CPRs, emails, and phone numbers.
"""Scan a plain text string for CPRs using extract_matches.
Uses ds.extract_matches() directly rather than ds.scan_text() because
scan_text() calls extract_cpr_and_dates() which is not defined in
document_scanner.py (pre-existing bug).
"""
if not text:
return {"cprs": [], "dates": [], "emails": [], "phones": []}
ep = _find_emails_phones(text)
if not SCANNER_OK:
return {"cprs": [], "dates": [], **ep}
if not SCANNER_OK or not text:
return {"cprs": [], "dates": []}
try:
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates, **ep}
return {"cprs": cprs, "dates": dates}
except Exception:
return {"cprs": [], "dates": [], **ep}
return {"cprs": [], "dates": []}
def _html_esc(s: str) -> str:
"""HTML-escape a string for safe inline embedding."""
@ -706,11 +433,6 @@ def _placeholder_svg(ext: str, name: str) -> str:
}
bg, label = colors.get(ext, ("#9CA3AF", ext.upper().lstrip(".")))
short = name[:22] + "…" if len(name) > 22 else name
# Escape label/name before embedding — served as image/svg+xml, so an
# unescaped value (from the ?name= query param via /api/thumb) would be a
# reflected-XSS vector when the URL is opened directly.
label = _html_esc(label)
short = _html_esc(short)
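Review note: the escaping step in this hunk matters because the SVG is served as `image/svg+xml` and the file name can come from a query parameter. A reduced sketch of the same idea, using the standard library's `html.escape` in place of the project's `_html_esc` (whose body is not shown here), with a hypothetical function name and a simplified SVG:

```python
from html import escape


def label_svg(label: str, name: str) -> str:
    """Build a tiny SVG with user-influenced text escaped before embedding."""
    short = name[:22] + "..." if len(name) > 22 else name
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="280" height="60">'
        f'<text x="10" y="20">{escape(label)}</text>'
        f'<text x="10" y="45">{escape(short)}</text>'
        "</svg>"
    )
```

Truncating before escaping, as above, avoids cutting an entity such as `&lt;` in half.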
svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="280" height="360">
<rect width="280" height="360" fill="{bg}"/>
<rect x="20" y="20" width="240" height="280" rx="8" fill="rgba(255,255,255,0.12)"/>

@ -1,6 +1,6 @@
# GDPR Scanner — User Manual
Version 1.7.9
Version 1.6.14
---
@ -33,7 +33,7 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
**What the scanner examines:**
- Microsoft 365: Exchange email, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drive
- Local and network file folders (including SMB/NAS drives and SFTP servers)
- Local and network file folders (including SMB/NAS drives)
**What it finds:**
- CPR numbers
@ -50,16 +50,16 @@ Når der er fundet elementer, kan du gennemgå dem, beslutte hvad der skal ske m
When you open the scanner, the screen is divided into three areas:
```
┌──────────────────────────────────────────────────────────────┐
│  Top bar: Scan button, profiles, actions                     │
│  Left panel     ├────────────────────────────────────────────┤
│  - Sources      │ Results / scan progress                    │
│  - Settings     │                                            │
│  - Accounts     │                                            │
│  - Statistics   ├────────────────────────────────────────────┤
│                 │ Activity log                               │
└─────────────────┴────────────────────────────────────────────┘
┌─────────────────┬────────────────────────────────────────────┐
│                 │ Top bar: Scan button, profiles, actions    │
│  Left panel     ├────────────────────────────────────────────┤
│                 │                                            │
│  - Sources      │ Results / scan progress                    │
│  - Settings     │                                            │
│  - Accounts     │                                            │
│  - Statistics   ├────────────────────────────────────────────┤
│                 │ Activity log                               │
└─────────────────┴────────────────────────────────────────────┘
```
**Left panel** — choose what to scan and how.
@ -104,33 +104,17 @@ Fanen Google Workspace lader dig forbinde en Google Workspace-konto (tidligere G
| Gmail | All emails in each user's inbox and labels |
| Google Drive | All files owned by or shared with each user |
### 3.3 Local, network, and SFTP file sources
### 3.3 Local and network files
The **File sources** tab shows the local folders, network drives, and SFTP servers you have configured.
The **File sources** tab shows the local folders and network drives you have configured.
**To add a new file source:**
1. Enter a **Label** — a name you can recognise (e.g. "School shared folder").
2. Choose the **source type** using the pills at the top of the form:
**Local**
- Enter the **Path** to the folder: `~/Dokumenter` or `/Volumes/Drev`.
- Click **Add**.
**Network (SMB)**
- Enter the **Path** in UNC format: `//nas-server/delt` or `\\server\delt`.
- Fill in **SMB host**, **Username**, and **Password**. The password is stored securely in the system keychain.
- Click **Add**.
**SFTP**
- Enter the **Host** (hostname or IP address of the SSH/SFTP server).
- Enter the **Port** (default 22).
- Enter the **Username**.
- Enter the **Remote path** to scan (e.g. `/home/delt` or `/`).
- Choose the **Authentication type**:
- **Password** — enter the password. It is stored securely in the system keychain.
- **Private key** — click **Upload key file** and select your SSH private key (OpenSSH or PEM format). If the key is protected by a passphrase, enter it. The key file is stored in the scanner's data folder with `600` permissions.
- Click **Add**.
2. Enter the **Path**:
- Local folder: `~/Dokumenter` or `/Volumes/Drev`
- Network drive: `//nas-server/delt` or `\\server\delt`
3. For a network drive, the **SMB host**, **Username**, and **Password** fields are filled in automatically. The password is stored securely in the system keychain.
4. Click **Add**.
You can add as many file sources as you need. They appear as selectable sources in the left panel when you are ready to scan.
@ -170,10 +154,6 @@ Scan kun elementer ændret efter en bestemt dato. Hurtige forudindstillinger —
**Max emails per user** — stop after scanning this number of emails per person (default 2,000). Increase it if you need full coverage.
**CPR-only mode** — when enabled, only items containing at least one qualifying CPR number are flagged. Items whose only findings are email addresses, phone numbers, faces, or GPS/EXIF metadata are skipped. Useful when you want a report focused exclusively on CPR exposure.
**OCR language** — choose the language pack Tesseract uses when reading text from scanned PDFs and images. The default is `Dansk + Engelsk`, which covers the vast majority of documents. Switch to another preset if your documents are predominantly in another language.
### 4.4 Start the scan
Click the blue **Scan** button in the top bar.
@ -200,8 +180,6 @@ Klik på **▶ Genoptag** for at fortsætte fra det sted, scanningen slap. Klik
## 5. Understanding the results
When you open the app, the grid shows **all open findings** — every flagged item that still requires action (i.e. has no disposition), across all your scans and not just the most recent one. As you tag items (keep, anonymise, delete, false positive …), they disappear from this view, so what remains is your outstanding work. Each item appears once with its latest state. To view a single earlier scan instead, use the session picker (see *Browse previous scan sessions* below).
Each found item is shown as a card. Here is an explanation of the badges and labels:
### Source badges
@ -214,8 +192,7 @@ Hvert fundet element vises som et kort. Her er forklaringen på mærker og label
| Teams | Found in a Teams channel |
| Gmail | Found in a Gmail mailbox |
| Google Drive | Found in Google Drive |
| Local / Network | Found on a local or SMB file share |
| 🔒 SFTP | Found on an SFTP server |
| Local / Network | Found on a file share |
### Risk level
@ -249,19 +226,6 @@ Brug filterbjælken over resultaterne til at indsnævre visningen:
- **Disposition** — show items by review status.
- **Sharing** — filter by shared / external / all.
- **Risk** — show only Art. 9, photos, GPS, or high-risk items.
- **Role** — show only **Staff** or **Students**. This also affects exports: if you click **Excel** or **Art.30** while a role is selected, the report contains only that group, and the file name gets the suffix `_elever` or `_ansatte`.
### Browse previous scan sessions
When a scan has finished, you can browse the results from an earlier scan session without running a new scan.
- Click the **Sessions** button in the history banner (shown above the results grid when a scan has finished) to open the session picker.
- Each row shows the date and time, which sources were scanned, and how many items were found. A **Δ** badge indicates delta scans; **Latest** marks the most recent session.
- Click a row to load that session's results into the grid. A history banner replaces the status line with the session's details.
- Click **Open findings** in the banner to leave the earlier session and return to the default view of all items that still require action.
- Starting a new scan automatically ends history mode and switches to live results.
All filters, exports, and disposition tagging work normally while you browse earlier sessions.
---
@ -276,7 +240,6 @@ Forhåndsvisningen viser:
- All CPR numbers found and their context
- Other personal data detected (phone, email address, IBAN, etc.)
- Sharing and external access information
- **Related documents** — if other items in the same scan session contain one or more of the same CPR numbers, they are listed in a "Related documents" section. Click an item to open its preview. This makes it easier to trace a person's data across multiple files or emails.
### Set a disposition
@ -294,46 +257,6 @@ Hvert element har en **Disposition**-rullemenu i forhåndsvisningspanelet. Vælg
Click **Save** after choosing. A small **✓ Saved** confirmation appears.
### Redact a file in place
A **✂** button appears on result cards where the scanner can overwrite the file directly. Clicking it replaces all CPR numbers with `██████-████` blocks, and the action is recorded as a `"redacted"` disposition. The card **stays in the grid until your next scan** — it is shown dimmed with a green **✏ Redacted** badge, and its action buttons are hidden so it cannot be processed again. This makes it easy to see what you have handled in the session; the grid is rebuilt the next time you scan. Use this option when you want to anonymise a file rather than delete it entirely.
The button is available for the following source types and formats:
| Source | Supported formats |
|---|---|
| Local files | DOCX, XLSX, CSV, TXT, PDF |
| Network drives (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drive | DOCX, XLSX, PDF |
The button is **not** available for email items (Exchange/Gmail) or in viewer mode. Google Docs and Sheets that were exported as DOCX/XLSX during scanning cannot be redacted in place — export the file manually from Google first, then redact the downloaded copy.
> **PDF security note:** PDF redaction is physical — the CPR number text is deleted from the PDF data stream and is not merely covered with a black box. A reader cannot recover the original text by selecting beneath the redaction or by inspecting the file programmatically. Image-based (scanned) PDFs are also supported: the scanner locates the CPR number on the page image via OCR and physically overwrites that area.
> **OneDrive / SharePoint / Teams note:** Redaction writes the modified file back via the Microsoft Graph API and requires the `Files.ReadWrite.All` permission. The scanner now requests this permission automatically at sign-in. If you authorised before this update, sign out and sign in again (Settings → Microsoft 365 → Sign out) so the scanner obtains a new token with write access. For app-only setups (service principal), a Global Administrator must grant the `Files.ReadWrite.All` application permission in Azure → App registrations → API permissions → Grant admin consent.
> **Google Drive note:** Redaction in Google Drive requires the `drive` scope on the service account's domain-wide delegation (not just `drive.readonly`). If redaction fails with a permission error, ask your Google Workspace administrator to add the scope `https://www.googleapis.com/auth/drive` to the service account's delegation in the Admin Console.
> **SFTP note:** SFTP redaction is only available for items found in the current scan session. Run a new scan if you are browsing historical results.
### Bulk-tag several items at once
If you need to apply the same disposition to many items, you can use **Select mode** instead of opening each card individually.
1. Click **Select** in the filter bar. Checkboxes appear on each result card.
2. Tick the items you want to tag, or click **Select all visible** in the bulk-tag bar at the bottom of the screen to select everything that matches the current filters.
3. Choose a disposition from the dropdown in the bulk-tag bar.
4. Click **Apply**. All selected items are updated immediately.
5. Click **Done** (or the **Select** button again) to leave select mode.
> **Tip:** Use the filter bar to narrow the view to, for example, all unreviewed student findings, then click **Select all visible** — that way you can tag an entire category in two clicks.
### Disposition statistics bar
A thin statistics bar above the results grid shows: **Total · Unreviewed · Retain · Delete** and a **% reviewed** figure. It updates automatically after each save and gives you a running overview of how far you are in the review.
### Find all items for a specific person
Click **🔍** in the left panel (under Statistics) to open the **Data subject** lookup. Enter a CPR number, and the scanner finds all flagged items that contain it. You can then delete them all in one step — in accordance with the right to erasure (GDPR Article 17).
@ -364,8 +287,6 @@ Klik på **Slet**-knappen i filterbjælken for at åbne massesletningsvinduet.
4. A status line shows the deletions in real time. Emails are moved to **Deleted Items**; files are moved to the **recycle bin**.
Deleted items (whether from a single deletion, a bulk deletion, or a deletion following a data subject request) **stay in the grid until your next scan** — dimmed with a red **🗑 Deleted** badge and with action buttons hidden — so you can see what was removed in the session. If a bulk deletion partially fails, only the items the server actually deleted are marked; those that failed remain active so you can try again. The grid is rebuilt the next time you scan.
A complete audit log of all deletions (what was deleted, when, and why) is included in the Article 30 report.
---
@ -402,7 +323,7 @@ Klik på **Profiler** for at åbne profil­administrations­panelet. Her kan du:
Click **Excel** in the filter bar to download the current results as an Excel workbook. The workbook contains:
- A summary tab with scan date, item counts, and source breakdown.
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network, SFTP).
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network).
- Every flagged item, including source, account, CPR count, risk level, sharing status, and disposition.
The **Excel** and **Art.30** buttons are always available, even after restarting the application, and export the results from the most recent completed scan session without requiring a new scan.
@ -437,22 +358,15 @@ Du kan give en DPO, skoleleder eller compliance-koordinator skrivebeskyttet adga
Click the **🔗** button in the top-right of the top bar to open the Share panel.
1. Optionally enter a **Label** to identify who the link is for (e.g. "DPO review April 2026").
2. Choose a **Scope**:
- **All roles**: the recipient sees all flagged items.
- **Ansatte** / **Elever**: the recipient sees only items belonging to the selected role group. The role filter is locked in their view.
- **User**: the recipient sees only items belonging to a specific employee. Select the person from the search box; the scanner automatically matches both their M365 and Google Workspace email addresses. Use this option when you want to give an individual employee access to their own scan results.
3. Optionally set a **Date range**: use the "Items from" and "Items until" fields to limit the recipient's view to items modified within a specific period. Leave both fields blank for no date restriction.
4. Choose an **Expiry**: 7 days, 30 days, 90 days, 1 year, or Never.
5. Click **Create**. The form clears and the new link appears at the top of the **Active links** list below, briefly highlighted.
6. Click **Copy** on the link's row to copy it to your clipboard, then send it to the reviewer.
2. Choose an **Expiry**: 7 days, 30 days, 90 days, 1 year, or Never.
3. Click **Create**. A unique link is generated: `http://host:5100/view?token=…`
4. Click **Copy** to copy the link to your clipboard, then send it to the reviewer.
The reviewer opens the link in a browser. They can see the results grid (limited to the permitted role scope) and tag dispositions, but cannot start scans, change settings, view credentials, or delete items.
The reviewer opens the link in a browser. They can see the full results grid and tag dispositions, but cannot start scans, change settings, view credentials, or delete items.
**Managing existing links**
The Share panel lists all active links. Each row shows the label, role badge (if scoped), expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
> **Tip:** In schools and municipalities it is common to have separate DPOs or compliance officers for staff and for students. Create one scoped link for each: the student DPO will only see student data, and the staff DPO will only see staff data.
The Share panel lists all active links. Each row shows the label, expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
### 10.2 Viewer PIN
@ -460,7 +374,7 @@ Som alternativ til token-links kan du angive en numerisk PIN-kode (48 cifre)
To set or change the PIN, enter the new PIN in the **New PIN** field and click **Save PIN**. To remove it, click **Clear PIN**.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked, has an expiry date, and can be scoped to a specific role group. Use the PIN option only for trusted internal reviewers on your local network who need access to all results.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked and has an expiry date. Use the PIN option only for trusted internal reviewers on your local network.
### 10.3 What the reviewer can do
@ -477,7 +391,6 @@ For at angive eller ændre PIN-koden skal du indtaste den nye kode i feltet **Ny
| Delete items | No |
| Access Settings | No |
| Create or revoke viewer links | No |
| See items outside their role scope | No |
---
@ -496,7 +409,6 @@ Gå til **Indstillinger → Planlægger** for at konfigurere automatiske scannin
7. Optionally enable:
- **Send rapport automatisk**: email the Excel report to your configured recipients after each scan.
- **Håndhæv opbevaringspolitik**: automatically delete items older than your retention policy after each scan.
- **Kun rapport** (Report only): skip the scan and just email the latest results already in the database. Useful for regular summary emails without running a new scan. When enabled, no profile is needed and M365 authentication is not required.
8. Click **Gem** (Save).
The scheduler indicator in the top bar shows the date and time of the next scheduled scan ("Næste: …").
@ -528,17 +440,7 @@ Klik på **Gem** for at gemme, og klik derefter på **Test** for at sende en tes
> If your account has MFA (two-factor authentication) enabled, you cannot use your regular password. You must create an **app password** in your account's security settings:
> - **Personal Microsoft account**: account.microsoft.com/security → App passwords
> - **Gmail / Google Workspace**: myaccount.google.com → Security → 2-Step Verification → App passwords (for Google Workspace accounts, your administrator must first allow app passwords or set up an SMTP relay)
### Always send via SMTP (skip Microsoft Graph)
When the scanner is signed in to Microsoft 365, it normally sends email through Microsoft 365 directly, without using the SMTP settings above. That is convenient, but it cannot deliver to certain addresses, in particular an address on a Google-hosted subdomain of your Microsoft 365 domain, which Microsoft 365 treats as internal and silently discards (no delivery, no error).
Turn on **Send altid via SMTP (spring Microsoft Graph over)** to force all email (test emails, manual reports, and automatic post-scan email) through the SMTP server you configured above. Use this when your reports go to a mailbox that Microsoft 365 cannot deliver to (e.g. a Google Workspace address), with `smtp.gmail.com` / `smtp-relay.gmail.com` as the SMTP host.
### Send report after manual scan
Turn on **Send rapport efter manuel scanning** to automatically email the report to your configured recipients every time a manual scan finishes.
> - **Gmail**: myaccount.google.com → Security → 2-Step Verification → App passwords
### Send a report manually
@ -578,7 +480,6 @@ Klik på **Nulstil database** for at slette alle scanningsdata, dispositioner og
| Setting | Description |
|-------------|-------------|
| Theme | Dark or light |
| Software update | Check for and install new versions of the scanner directly from the browser, or turn on automatic daily updates. Shown only on server installations running from a git checkout (not in the desktop app). The application restarts itself after installation; updating is refused while a scan is running, and the next scan after an update proceeds normally. |
### Security tab
@ -586,7 +487,6 @@ Klik på **Nulstil database** for at slette alle scanningsdata, dispositioner og
|-------------|-------------|
| Admin PIN | Optional PIN that protects destructive actions (reset database, replace on import) |
| Viewer PIN | Optional 4-8 digit PIN that lets anyone open `/view` in a browser as a read-only reviewer without a token link |
| Interface PIN | Optional 4-8 digit PIN that must be entered before the scanner's own interface can be accessed. Anyone who visits the scanner URL is redirected to a login page until the correct code is entered. Access via `/view` is not affected. |
### Advanced scan options
@ -596,31 +496,6 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
**Søg efter ansigter i billeder** (Scan images for faces): a slower scan that detects photographs with recognisable human faces and flags them as Article 9 biometric data. Recommended for schools that store student photos.
**Ignorer GPS i billeder** (Ignore GPS in images): when enabled, images are not flagged if GPS coordinates in the image metadata are the only PII signal. Useful when scanning student accounts: smartphones automatically embed GPS coordinates in every camera photo, which would otherwise generate many low-priority findings in a school context. If an image is already flagged for another reason (faces, EXIF author fields), the GPS coordinates still appear in the detail card.
**Min. CPR-antal pr. fil** (Min. CPR count per file): a file is flagged only if it contains at least this many *distinct* CPR numbers. The default is 1 (the current behaviour). Set it to 2 to avoid false positives in student scans: a student's consent form or enrolment form typically contains only the student's own CPR number, while a class list or grade overview with several students' CPR numbers will still be reported.
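The distinct-count threshold can be illustrated with a minimal sketch. The regular expression and the function names below are assumptions made for illustration; the scanner's real detector is likely stricter (for example, it may validate the date part).

```python
import re

# Assumed pattern: six digits (DDMMYY), optional hyphen, four digits.
# This is an illustrative simplification, not the scanner's actual detector.
CPR_PATTERN = re.compile(r"\b(\d{6})-?(\d{4})\b")


def distinct_cpr_count(text: str) -> int:
    """Count distinct CPR-like numbers; '010190-1234' and '0101901234' count once."""
    return len({date + serial for date, serial in CPR_PATTERN.findall(text)})


def should_flag(text: str, min_cpr_count: int = 1) -> bool:
    """Flag a file only if it holds at least min_cpr_count distinct CPR numbers."""
    return distinct_cpr_count(text) >= min_cpr_count


# A consent form repeats one student's number; a class list holds several.
consent_form = "Student CPR: 010190-1234. Confirmed CPR 0101901234."
class_list = "Anna 010190-1234, Bo 020291-5678, Cia 030392-9012"
```

With the threshold set to 2, the consent form is skipped because both occurrences are the same number, while the class list is still reported.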
**CPR-only mode**: when enabled, items without CPR numbers (only email addresses, phone numbers, faces, or GPS/EXIF data) are skipped. Use this when you want a report focused exclusively on CPR exposure.
**OCR language**: selects the language pack Tesseract uses when reading text from scanned PDFs and images. Default: `Dansk + Engelsk`. Switch to another preset for documents in German, Swedish, or French.
### AI / NER tab
Go to **Settings → AI / NER** to configure Claude AI-powered name recognition.
By default, the scanner uses spaCy (a local machine-learning model) to recognise person names, addresses, and organisation names in document text. Enabling Claude NER replaces this with calls to the Claude Haiku API, which is considerably more accurate, particularly for Danish double-barrelled surnames (e.g. "Hansen-Nielsen"), foreign-language names, and names without surrounding context (e.g. isolated cells in a spreadsheet).
**To enable it:**
1. Create an Anthropic API key at [console.anthropic.com](https://console.anthropic.com).
2. Paste the key into the **Anthropic API-nøgle** (Anthropic API key) field and click **Gem** (Save).
3. Turn on the **Aktiver Claude NER** (Enable Claude NER) switch and click **Gem** again.
4. Click **Test nøgle** (Test key) to confirm that the key is valid and the API is reachable.
**Cost:** Claude Haiku is billed per token at Anthropic's published rates. A typical document costs a fraction of an øre. Scan results are cached per document, so rescanning the same file never incurs a new charge.
**Fallback:** If the `anthropic` package is not installed, or the API key is missing, the scanner automatically falls back to spaCy without an error; the switch simply has no effect.
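The fallback rule can be sketched as a small decision function. The function name, its parameters, and the `"claude"` / `"spacy"` return values are hypothetical and only illustrate the described behaviour; they are not the scanner's actual code.

```python
import importlib.util
from typing import Optional


def choose_ner_backend(claude_enabled: bool,
                       api_key: Optional[str],
                       package_available: Optional[bool] = None) -> str:
    """Return 'claude' only when the switch is on, a key is set, and the
    `anthropic` package can be imported; otherwise fall back to 'spacy'."""
    if package_available is None:
        # find_spec returns None when the package is not installed.
        package_available = importlib.util.find_spec("anthropic") is not None
    if claude_enabled and api_key and package_available:
        return "claude"
    # No exception is raised: the switch silently has no effect.
    return "spacy"
```

Passing `package_available` explicitly keeps the sketch easy to exercise on machines where the package is absent.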
**Opbevaringspolitik** (Retention policy): when enabled, items older than the specified number of years are marked as overdue. The fiscal year end determines how the cutoff date is calculated:
| Setting | Cutoff date calculation |
@ -629,12 +504,6 @@ Som standard bruger scanneren spaCy (en lokal maskinlæringsmodel) til at genken
| 31 dec (Bogføringsloven) | Most recent 31 December minus N years |
| 30 jun / 31 mar | Most recent occurrence of that date minus N years |
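The fiscal-year-end rows above can be worked through with a short sketch. It assumes that "most recent" means the latest fiscal year end on or before today; the function name and parameters are illustrative and not taken from the scanner.

```python
from datetime import date


def retention_cutoff(today: date, years: int,
                     fy_month: int = 12, fy_day: int = 31) -> date:
    """Most recent fiscal year end on or before `today`, minus `years` years."""
    year_end = date(today.year, fy_month, fy_day)
    if year_end > today:
        # This year's fiscal year end has not happened yet; use last year's.
        year_end = date(today.year - 1, fy_month, fy_day)
    return date(year_end.year - years, fy_month, fy_day)


# Example: a 5-year policy evaluated on 15 April 2026.
today = date(2026, 4, 15)
cutoff_dec = retention_cutoff(today, 5)          # fiscal year ends 31 Dec
cutoff_mar = retention_cutoff(today, 5, 3, 31)   # fiscal year ends 31 Mar
cutoff_jun = retention_cutoff(today, 5, 6, 30)   # fiscal year ends 30 Jun
```

Under these assumptions, an item would count as overdue when its modification date is earlier than the computed cutoff.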
### Audit log tab
Go to **Settings → Revisionslog** (Audit log) to view an immutable log of all significant administrative actions in the scanner. Each entry shows the time, action type, details, and the client's IP address. Recorded events include: save/delete profile, create/revoke viewer token, PIN changes, add/update/delete file source, save/delete scheduled job, start/stop scan, save SMTP configuration, disposition changes, delete item, and edit item.
The log is read-only and is stored in the scanner database alongside the scan results. It is included in database exports and can help you demonstrate accountability to a supervisory authority.
---
## 15. Frequently Asked Questions
@ -646,10 +515,10 @@ Nej. CPR-numre fundet under en scanning gemmes kun som et antal (f.eks. "3 CPR-n
Emails are moved to the user's **Deleted Items** folder in Exchange; they are not permanently deleted and can be restored by the user or an administrator. Files are moved to the **recycle bin** of the relevant service (OneDrive, SharePoint, file system). Permanent deletion requires a subsequent action by the user or an administrator.
**Can I scan without connecting to Microsoft 365?**
Yes. You can scan local folders, SMB/NAS drives, and SFTP servers without any M365 or Google connection. Open **Kilder** (Sources), go to the **Filkilder** (File Sources) tab, and add your file paths or SFTP server details.
Yes. You can scan local and SMB file shares without any M365 or Google connection. Open **Kilder** (Sources), go to the **Filkilder** (File Sources) tab, and add your file paths.
**What is delta scanning, and when should I use it?**
Delta scanning uses Microsoft Graph change tokens (for M365) and the Google Drive Changes API (for Google Workspace) to fetch only items changed since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have completed a full baseline scan. Enable it in the Options section of the left sidebar.
Delta scanning uses Microsoft Graph change tokens to fetch only items changed since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have completed a full baseline scan. Enable it in the Options section of the left sidebar.
**The scan stopped. Can I continue where it left off?**
Yes. When you start the scan again, a yellow banner offers to resume from the checkpoint. Click **▶ Genoptag** to continue. If you prefer to start over, click **Start forfra**.
@ -666,21 +535,9 @@ I kontoafsnittet i venstre panel er der et felt **+ Tilføj konto manuelt**. Ind
**Is the scanner running? I cannot see a progress bar.**
Check the activity log at the bottom of the screen. If a scan is running, messages appear here. If you see nothing, the scan may have finished or not started. Also check that you have selected at least one source and at least one account.
**Can I password-protect the scanner so that students or colleagues cannot access it on the network?**
Yes. Go to **Settings → Security → Interface PIN** and set a 4-8 digit PIN. From then on, anyone who opens the scanner URL in a browser is shown a login page and cannot proceed without the correct code. The Interface PIN is separate from the Admin PIN (which protects destructive actions) and the Viewer PIN (which protects read-only access). Existing viewer token links continue to work without the Interface PIN.
**Can a reviewer tag dispositions without access to the scan controls?**
Yes. Use the **🔗 Del** (Share) button to create a read-only viewer link, or set a Viewer PIN under Settings → Security. The reviewer opens the link in their browser and can browse results and tag dispositions without seeing credentials, sources, or scan buttons. See section 10 for details.
**Can I restrict a share link to a specific time period?**
Yes. Use the "Items from" and "Items until" fields in the Share panel when you create a token link. The recipient will only see items whose modification date falls within the specified range.
**Where can I see who changed what in the scanner?**
Go to **Settings → Revisionslog** (Audit log). All significant administrative actions are logged with timestamp, action type, details, and IP address.
**Will enabling Claude NER increase costs significantly?**
For a typical school or municipal scan the cost is negligible: Claude Haiku is billed in fractions of an øre per document, and results are cached, so the same document is never billed twice. A full scan of 10,000 documents typically costs under 7 kr. The biggest gain is in name-dense documents (class lists, case files), where spaCy previously missed many names.
---
*GDPR Scanner v1.7.9. Technical setup and configuration: see README.md*
*GDPR Scanner v1.6.14. Technical setup and configuration: see README.md*

View File

@ -1,6 +1,6 @@
# GDPR Scanner — User Manual
Version 1.7.9
Version 1.6.14
---
@ -33,7 +33,7 @@ When items are found, you can review them, decide what to do with each one (keep
**What it scans:**
- Microsoft 365: Exchange email, OneDrive, SharePoint, Teams
- Google Workspace: Gmail, Google Drive
- Local and network file shares (including SMB/NAS drives and SFTP servers)
- Local and network file shares (including SMB/NAS drives)
**What it finds:**
- CPR numbers (Danish civil registration numbers)
@ -50,16 +50,16 @@ When items are found, you can review them, decide what to do with each one (keep
When you open the scanner, the screen is divided into three areas:
```
┌─────────────────┬──────────────────────────────────────────
┌─────────────────┬──────────────────────────────────────────┐
│ │ Top bar: Scan button, profiles, actions │
│ Left sidebar ├──────────────────────────────────────────
│ Left sidebar ├──────────────────────────────────────────┤
│ │ │
│ - Sources │ Results / scan progress │
│ - Options │ │
│ - Accounts │ │
│ - Stats ├──────────────────────────────────────────
│ - Stats ├──────────────────────────────────────────┤
│ │ Activity log │
└─────────────────┴──────────────────────────────────────────
└─────────────────┴──────────────────────────────────────────┘
```
**Left sidebar** — choose what to scan and how.
@ -104,33 +104,17 @@ The Google Workspace tab lets you connect a Google Workspace (formerly G Suite)
| Gmail | All emails in each user's inbox and labels |
| Google Drive | All files owned by or shared with each user |
### 3.3 Local, Network, and SFTP File Sources
### 3.3 Local and Network File Shares
The **Filkilder** (File Sources) tab lists any local folders, network drives, or SFTP servers you have configured.
The **Filkilder** (File Sources) tab lists any local folders or network drives you have configured.
**To add a new file source:**
1. Enter a **Label** — a friendly name you will recognise (e.g. "Skolens Fællesmappe").
2. Select the **source type** using the pill selector at the top of the form:
**Local**
- Enter the **Path** to the folder: `~/Documents` or `/Volumes/Share`.
- Click **Tilføj** (Add).
**Network (SMB)**
- Enter the **Path** in UNC format: `//nas-server/shared` or `\\server\share`.
- Fill in the **SMB Host**, **Username**, and **Password** that appear. The password is stored securely in your system keychain.
- Click **Tilføj** (Add).
**SFTP**
- Enter the **Host** (hostname or IP address of the SSH/SFTP server).
- Enter the **Port** (default 22).
- Enter the **Username**.
- Enter the **Remote path** to scan (e.g. `/home/shared` or `/`).
- Choose the **Authentication type**:
- **Password** — enter the password. It is stored securely in your system keychain.
- **Private key** — click **Upload key file** and select your SSH private key (OpenSSH or PEM format). If the key is passphrase-protected, enter the passphrase. The key file is stored in the scanner's data directory with `600` permissions.
- Click **Tilføj** (Add).
2. Enter the **Path**:
- Local folder: `~/Documents` or `/Volumes/Share`
- Network share: `//nas-server/shared` or `\\server\share`
3. If it is a network share, fill in the **SMB Host**, **Username**, and **Password** that appear automatically. The password is stored securely in your system keychain.
4. Click **Tilføj** (Add).
You can add as many file sources as you need. Each one will appear as a selectable source in the main sidebar when you are ready to scan.
@ -170,10 +154,6 @@ Only scan items modified after a certain date. Quick presets — **1 år**, **2
**Max emails per user** — stop after scanning this many emails per person (default 2,000). Increase if you need complete coverage.
**CPR-only mode** — when enabled, only items containing at least one qualifying CPR number are flagged. Items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped. Useful when you want a focused CPR-only report without noise from other data types.
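As a rough illustration of what CPR-only mode filters out, the sketch below applies the rule to a few made-up item records. The record fields (`cpr_count`, `emails`, `faces`) and the function name are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Hypothetical item records; the field names are illustrative only.
ITEMS: List[Dict] = [
    {"name": "grades.xlsx", "cpr_count": 3, "emails": 0, "faces": 0},
    {"name": "newsletter.eml", "cpr_count": 0, "emails": 12, "faces": 0},
    {"name": "class_photo.jpg", "cpr_count": 0, "emails": 0, "faces": 24},
]


def apply_cpr_only(items: List[Dict], cpr_only: bool) -> List[Dict]:
    """With CPR-only mode on, keep only items with at least one CPR hit."""
    if not cpr_only:
        return list(items)
    return [item for item in items if item["cpr_count"] >= 1]


flagged = [item["name"] for item in apply_cpr_only(ITEMS, cpr_only=True)]
```

Here the newsletter and the class photo are skipped because their only hits are email addresses and faces.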
**OCR language** — choose the language pack(s) Tesseract uses when reading text from scanned PDFs and images. The default `Danish + English` covers the vast majority of documents. Switch to a different preset if your documents are predominantly in another language.
### 4.4 Start the Scan
Click the blue **Scan** button in the top bar.
@ -200,8 +180,6 @@ Click **▶ Genoptag** to continue from where the scan left off. Click **Start f
## 5. Understanding the Results
When you open the app, the grid shows **all open items** — every flagged item that still needs action (i.e. has no disposition), across all of your scans, not just the most recent one. As you tag items (kept, redacted, deleted, false positive, …) they drop out of this view, so what remains is your outstanding work. Each item appears once, showing its most recent state. To look at a single past scan instead, use the session picker (see *Browsing past scan sessions* below).
Each flagged item appears as a card. Here is what the badges and labels mean:
### Source badges
@ -214,8 +192,7 @@ Each flagged item appears as a card. Here is what the badges and labels mean:
| Teams | Found in a Teams channel |
| Gmail | Found in a Gmail mailbox |
| Google Drive | Found in Google Drive |
| Local / Network | Found on a local or SMB file share |
| 🔒 SFTP | Found on an SFTP server |
| Local / Network | Found on a file share |
### Risk level
@ -249,19 +226,6 @@ Use the filter bar above the results to narrow down what you see:
- **Disposition dropdown** — show items by their review status.
- **Transfer dropdown** — filter by shared / external / all.
- **Risk dropdown** — show only Art. 9, photos, GPS, or high-risk items.
- **Role dropdown** — show only **Ansatte** (staff) or **Elever** (students). Also scopes exports: clicking **Excel** or **Art.30** while a role is selected produces a report containing only that group, with `_elever` or `_ansatte` appended to the filename.
### Browsing past scan sessions
Once a scan has completed, you can review results from any earlier scan session without running a new scan.
- Click the **Sessions** button in the history banner (which appears above the results grid after a scan completes) to open the session picker.
- Each row shows the date and time, which sources were scanned, and how many items were flagged. A **Δ** badge marks delta scans; **Latest** marks the most recent session.
- Click any row to load that session's results into the grid. A history banner replaces the progress bar, showing the session details.
- Click **Open items** in the banner to leave the past session and return to the default view of all items still needing action.
- Starting a new scan automatically exits history mode and switches back to live results.
All filters, exports, and disposition tagging work normally while browsing past sessions.
---
@ -276,7 +240,6 @@ The preview shows:
- All CPR numbers found and their context
- Other personal data detected (phone, email address, IBAN, etc.)
- Sharing and external-access information
- **Related documents** — if other items in the same scan session share one or more CPR numbers with this item, a "Related documents" section lists them. Click any row to open that item's preview. This helps you track the same person's data across multiple files or emails.
### Setting a disposition
@ -292,47 +255,7 @@ Every item has a **Disposition** dropdown in the preview panel. Choose one of:
| Privat brug — uden for scope | Personal item, not in scope for GDPR processing |
| Slettet | Already deleted (set automatically when you delete an item) |
After choosing, click **Save**. A small **✓ Saved** confirmation appears.
### Redacting a file in-place
A **✂** button appears on result cards where the scanner can overwrite the file directly. Clicking it replaces all CPR numbers with `██████-████` blocks and logs the action as a `"redacted"` disposition. The card is **kept in the grid until your next scan** — it is greyed out, shows a green **✏ Redacted** badge, and its action buttons are hidden so it cannot be processed again. This lets you see at a glance what you handled during the session; the grid is rebuilt the next time you scan. This is useful when you want to sanitise a file rather than delete it entirely.
The button is available for the following source types and formats:
| Source | Supported formats |
|---|---|
| Local files | DOCX, XLSX, CSV, TXT, PDF |
| Network share (SMB) | DOCX, XLSX, CSV, TXT, PDF |
| SFTP | DOCX, XLSX, CSV, TXT, PDF |
| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
| Google Drive | DOCX, XLSX, PDF |
The button is **not** available for email items (Exchange/Gmail) or viewer mode. Google Docs and Sheets that were exported as DOCX/XLSX during scanning cannot be redacted in-place — export the file from Google manually first, then redact the downloaded copy.
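For plain-text formats such as TXT and CSV, the replacement step can be sketched as follows. The pattern and helper name are assumptions for illustration; the scanner's own redaction code, especially for DOCX, XLSX, and PDF, is certainly more involved.

```python
import re
from typing import Tuple

# Assumed pattern: six digits, optional hyphen, four digits.
CPR_PATTERN = re.compile(r"\b\d{6}-?\d{4}\b")
REDACTION_BLOCK = "██████-████"


def redact_cpr(text: str) -> Tuple[str, int]:
    """Replace every CPR-like number with the block mask.

    Returns the redacted text and the number of replacements made.
    """
    return CPR_PATTERN.subn(REDACTION_BLOCK, text)


redacted, count = redact_cpr("Pupil 010190-1234, guardian 0202915678.")
```

Returning the replacement count alongside the text makes it easy to record how many numbers were masked when the action is logged.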
> **PDF security note:** PDF redaction uses physical removal — the CPR number text is erased from the PDF data stream, not just painted over with a black box. A reader cannot recover the original text by selecting under the redaction or inspecting the file programmatically. Image-based (scanned) PDFs are also supported: the scanner locates the CPR number on the page image via OCR and physically overwrites that region.
> **OneDrive / SharePoint / Teams note:** Redaction writes the modified file back via the Microsoft Graph API and requires the `Files.ReadWrite.All` permission. The scanner now requests this permission automatically during sign-in. If you authenticated before this update, sign out and sign back in (Settings → Microsoft 365 → Sign out) so the scanner obtains a new token with write access. For app-only (service principal) setups, a Global Admin must grant the `Files.ReadWrite.All` application permission in Azure → App registrations → API permissions → Grant admin consent.
> **Google Drive note:** Drive redaction requires the `drive` scope on the service account's domain-wide delegation grant (not just `drive.readonly`). If redaction fails with a permission error, ask your Google Workspace admin to add the `https://www.googleapis.com/auth/drive` scope to the service account delegation in the Admin Console.
> **SFTP note:** SFTP redaction is only available for items found in the current scan session. If you are browsing historical results, re-run the scan first.
### Bulk tagging multiple items at once
If you need to apply the same disposition to many items, use **Select mode** instead of opening each card individually.
1. Click **Vælg** (Select) in the filter bar. Per-card checkboxes appear on every result card.
2. Tick the items you want to tag, or click **Select all visible** in the bulk tag bar at the bottom of the screen to select everything matching the current filters.
3. Choose a disposition from the dropdown in the bulk tag bar.
4. Click **Apply**. All selected items are updated immediately.
5. Click **Done** (or the same **Vælg** button again) to leave select mode.
> **Tip:** Use the filter bar to narrow down to, for example, all unreviewed student items before clicking **Select all visible** — this lets you tag an entire category in two clicks.
### Disposition stats bar
A thin stats bar sits above the results grid showing: **Total · Unreviewed · Retain · Delete** counts and a **% reviewed** figure. It updates automatically after every disposition save, giving you a live overview of how far through the review you are.
After choosing, click **Gem**. A small **✓ Gemt** confirmation appears.
### Finding all items for a specific person
@ -364,8 +287,6 @@ Click the **Delete** button in the filter bar to open the bulk delete modal.
4. A progress bar shows deletions as they happen. Emails go to **Deleted Items**; files go to the **recycle bin**.
Deleted items (whether from a single delete, a bulk delete, or a data-subject erasure) are **kept in the grid until your next scan** — greyed out with a red **🗑 Deleted** badge and their action buttons hidden — so you can see what was removed during the session. When a bulk delete partially fails, only the items the server actually deleted are marked; any that failed stay active so you can retry them. The grid is rebuilt the next time you scan.
A full audit log of every deletion (what was deleted, when, and why) is included in the Article 30 report.
---
@ -402,7 +323,7 @@ Click **Profiles** to open the profile management panel. Here you can:
Click **Excel** in the filter bar to download the current results as an Excel workbook. The workbook contains:
- A summary tab with scan date, item counts, and source breakdown.
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network, SFTP).
- A separate tab for each source type (Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network).
- Every flagged item, including source, account, CPR count, risk level, sharing status, and disposition.
The **Excel** and **Art.30** buttons are always available — even after restarting the application — and will export the results from the most recent completed scan session without requiring a new scan.
@ -437,22 +358,15 @@ You can give a DPO, school principal, or compliance coordinator read-only access
Click the **🔗** button in the top-right of the top bar to open the Share panel.
1. Optionally enter a **Label** to identify who the link is for (e.g. "DPO review April 2026").
2. Choose a **Scope**:
- **All roles** — the recipient sees all flagged items.
- **Ansatte** / **Elever** — the recipient sees only items belonging to that role group. The role filter is locked in their view.
- **User** — the recipient sees only the items belonging to a specific employee. Select the person from the search box; the scanner matches both their M365 and Google Workspace email addresses automatically. Use this when you want to give an individual employee access to their own scan results.
3. Optionally set a **Date range** — use the "Items from" and "Items until" date fields to limit the recipient to items modified within a specific period. This lets you, for example, create a link covering only last year's scan results. Leave both fields blank for no date restriction.
4. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never.
5. Click **Create**. The form clears and the new link appears at the top of the **Active links** list below, briefly highlighted.
6. Click **Copy** on that link's row to copy it to your clipboard, then send it to the reviewer.
2. Choose an **Expiry** — 7 days, 30 days, 90 days, 1 year, or Never.
3. Click **Create**. A unique link is generated: `http://host:5100/view?token=…`
4. Click **Copy** to copy the link to your clipboard, then send it to the reviewer.
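How a token link with an expiry might be generated and checked can be sketched as below. Everything here is an assumption made for illustration (the function names, the dictionary layout, the token length, and treating "1 year" as 365 days); only the `/view?token=` URL shape and the expiry choices come from the steps above.

```python
import secrets
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Expiry choices from the Share panel, mapped to days (None = never expires).
EXPIRY_DAYS = {"7 days": 7, "30 days": 30, "90 days": 90, "1 year": 365, "Never": None}


def create_viewer_link(base_url: str, expiry: str, now: datetime) -> dict:
    """Create a random, unguessable token and compute its expiry time."""
    token = secrets.token_urlsafe(32)
    days = EXPIRY_DAYS[expiry]
    expires_at = now + timedelta(days=days) if days is not None else None
    url = f"{base_url}/view?{urlencode({'token': token})}"
    return {"url": url, "token": token, "expires_at": expires_at}


def is_valid(link: dict, now: datetime) -> bool:
    """A link is valid if it never expires or has not yet expired."""
    return link["expires_at"] is None or now < link["expires_at"]


t0 = datetime(2026, 4, 1, tzinfo=timezone.utc)
link = create_viewer_link("http://host:5100", "7 days", t0)
forever = create_viewer_link("http://host:5100", "Never", t0)
```

In such a design, revoking a link would amount to deleting its stored token, which is why individual links can be invalidated without affecting the others.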
The reviewer opens the link in any browser. They see the results grid (filtered to their permitted scope) and can tag dispositions but cannot start scans, change settings, view credentials, or delete items.
The reviewer opens the link in any browser. They see the full results grid and can tag dispositions but cannot start scans, change settings, view credentials, or delete items.
**Managing existing links**
The Share panel lists all active links. Each row shows the label, role badge (if scoped), expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
> **Tip:** In schools and municipalities it is common to have separate DPOs or compliance officers for staff data and student data. Create one scoped link for each — the student DPO will only ever see student items, and the staff DPO will only see staff items.
The Share panel lists all active links. Each row shows the label, expiry date, and when the link was last used. Click **Copy** to copy a link again, or **Revoke** to invalidate it immediately.
### 10.2 Viewer PIN
@ -460,7 +374,7 @@ As an alternative to token links, you can set a numeric PIN (4–8 digits) in **
To set or change the PIN, enter the new PIN in the **New PIN** field and click **Save PIN**. To remove it, click **Clear PIN**.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked, has an expiry date, and can be role-scoped. Use the PIN option only for trusted internal reviewers on your local network who need access to all results.
> **Security note:** Token links are more secure than a PIN because each link can be individually revoked and has an expiry date. Use the PIN option only for trusted internal reviewers on your local network.
### 10.3 What the reviewer can do
@ -477,7 +391,6 @@ To set or change the PIN, enter the new PIN in the **New PIN** field and click *
| Delete items | No |
| Access Settings | No |
| Create or revoke viewer links | No |
| See items outside their role scope | No |
---
@ -496,7 +409,6 @@ Go to **Settings → Planlægger** to configure automatic scans.
7. Optionally enable:
- **Send rapport automatisk** — email the Excel report to your configured recipients after each scan.
- **Håndhæv opbevaringspolitik** — automatically delete items older than your retention policy after each scan.
- **Report only** — skip the scan entirely and just email the latest results already in the database. Useful for sending a regular summary email without running a new scan. When enabled, no profile is needed and M365 authentication is not required.
8. Click **Gem** (Save).
The scheduler indicator in the top bar shows the date and time of the next scheduled scan ("Next: …").
@ -528,17 +440,7 @@ Click **Gem** to save, then click **Test** to send a test email and verify the c
> If your account has MFA (two-factor authentication) enabled, you cannot use your regular password. You need to create an **App Password** in your account security settings:
> - **Microsoft personal account**: account.microsoft.com/security → App passwords
> - **Gmail / Google Workspace**: myaccount.google.com → Security → 2-Step Verification → App passwords (for Google Workspace accounts your administrator must first allow App Passwords, or set up an SMTP relay)
### Always send via SMTP (skip Microsoft Graph)
When the scanner is signed in to Microsoft 365, it normally sends email through Microsoft 365 directly, without using the SMTP settings above. This is convenient, but it cannot deliver to some addresses — most notably an address on a Google-hosted subdomain of your Microsoft 365 domain, which Microsoft 365 treats as internal and silently discards (no delivery, no error).
Turn on **Send altid via SMTP (spring Microsoft Graph over)** to force all email — test emails, manual reports, and the after-scan auto-email — through the SMTP server you configured above. Use this when your reports go to a mailbox Microsoft 365 won't deliver to (for example a Google Workspace address), with `smtp.gmail.com` / `smtp-relay.gmail.com` as the SMTP host.
### Email report after manual scan
Turn on **Send rapport efter manuel scanning** to automatically email the report to your configured recipients every time a manual scan finishes.
> - **Gmail**: myaccount.google.com → Security → 2-Step Verification → App passwords
### Sending a report manually
@ -578,7 +480,6 @@ Click **Reset DB** to wipe all scan data, dispositions, and deletion log. This i
| Setting | Description |
|---------|-------------|
| Theme | Dark or light mode |
| Software update | Check for and install new versions of the scanner directly from the browser, or enable automatic daily updates. Only shown on server installations running from a git checkout (not in the desktop app). The app restarts itself after installing; updating is refused while a scan is running, and the next scan after an update continues normally. |
### Security tab
@ -586,7 +487,6 @@ Click **Reset DB** to wipe all scan data, dispositions, and deletion log. This i
|---------|-------------|
| Admin PIN | Optional PIN that protects destructive actions (database reset, replace import) |
| Viewer PIN | Optional 4–8 digit PIN that lets anyone open `/view` in a browser for read-only access to results without a token link |
| Interface PIN | Optional 4–8 digit PIN that must be entered before accessing the main scanner interface. Anyone reaching the scanner URL is redirected to a login page until the correct PIN is entered. Viewer access via `/view` is not affected. |
### Advanced scan options
@ -596,31 +496,6 @@ These options are in the left sidebar under **Indstillinger**:
**Scan photos for faces** — slower scan that detects photographs containing recognisable human faces. Flags them as Article 9 biometric data. Recommended for schools storing student photos.
**Ignore GPS in images** — when enabled, images whose only PII signal is an embedded GPS location are not flagged. Useful when scanning student accounts: smartphones embed GPS coordinates in every photo taken with the camera app, which would otherwise generate large numbers of flags that are low-priority for a school context. If an image is already flagged for another reason (faces, EXIF author field), the GPS coordinate is still shown in the detail card.
**Min. CPR count per file** — only flag a file if it contains at least this many *distinct* CPR numbers. The default is 1 (current behaviour). Setting it to 2 avoids false positives in student scans: a student's own consent form or registration document typically contains only their own CPR number, while a class list or grade sheet containing multiple students' CPRs will still be reported.
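The threshold described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scanner's actual code: the names `distinct_cprs` and `meets_cpr_threshold` are hypothetical, and the CPR pattern is deliberately simplified (six digits, optional hyphen, four digits, with no date or checksum validation).

```python
import re

# Simplified CPR-like pattern (assumption): six digits, optional hyphen, four digits.
# The real detector may validate dates and apply stricter boundaries.
CPR_PATTERN = re.compile(r"\b(\d{6})-?(\d{4})\b")


def distinct_cprs(text: str) -> set[str]:
    """Return the distinct CPR-like numbers in the text, normalised without hyphen."""
    return {m.group(1) + m.group(2) for m in CPR_PATTERN.finditer(text)}


def meets_cpr_threshold(text: str, min_count: int = 1) -> bool:
    """True if the text contains at least min_count distinct CPR-like numbers."""
    return len(distinct_cprs(text)) >= min_count
```

With a threshold of 2, a form that repeats one student's number several times would not be flagged, while a list containing two different numbers would be.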
**CPR-only mode** — when enabled, items with no CPR numbers (only email addresses, phone numbers, faces, or GPS/EXIF data) are skipped entirely. Use this when you want a lean report focused exclusively on CPR exposure.
**OCR language** — selects the Tesseract language pack(s) used when reading scanned PDFs and images. Default: `Danish + English`. Change to a different preset if your documents are in another language (German, Swedish, French presets are available).
### AI / NER tab
Go to **Settings → AI / NER** to configure Claude AI-powered Named Entity Recognition.
By default the scanner uses spaCy (a local machine-learning model) to detect person names, addresses, and organisation names in document text. Enabling Claude NER replaces this with calls to the Claude Haiku API, which is significantly more accurate — especially for Danish hyphenated surnames (e.g. "Hansen-Nielsen"), foreign-origin names, and names that appear without surrounding context (such as isolated cells in a spreadsheet).
**To enable:**
1. Obtain an Anthropic API key from [console.anthropic.com](https://console.anthropic.com).
2. Paste the key into the **Anthropic API key** field and click **Save**.
3. Turn on the **Enable Claude NER** toggle and click **Save** again.
4. Click **Test key** to confirm the key is valid and the API is reachable.
**Cost:** Claude Haiku is charged per token at Anthropic's published rates. A typical document costs a fraction of a cent. Scan results are cached per document, so re-scanning the same file never incurs a second charge.
**Fallback:** If the `anthropic` package is not installed or the API key is missing, the scanner automatically falls back to spaCy with no error — the toggle simply has no effect.
**Retention policy** — when enabled, marks items older than the specified number of years as overdue. The fiscal year end setting determines how the cutoff date is calculated:
| Option | Cutoff date calculation |
@ -629,12 +504,6 @@ By default the scanner uses spaCy (a local machine-learning model) to detect per
| 31 dec (Bogføringsloven) | Last 31 December minus N years |
| 30 jun / 31 mar | Last occurrence of that date minus N years |
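The two fixed-date rows above can be read as: find the most recent fiscal year end, then go back N years. A minimal sketch of that arithmetic follows. The function name `retention_cutoff` is hypothetical, and treating a fiscal year end that falls on today's date as the "last occurrence" is an assumption, since the table does not specify that edge case.

```python
from datetime import date


def retention_cutoff(today: date, fy_month: int, fy_day: int, years: int) -> date:
    """Most recent fiscal year end on or before `today`, minus `years` years."""
    fy_end = date(today.year, fy_month, fy_day)
    if fy_end > today:
        # This year's fiscal year end has not occurred yet; use last year's.
        fy_end = date(today.year - 1, fy_month, fy_day)
    # 31 Dec, 30 Jun and 31 Mar exist in every year, so no leap-year handling is needed.
    return date(fy_end.year - years, fy_month, fy_day)
```

For example, with a 5-year policy and a 31 December fiscal year end, a scan on 15 April 2026 would use 31 December 2020 as the cutoff under these assumptions.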
### Audit Log tab
Go to **Settings → Audit Log** to view an immutable log of all significant admin actions performed in the scanner. Each entry shows the time, action type, detail, and client IP address. Recorded events include: profile save/delete, viewer token create/revoke, PIN changes, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, dispositions, item delete, and item redact.
The log is read-only and is stored in the scanner database alongside scan results. It is included in database exports and can help you demonstrate accountability to a supervisory authority.
---
## 15. Frequently Asked Questions
@ -646,10 +515,10 @@ No. CPR numbers found during a scan are stored only as a count (e.g. "3 CPR numb
Emails are moved to the user's **Deleted Items** folder in Exchange — they are not permanently deleted and can be recovered by the user or an administrator. Files are moved to the **recycle bin** of the relevant service (OneDrive, SharePoint, file system). A permanent deletion requires a second action by the user or admin.
**Can I scan without connecting to Microsoft 365?**
Yes. You can scan local folders, SMB/NAS drives, and SFTP servers without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths or SFTP server details.
Yes. You can scan local and SMB file shares without any M365 or Google connection. Open **Sources**, go to the **Filkilder** tab, and add your file paths.
**What is delta scanning and when should I use it?**
Delta scanning uses Microsoft Graph change tokens (for M365) and the Google Drive Changes API (for Google Workspace) to fetch only items modified since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have done a full baseline scan. Enable it in the Options section of the sidebar.
Delta scanning uses Microsoft Graph change tokens to fetch only items modified since the last scan. It is ideal for regular (e.g. weekly) compliance checks after you have done a full baseline scan. Enable it in the Options section of the sidebar.
**The scan stopped — can I continue where it left off?**
Yes. When you restart the scan, a yellow banner will offer to resume from the checkpoint. Click **▶ Genoptag** to continue. If you prefer to start over, click **Start fresh**.
@ -666,21 +535,9 @@ In the accounts section of the sidebar, there is an **+ Tilføj konto manuelt**
**Is the scanner running? I cannot see a progress bar.**
Check the activity log at the bottom of the screen. If a scan is running it will show messages there. If you see nothing, the scan may have completed or not started. Also check that you have at least one source ticked and at least one account selected.
**Can I password-protect the scanner so students or colleagues cannot access it on the network?**
Yes. Go to **Settings → Security → Interface PIN** and set a 4–8 digit PIN. From that point on, anyone who opens the scanner URL in a browser is shown a PIN entry page and cannot proceed without the correct code. This is separate from the Admin PIN (which protects destructive actions) and the Viewer PIN (which protects read-only access). Existing viewer token links still work without the interface PIN.
**Can a reviewer tag dispositions without access to the scan controls?**
Yes. Use the **🔗 Share** button to create a read-only viewer link or set a Viewer PIN in Settings → Security. The reviewer opens the link in their browser and can browse results and tag dispositions without seeing credentials, sources, or scan buttons. See section 10 for details.
**Can I limit a reviewer's link to a specific time period?**
Yes. When creating a token link, use the "Items from" and "Items until" date fields to restrict the link to items modified within that range. The reviewer will only see items whose modification date falls within the window you specified.
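The date window can be pictured as a simple filter on each item's modification date. The sketch below is illustrative only: `in_link_window` is a hypothetical name, and treating both bounds as inclusive is an assumption the manual does not confirm.

```python
from datetime import date
from typing import Optional


def in_link_window(modified: date,
                   items_from: Optional[date] = None,
                   items_until: Optional[date] = None) -> bool:
    """True if `modified` falls inside the link's date range.

    A bound left as None means no restriction on that side,
    matching the "leave both fields blank" behaviour.
    """
    if items_from is not None and modified < items_from:
        return False
    if items_until is not None and modified > items_until:
        return False
    return True
```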
**Where can I see who changed what in the scanner?**
Go to **Settings → Audit Log**. Every significant admin action is recorded there with a timestamp, action type, detail, and IP address.
**Will enabling Claude NER increase costs significantly?**
For a typical school or municipality scan the cost is negligible — Claude Haiku charges fractions of a cent per document, and results are cached so the same file is never billed twice. A full scan of 10 000 documents typically costs under $1. The biggest gain is on name-dense documents (class lists, case files) where spaCy previously missed many names.
---
*GDPR Scanner v1.7.9 — for technical setup and configuration see README.md*
*GDPR Scanner v1.6.14 — for technical setup and configuration see README.md*

View File

@ -1,148 +0,0 @@
# HTTPS via Zoraxy Reverse Proxy
Step-by-step guide for putting GDPRScanner behind [Zoraxy](https://github.com/tobychui/zoraxy) with a Let's Encrypt certificate, on a LAN-only deployment.
Why bother on an internal network:
- **Encryption in transit** — the scanner streams CPR numbers, document previews, and share links. Serving that over plain HTTP to DPO reviewers is itself a compliance finding.
- **Secure context** — the browser Clipboard API (share-link Copy buttons) only exists on HTTPS or localhost. Over plain HTTP the app falls back to a legacy copy mechanism.
- **A real hostname** — `https://gdprscanner.example.dk` instead of `http://10.x.x.x:5100` in share links, bookmarks, and emails.
This guide assumes Zoraxy runs **on the same host** as the scanner. If it runs elsewhere, replace `127.0.0.1:5100` with the scanner host's LAN IP and firewall port 5100 to the Zoraxy host only.
---
## 1. DNS record
Create an A-record for the hostname pointing at the server's **LAN IP**:
```
gdprscanner.example.dk A 10.x.x.x
```
A public DNS record pointing at a private IP is fine — outsiders can resolve the name but cannot route to the address, which is exactly the "LAN-only" goal.
> **Consequence:** because the server is not reachable from the internet, Let's Encrypt's default HTTP-01 challenge cannot work. The certificate **must** be issued via the **DNS-01 challenge** (step 4). If you prefer not to publish the internal IP at all, use an internal/split-horizon DNS record instead — DNS-01 still works since it validates against the public DNS zone, not the server.
---
## 2. Install Zoraxy
```bash
mkdir -p /opt/zoraxy && cd /opt/zoraxy
wget -O zoraxy https://github.com/tobychui/zoraxy/releases/latest/download/zoraxy_linux_amd64
chmod +x zoraxy
```
`/etc/systemd/system/zoraxy.service`:
```ini
[Unit]
Description=Zoraxy reverse proxy
After=network.target
[Service]
WorkingDirectory=/opt/zoraxy
ExecStart=/opt/zoraxy/zoraxy
Restart=always
[Install]
WantedBy=multi-user.target
```
```bash
systemctl daemon-reload && systemctl enable --now zoraxy
```
Open the management UI at `http://<server-ip>:8000` and create the admin account.
> Menu names below may differ slightly between Zoraxy versions — the concepts to look for are: ACME certificate with DNS challenge, host-based proxy rule, TLS on the incoming port.
---
## 3. Incoming port and TLS
In Zoraxy's global settings:
- Set the incoming proxy port to **443** and enable **TLS**.
- Enable **force-redirect port 80 → 443** so plain-HTTP visits upgrade automatically.
---
## 4. Certificate via ACME (DNS-01)
In **TLS / SSL Certificates → ACME**:
1. Enter the hostname (`gdprscanner.example.dk`).
2. Enable the **DNS challenge** and select the DNS provider that hosts your zone (Cloudflare, Simply.com, etc.).
3. Paste the provider's **API token/credentials** — created in the DNS provider's control panel.
4. Request the certificate. Zoraxy renews it automatically.
If your DNS host has no API, Zoraxy can generate a **self-signed certificate** as a fallback — it works, but every client machine must trust it manually. Getting a DNS API token is the better one-time investment.
---
## 5. Proxy rule
**HTTP Proxy → New Proxy Rule**:
| Field | Value |
|---|---|
| Matching hostname | `gdprscanner.example.dk` |
| Target | `127.0.0.1:5100` |
| TLS to target | Off (the scanner speaks plain HTTP locally) |
---
## 6. Close the side doors
**Bind the scanner to loopback** so only Zoraxy can reach Flask. Wherever the scanner is started (systemd unit or `start_gdpr.sh`), add:
```bash
--host 127.0.0.1
```
After a restart, `http://<server-ip>:5100` stops responding by design. The in-app self-update restart preserves the argument.
Optional hardening:
- Add a Zoraxy **Access Rule** whitelisting your LAN CIDR (e.g. `10.0.0.0/8`) on the proxy rule.
- Firewall the Zoraxy **management port 8000** to admin machines only.
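To reason about which clients a CIDR whitelist such as `10.0.0.0/8` admits, Python's standard `ipaddress` module can be used. This is only a sketch for checking addresses by hand; `is_lan_client` is a hypothetical helper and says nothing about how Zoraxy evaluates its Access Rules internally.

```python
import ipaddress


def is_lan_client(client_ip: str, lan_cidr: str = "10.0.0.0/8") -> bool:
    """True if client_ip falls inside the whitelisted LAN range."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(lan_cidr)
```

Note that a network using `192.168.x.x` addresses would not match the `10.0.0.0/8` example, so the CIDR has to reflect the actual LAN.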
---
## 7. Firewall / perimeter checklist
The Zoraxy whitelist (step 6) is an **application-layer** control — a rejected request has still completed the TCP and TLS handshake against your box, and any proxy host you forget to tag is fully exposed. The firewall is the real perimeter. Work this checklist whenever you stand up or replace the edge firewall:
- [ ] **No inbound port-forward unless a service is intentionally public.** A LAN-only deployment needs *zero* inbound forwards — DNS-01 (step 4) is outbound-only, so certificates issue and renew with the firewall fully closed.
- [ ] **If any service is intentionally public** (e.g. a media server), forward **443 only to the Zoraxy host** — never to individual app hosts. Everything then enters through Zoraxy, where the per-host Access Rule decides public vs. private.
- [ ] **The per-host whitelist stays your public/private boundary even with the firewall in place** — it is not made redundant by the firewall. Public hosts use the `default` rule; every internal-only host gets **Local Access Only**.
- [ ] **New proxy hosts default to public.** Zoraxy applies the `default` rule to any host with no rule set, so a freshly-added internal service is reachable the moment it exists. Set its Access Rule to **Local Access Only** *at creation time*.
- [ ] **Management ports are LAN-only.** Zoraxy admin (`:8000`) and any app admin UI must never be forwarded; tag them **Local Access Only** as well.
- [ ] **Verify from off-network.** From a connection outside the LAN (e.g. a phone on mobile data), confirm private hostnames are blocked and only the intentionally-public ones respond:
```bash
curl -v https://gdprscanner.example.dk # should fail/refuse from outside
nmap -Pn -p 80,443,5100 <your-public-IP> # only intentionally-open ports listed
```
---
## 8. Verify the scanner-specific behaviour
1. `https://gdprscanner.example.dk` loads with a valid padlock; `http://` redirects.
2. **Run a scan and watch result cards stream in live** — that is the Server-Sent Events connection (`/api/scan/stream`) passing through the proxy. If progress stalls while the scan log advances, look at proxy buffering/timeout settings.
3. Create a **share link** — it must start with `https://gdprscanner.example.dk/view?token=…`. The app uses the page origin automatically on HTTPS (the LAN-IP rewrite only applies when browsing at localhost). The Copy buttons now use the native Clipboard API.
4. **Settings → General → Software update → Check for updates** still works (outbound git fetch is unaffected by the proxy).
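When debugging the streaming check in step 2, it can help to know how a Server-Sent Events body is framed: `data:` lines accumulate until a blank line dispatches the event. The sketch below parses that framing from any iterable of lines. The name `iter_sse_data` is hypothetical, and the payload format of `/api/scan/stream` is not documented here, so the parser treats each payload as an opaque string.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield "\n".join(data)
                data = []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
```

If the proxy buffers the response, events would arrive in one burst at the end of the scan instead of one at a time, even though this parsing stays the same.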
---
## Troubleshooting
| Symptom | Cause / fix |
|---|---|
| Certificate request fails | HTTP-01 attempted against an unreachable host — make sure the **DNS challenge** is selected and the API credentials are for the zone's actual DNS host |
| Cards don't stream during scans | Proxy buffering the SSE response — check Zoraxy timeout/buffering settings for the rule |
| Share links still show the LAN IP | Page was loaded via the old `http://<ip>:5100` URL — use the HTTPS hostname; links follow the page origin |
| `http://<ip>:5100` still reachable | The `--host 127.0.0.1` flag is missing from the scanner's launch command |

View File

@ -53,21 +53,6 @@ import sys
from datetime import date, datetime, timedelta
from pathlib import Path
try:
import psutil as _psutil
_PSUTIL_OK = True
except ImportError:
_PSUTIL_OK = False
_OCR_MEM_THRESHOLD_MB = 500
def _ocr_mem_ok() -> bool:
"""Return False if available RAM is below the threshold for OCR rendering."""
if not _PSUTIL_OK:
return True
return _psutil.virtual_memory().available >= _OCR_MEM_THRESHOLD_MB * 1024 * 1024
# Suppress pdfminer's noisy font-descriptor warnings that appear when PDFs
# contain malformed or incomplete font definitions. These do not affect text
# extraction or CPR detection — the warning is informational only.
@ -117,12 +102,6 @@ try:
except ImportError:
SPACY_OK = False
try:
import anthropic as _anthropic
ANTHROPIC_OK = True
except ImportError:
ANTHROPIC_OK = False
try:
from docx import Document as DocxDocument
DOCX_OK = True
@ -238,91 +217,6 @@ def load_nlp():
return None
# ── Claude NER ────────────────────────────────────────────────────────────────
def _get_claude_ner_config() -> "tuple[bool, str]":
"""Read Claude NER settings from config.json. Small file — OS-cached."""
try:
from app_config import _load_config, get_claude_api_key
cfg = _load_config()
return bool(cfg.get("claude_ner")), get_claude_api_key()
except Exception:
return False, ""
_CLAUDE_NER_CACHE: "dict[int, list[dict]]" = {}
_CLAUDE_NER_LOCK = None
def _claude_lock():
global _CLAUDE_NER_LOCK
if _CLAUDE_NER_LOCK is None:
import threading as _th
_CLAUDE_NER_LOCK = _th.Lock()
return _CLAUDE_NER_LOCK
def _ner_claude(text: str, api_key: str) -> "list[dict]":
"""
Extract named entities via Claude Haiku. Returns list of
{"text": str, "type": "NAME"|"ADDRESS"|"ORG"}.
In-memory cache keyed by hash(text); evicts oldest when > 2000 entries.
"""
if not ANTHROPIC_OK or not api_key:
return []
cache_key = hash(text)
lock = _claude_lock()
with lock:
if cache_key in _CLAUDE_NER_CACHE:
return _CLAUDE_NER_CACHE[cache_key]
try:
import json as _json
client = _anthropic.Anthropic(api_key=api_key)
CHUNK = 8_000
entities: "list[dict]" = []
for i in range(0, min(len(text), CHUNK * 10), CHUNK):
chunk = text[i : i + CHUNK]
if not chunk.strip():
continue
msg = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=512,
messages=[{
"role": "user",
"content": (
"Extract personal data from the text. "
"Return ONLY valid JSON: "
"{\"entities\":[{\"text\":\"<exact substring>\","
"\"type\":\"NAME\"|\"ADDRESS\"|\"ORG\"}]}. "
"NAME=person names, ADDRESS=physical addresses, "
"ORG=organisation names. "
"Skip CPR numbers, emails, phones, dates. "
"Return {\"entities\":[]} if none.\n\nTEXT:\n" + chunk
),
}],
)
raw = msg.content[0].text.strip()
if "```" in raw:
raw = raw.split("```")[1]
if raw.startswith("json\n"):
raw = raw[5:]
entities.extend(_json.loads(raw).get("entities", []))
result = [e for e in entities
if isinstance(e, dict) and e.get("text") and e.get("type")]
except Exception:
result = []
with lock:
if len(_CLAUDE_NER_CACHE) >= 2_000:
try:
del _CLAUDE_NER_CACHE[next(iter(_CLAUDE_NER_CACHE))]
except Exception:
pass
_CLAUDE_NER_CACHE[cache_key] = result
return result
# ── OCR page cache ───────────────────────────────────────────────────────────
_OCR_CACHE_PATH = Path.home() / ".document_scanner_ocr_cache.db"
@ -834,27 +728,20 @@ def count_pii_types(text: str, use_ner: bool = True) -> dict:
if 1 <= int(reg) <= 9999 and len(acct) >= 6:
counts["BANK_ACCOUNT"] += 1
# NER-based counts — Claude (if enabled) else spaCy
# NER-based counts — only run if model is loaded and text is non-trivial
if use_ner and len(text.strip()) > 20:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_t = ent.get("type")
if _t in counts:
counts[_t] += 1
else:
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
for chunk_start in range(0, min(len(text), NER_LIMIT * 10), NER_LIMIT):
chunk = text[chunk_start:chunk_start + NER_LIMIT]
if not chunk.strip():
continue
doc = nlp(chunk)
for ent in doc.ents:
mapped = NER_REDACT_LABELS.get(ent.label_)
if mapped in counts:
counts[mapped] += 1
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
for chunk_start in range(0, min(len(text), NER_LIMIT * 10), NER_LIMIT):
chunk = text[chunk_start:chunk_start + NER_LIMIT]
if not chunk.strip():
continue
doc = nlp(chunk)
for ent in doc.ents:
mapped = NER_REDACT_LABELS.get(ent.label_)
if mapped in counts:
counts[mapped] += 1
return counts
@ -1000,44 +887,39 @@ def find_pii_spans_in_text(text: str, use_ner: bool = True) -> list[tuple[int, i
if _is_name_match(m):
spans.append((m.start(), m.end(), "NAME"))
# NER spans — Claude (if enabled) else spaCy
# NER (names, addresses, orgs)
# Cap at 20 000 chars per call — spaCy NER is O(n) but dense tabular text
# (e.g. Excel-converted PDFs) can have thousands of tokens per page and stall.
#
# Context boosting: spaCy needs sentence context to recognise isolated names.
# For short text (< 80 chars, e.g. a single cell or line) we prepend a label
# so the model sees "Navn: Peter Hansen" instead of bare "Peter Hansen".
# Matches are shifted back by the prefix length before being recorded.
if use_ner:
_claude_on, _claude_key = _get_claude_ner_config()
if _claude_on and ANTHROPIC_OK and _claude_key:
for ent in _ner_claude(text, _claude_key):
_label = ent.get("type")
_ent_text = ent.get("text", "")
if not _ent_text or _label not in ("NAME", "ADDRESS", "ORG"):
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
PREFIX = "Navn: "
PLEN = len(PREFIX)
# Only inject prefix for short/isolated text
if len(text.strip()) < 80:
ner_input = PREFIX + text
ner_offset = -PLEN
else:
ner_input = text
ner_offset = 0
for chunk_start in range(0, min(len(ner_input), NER_LIMIT * 10), NER_LIMIT):
chunk = ner_input[chunk_start:chunk_start + NER_LIMIT]
if not chunk.strip():
continue
for _m in re.finditer(re.escape(_ent_text), text):
spans.append((_m.start(), _m.end(), _label))
else:
# spaCy NER — cap at 20 000 chars per call (dense tabular text can stall).
# Context boosting: prepend "Navn: " for short/isolated text so spaCy
# sees sentence context; shift match positions back by prefix length.
nlp = load_nlp()
if nlp:
NER_LIMIT = 20_000
PREFIX = "Navn: "
PLEN = len(PREFIX)
if len(text.strip()) < 80:
ner_input = PREFIX + text
ner_offset = -PLEN
else:
ner_input = text
ner_offset = 0
for chunk_start in range(0, min(len(ner_input), NER_LIMIT * 10), NER_LIMIT):
chunk = ner_input[chunk_start:chunk_start + NER_LIMIT]
if not chunk.strip():
continue
doc = nlp(chunk)
for ent in doc.ents:
if ent.label_ in NER_REDACT_LABELS:
s = chunk_start + ent.start_char + ner_offset
e = chunk_start + ent.end_char + ner_offset
if e <= 0: # entity was entirely within the prefix
continue
spans.append((max(s, 0), e, NER_REDACT_LABELS[ent.label_]))
doc = nlp(chunk)
for ent in doc.ents:
if ent.label_ in NER_REDACT_LABELS:
s = chunk_start + ent.start_char + ner_offset
e = chunk_start + ent.end_char + ner_offset
if e <= 0: # entity was entirely within the prefix
continue
spans.append((max(s, 0), e, NER_REDACT_LABELS[ent.label_]))
# Merge overlapping spans
spans.sort()
@ -1262,6 +1144,11 @@ def redact_pdf_secure(input_path: Path, output_path: Path, results: dict,
page_methods = results["page_methods"]
images = None
ocr_pages = [p for p, m in page_methods.items() if m == "ocr"]
if ocr_pages and OCR_AVAILABLE:
images = convert_from_path(str(input_path), dpi=dpi, poppler_path=poppler_path)
total = 0
doc = _fitz.open(str(input_path))
@ -1274,20 +1161,10 @@ def redact_pdf_secure(input_path: Path, output_path: Path, results: dict,
if method == "text":
bboxes = (find_pii_char_bboxes(plumb_page, use_ner=use_ner)
if use_ner else find_cpr_char_bboxes(plumb_page))
elif method == "ocr" and OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped redact — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
bboxes = []
else:
_imgs = convert_from_path(
str(input_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
img = _imgs[0]
del _imgs
bboxes = (find_pii_image_bboxes(img, lang, use_ner=use_ner)
if use_ner else find_cpr_image_bboxes(img, lang))
del img
elif method == "ocr" and images is not None:
img = images[page_num - 1]
bboxes = (find_pii_image_bboxes(img, lang, use_ner=use_ner)
if use_ner else find_cpr_image_bboxes(img, lang))
else:
bboxes = []
@ -1350,6 +1227,11 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
reader = PdfReader(str(input_path))
writer = PdfWriter()
images = None
ocr_pages = [p for p, m in page_methods.items() if m == "ocr"]
if ocr_pages and OCR_AVAILABLE:
images = convert_from_path(str(input_path), dpi=dpi, poppler_path=poppler_path)
total = 0
with pdfplumber.open(input_path) as plumb_pdf:
for page_num, plumb_page in enumerate(plumb_pdf.pages, start=1):
@ -1365,17 +1247,8 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
else:
writer.add_page(reader_page)
elif method == "ocr" and OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped redact — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
writer.add_page(reader_page)
continue
_imgs = convert_from_path(
str(input_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
img = _imgs[0]
del _imgs
elif method == "ocr" and images is not None:
img = images[page_num - 1]
bboxes = (find_pii_image_bboxes(img, lang, use_ner=use_ner)
if use_ner else find_cpr_image_bboxes(img, lang))
if bboxes:
@ -1387,7 +1260,6 @@ def redact_pdf(input_path: Path, output_path: Path, results: dict,
total += len(bboxes)
else:
writer.add_page(reader_page)
del img
else:
writer.add_page(reader_page)
@ -2176,31 +2048,30 @@ def scan_pdf(pdf_path: Path, force_ocr=False, lang="dan+eng",
results = {"cprs": [], "dates": [], "page_methods": {}}
with pdfplumber.open(pdf_path) as pdf:
images = None
if OCR_AVAILABLE:
needs_ocr = (list(range(len(pdf.pages))) if force_ocr
else [i for i, p in enumerate(pdf.pages) if not is_text_page(p)])
if needs_ocr:
print(f" Rendering pages to images for OCR (DPI={dpi})...", flush=True)
images = convert_from_path(str(pdf_path), dpi=dpi, poppler_path=poppler_path)
for page_num, page in enumerate(pdf.pages, start=1):
use_text = not force_ocr and is_text_page(page)
if use_text:
method = "text"
text = page.extract_text() or ""
cprs, dates = extract_matches(text, page_num, "text")
elif OCR_AVAILABLE:
if not _ocr_mem_ok():
print(f" Page {page_num}: skipped — less than {_OCR_MEM_THRESHOLD_MB} MB RAM available.", flush=True)
method = "skipped"
cprs, dates = [], []
else:
print(f" Rendering page {page_num} for OCR (DPI={dpi})...", flush=True)
_imgs = convert_from_path(
str(pdf_path), dpi=dpi, poppler_path=poppler_path,
first_page=page_num, last_page=page_num,
)
_img = _imgs[0]
del _imgs
method = "ocr"
cprs, dates = extract_matches(ocr_page_cached(_img, lang), page_num, "ocr")
del _img
elif OCR_AVAILABLE and images is not None:
method = "ocr"
_img = images[page_num-1]
images[page_num-1] = None # release PIL image as soon as OCR is done
cprs, dates = extract_matches(ocr_page_cached(_img, lang), page_num, "ocr")
del _img
else:
method = "skipped"
print(f" Page {page_num}: image-based but OCR unavailable.")
if not OCR_AVAILABLE:
print(f" Page {page_num}: image-based but OCR unavailable.")
cprs, dates = [], []
results["page_methods"][page_num] = method

View File

@ -24,8 +24,6 @@ import hashlib
from pathlib import Path, PurePosixPath
from typing import Iterator
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
# ── Optional dependency flags ─────────────────────────────────────────────────
try:
@ -60,8 +58,19 @@ except ImportError:
KEYCHAIN_SERVICE = "gdpr-scanner-nas"
# DEFAULT_EXTENSIONS is imported from cpr_detector.SUPPORTED_EXTS — single source of truth.
# Adding a new file type to cpr_detector.py automatically extends local/SMB scans too.
# File extensions passed through to _scan_bytes(). Matches SUPPORTED_EXTS in
# gdpr_scanner.py; kept here too so FileScanner can filter without importing it.
DEFAULT_EXTENSIONS = {
".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
".txt", ".eml", ".msg",
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
".heic", ".heif",
}
# Extensions for local/SMB file scans — PDFs now included; OCR runs in a spawned
# subprocess with a 60-second hard timeout via _scan_bytes_timeout so hanging
# Tesseract/Poppler processes can never block the scan thread indefinitely.
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
# Maximum file size to load into memory (bytes). Files larger than this are
# skipped with a warning — same guard used by the M365 attachment scanner.
@ -138,7 +147,7 @@ def store_smb_password(smb_host: str, smb_user: str,
class FileScanner:
"""Unified local + SMB/CIFS file iterator."""
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
FILE_SCAN_EXTENSIONS = FILE_SCAN_EXTENSIONS # excludes .pdf
"""Unified iterator over local paths and SMB/CIFS network shares.
Usage::
@ -200,7 +209,7 @@ class FileScanner:
Args:
extensions: Set of lowercase extensions to include, e.g. {".pdf", ".docx"}.
Defaults to DEFAULT_EXTENSIONS (cpr_detector.SUPPORTED_EXTS).
Defaults to DEFAULT_EXTENSIONS.
progress_cb: Optional callable(rel_path) called before each file is read,
so the caller can update a progress indicator.
@ -551,68 +560,6 @@ def _smb_read_file(tree, smb_path: str) -> bytes:
fh.close(get_attributes=False)
def write_smb_file(smb_path_uri: str, content: bytes,
username: str, password: str, domain: str = "") -> None:
"""Overwrite an SMB file at smb_path_uri (e.g. '//host/share/folder/file.docx').
Raises RuntimeError if smbprotocol is not installed.
Raises ValueError if the path cannot be parsed.
All SMB errors propagate as-is.
"""
if not SMB_OK:
raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
norm = smb_path_uri.replace("\\", "/").lstrip("/")
parts = norm.split("/", 2)
if len(parts) < 2:
raise ValueError(f"Cannot parse SMB path '{smb_path_uri}' — expected //host/share[/path]")
host = parts[0]
share = parts[1]
file_rel = parts[2].replace("/", "\\") if len(parts) > 2 else ""
if not host or not share or not file_rel:
raise ValueError(f"Cannot parse SMB path '{smb_path_uri}'")
import uuid as _uuid
conn = Connection(_uuid.uuid4(), host, 445)
conn.connect(timeout=30)
try:
session = Session(conn, username=username, password=password,
require_encryption=False)
if domain:
session.username = f"{domain}\\{username}"
session.connect()
try:
tree = TreeConnect(session, f"\\\\{host}\\{share}")
tree.connect()
try:
fh = Open(tree, file_rel)
fh.create(
ImpersonationLevel.Impersonation,
FilePipePrinterAccessMask.FILE_WRITE_DATA |
FilePipePrinterAccessMask.FILE_WRITE_ATTRIBUTES,
FileAttributes.FILE_ATTRIBUTE_NORMAL,
ShareAccess.FILE_SHARE_NONE,
CreateDisposition.FILE_SUPERSEDE,
CreateOptions.FILE_NON_DIRECTORY_FILE,
)
try:
chunk_size = 1024 * 1024
offset = 0
while offset < len(content):
chunk = content[offset:offset + chunk_size]
fh.write(chunk, offset)
offset += len(chunk)
finally:
fh.close(get_attributes=False)
finally:
tree.disconnect()
finally:
session.disconnect()
finally:
conn.disconnect()
def _smb_ts(windows_ts: int) -> str:
"""Convert Windows FILETIME (100ns intervals since 1601-01-01) to YYYY-MM-DD."""
if not windows_ts:
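
The path handling at the top of `write_smb_file` above can be illustrated in isolation. This is a sketch under stated assumptions: `split_smb_uri` is a hypothetical helper name, and it only reproduces the normalise-and-split steps shown in the hunk, without opening any SMB connection.

```python
def split_smb_uri(uri: str) -> tuple[str, str, str]:
    """Split '//host/share/path/file' into (host, share, backslash_path).

    Hypothetical helper: follows the same steps as write_smb_file.
    """
    # Accept both UNC backslashes and forward slashes.
    norm = uri.replace("\\", "/").lstrip("/")
    parts = norm.split("/", 2)
    # A writable target needs a host, a share, and a file path.
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"Cannot parse SMB path '{uri}'")
    host, share, rel = parts
    # SMB opens expect backslash-separated paths relative to the share.
    return host, share, rel.replace("/", "\\")
```

For example, `split_smb_uri("//nas/share/folder/file.docx")` yields the host `nas`, the share `share`, and the relative path `folder\file.docx`.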

View File

@ -6,7 +6,7 @@ Stores scan results alongside the existing JSON cache. Neither replaces the
other: JSON is fast and portable, SQLite enables querying, trending, and the
data-subject index.
Database location: ~/.gdprscanner/scanner.db (configurable via DB_PATH)
Database location: ~/.gdpr_scanner.db (configurable via DB_PATH)
Schema
------
@ -29,14 +29,11 @@ Usage (from gdpr_scanner.py)
import hashlib
import json
import logging
import sqlite3
import time
from pathlib import Path
from typing import Iterator
logger = logging.getLogger(__name__)
from pathlib import Path as _P
_DATA_DIR = _P.home() / ".gdprscanner"
_DATA_DIR.mkdir(exist_ok=True)
@ -183,17 +180,6 @@ CREATE INDEX IF NOT EXISTS idx_dellog_time ON deletion_log(deleted_at);
CREATE INDEX IF NOT EXISTS idx_dellog_item ON deletion_log(item_id);
CREATE INDEX IF NOT EXISTS idx_dellog_reason ON deletion_log(reason);
CREATE TABLE IF NOT EXISTS audit_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts REAL NOT NULL,
action TEXT NOT NULL DEFAULT '',
actor TEXT NOT NULL DEFAULT '',
detail TEXT NOT NULL DEFAULT '',
ip TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_audit_ts ON audit_log(ts);
CREATE INDEX IF NOT EXISTS idx_audit_action ON audit_log(action);
-- Indexes
CREATE INDEX IF NOT EXISTS idx_items_scan ON flagged_items(scan_id);
CREATE INDEX IF NOT EXISTS idx_items_source ON flagged_items(source_type);
@ -214,9 +200,6 @@ _MIGRATIONS: list[tuple[int, str]] = [
(4, "ALTER TABLE flagged_items ADD COLUMN face_count INTEGER NOT NULL DEFAULT 0"),
(5, "ALTER TABLE flagged_items ADD COLUMN exif_json TEXT NOT NULL DEFAULT '{}'"),
(6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
(8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
(9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
(10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
(7, """CREATE TABLE IF NOT EXISTS schedule_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at REAL NOT NULL,
@ -228,7 +211,6 @@ _MIGRATIONS: list[tuple[int, str]] = [
emailed INTEGER NOT NULL DEFAULT 0,
error TEXT NOT NULL DEFAULT ''
)"""),
(11, "ALTER TABLE flagged_items ADD COLUMN account_name TEXT NOT NULL DEFAULT ''"),
]
@ -329,9 +311,8 @@ class ScanDB:
(id, scan_id, name, source, source_type, account_id, folder,
url, drive_id, size_kb, modified, cpr_count, risk,
thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
special_category, face_count, exif_json, full_path,
email_count, phone_count, body_excerpt, account_name, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
special_category, face_count, exif_json, full_path, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
card.get("id", ""),
scan_id,
@ -355,10 +336,6 @@ class ScanDB:
card.get("face_count", 0),
json.dumps(card.get("exif", {})),
card.get("full_path", ""),
card.get("email_count", 0),
card.get("phone_count", 0),
card.get("body_excerpt", ""),
card.get("account_name", ""),
now,
),
)
@ -437,33 +414,6 @@ class ScanDB:
c.commit()
def finalize_orphan_scans(self) -> int:
"""Finalise scans left unfinished by a crash, kill, or mid-scan restart.
After a fresh process start nothing is scanning, so any scan still
carrying finished_at IS NULL is dead: the process that owned it is gone.
Its already-saved flagged_items were stranded: both get_session_items
and get_open_items require finished_at, so those items are invisible and
effectively lost. Finalising the orphans on startup makes them show up
and prevents permanent data loss from interrupted scans (the M365 and
Google engines return early on abort and never reach finish_scan; only
the file scan finalises in a finally block).
Safe to call only when no scan is running (i.e. at startup). Returns the
number of scans finalised.
"""
rows = self._connect().execute(
"SELECT id, total_scanned FROM scans WHERE finished_at IS NULL"
).fetchall()
count = 0
for sid, total in rows:
try:
self.finish_scan(sid, total or 0)
count += 1
except Exception as e:
logger.warning("[db] finalize_orphan_scans: scan %s failed: %s", sid, e)
return count
# ── Query helpers ─────────────────────────────────────────────────────────
def latest_scan_id(self) -> int | None:
@ -492,172 +442,34 @@ class ScanDB:
result.append(d)
return result
def get_sessions(self, limit: int = 50, window_seconds: int = 300) -> list[dict]:
"""Return scan sessions (groups of concurrent scans) newest-first.
Concurrent M365 + Google + File scans each get their own scan_id but start
within seconds of each other. This method groups them into logical sessions
by the same 300-second window used by get_session_items().
"""
rows = self._connect().execute(
"""SELECT id, started_at, finished_at, sources, flagged_count, total_scanned, delta
FROM scans WHERE finished_at IS NOT NULL ORDER BY started_at ASC"""
).fetchall()
# Group consecutive scans started within window_seconds of each other
groups: list[list[dict]] = []
for r in rows:
d = dict(r)
d["sources"] = json.loads(d.get("sources") or "[]")
if groups and d["started_at"] - groups[-1][0]["started_at"] <= window_seconds:
groups[-1].append(d)
else:
groups.append([d])
# Build session summaries newest-first
sessions: list[dict] = []
for grp in reversed(groups):
ref = grp[-1] # highest scan_id in group (last in ASC order)
sessions.append({
"ref_scan_id": ref["id"],
"started_at": grp[0]["started_at"],
"finished_at": ref.get("finished_at"),
"sources": list({s for g in grp for s in g["sources"]}),
"flagged_count": sum(g["flagged_count"] or 0 for g in grp),
"total_scanned": sum(g["total_scanned"] or 0 for g in grp),
"delta": any(bool(g["delta"]) for g in grp),
})
if len(sessions) >= limit:
break
return sessions
def get_session_items(self, window_seconds: int = 300,
ref_scan_id: int | None = None) -> list[dict]:
def get_session_items(self, window_seconds: int = 300) -> list[dict]:
"""Return flagged items from all scans in the same session as the latest scan.
A session is all scans whose started_at is within *window_seconds* of the
most recently started completed scan. This captures concurrent M365, Google,
and file scans which each create their own scan_id but start within seconds
of each other.
If *ref_scan_id* is given, the session is anchored to that scan's started_at
instead of the latest scan.
"""
if ref_scan_id:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
).fetchone()
else:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
if not row:
return []
latest_start = row[0]
rows = self._connect().execute(
"""SELECT fi.*, COALESCE(d.status, 'unreviewed') AS disposition
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.started_at BETWEEN ? AND ? AND s.finished_at IS NOT NULL
ORDER BY fi.cpr_count DESC""",
(latest_start - window_seconds, latest_start + window_seconds),
).fetchall()
result = []
for r in rows:
d = dict(r)
d["attachments"] = json.loads(d.get("attachments") or "[]")
result.append(d)
return result
def get_open_items(self) -> list[dict]:
"""Return every flagged item across all scans that has no action taken.
"Open" means the item has no disposition row (or a row whose status is
still 'unreviewed'). Unlike get_session_items this is NOT limited to the
latest scan window; it surfaces all outstanding items so nothing slips
out of view once a newer scan starts a fresh session.
flagged_items has a composite PK of (id, scan_id), so the same logical
item appears once per scan that flagged it. We deduplicate by id, keeping
the row from the most recent finished scan, so each open item shows once.
"""
rows = self._connect().execute(
"""SELECT fi.*, COALESCE(d.status, 'unreviewed') AS disposition
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.finished_at IS NOT NULL
AND (d.item_id IS NULL OR d.status = 'unreviewed')
AND fi.scan_id = (
SELECT MAX(fi2.scan_id)
FROM flagged_items fi2
JOIN scans s2 ON fi2.scan_id = s2.id
WHERE fi2.id = fi.id AND s2.finished_at IS NOT NULL
)
ORDER BY fi.cpr_count DESC""",
).fetchall()
result = []
for r in rows:
d = dict(r)
d["attachments"] = json.loads(d.get("attachments") or "[]")
result.append(d)
return result
def get_related_items(self, item_id: str, ref_scan_id: int | None = None,
window_seconds: int = 300) -> list[dict]:
"""Return flagged items from the same session that share at least one CPR
hash with *item_id*, ordered by number of shared CPRs descending."""
if ref_scan_id:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
).fetchone()
else:
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
if not row:
return []
latest_start = row[0]
rows = self._connect().execute(
"""SELECT fi.*, COUNT(DISTINCT ci2.cpr_hash) AS shared_cprs
FROM cpr_index ci1
JOIN cpr_index ci2 ON ci2.cpr_hash = ci1.cpr_hash
JOIN flagged_items fi ON fi.id = ci2.item_id
JOIN scans s ON fi.scan_id = s.id
WHERE ci1.item_id = ?
AND fi.id != ?
AND s.started_at BETWEEN ? AND ?
AND s.finished_at IS NOT NULL
GROUP BY fi.id
ORDER BY shared_cprs DESC, fi.cpr_count DESC""",
(item_id, item_id, latest_start - window_seconds, latest_start + window_seconds),
).fetchall()
return [dict(r) for r in rows]
def get_session_sources(self, window_seconds: int = 300) -> set:
"""Return the union of all source keys scanned in the current session.
Reads the ``sources`` JSON array stored in each scan record that belongs
to the same session as the latest completed scan. This is used by the
export builders so they can show every scanned source in summary tables
even when a source produced zero flagged items.
"""
row = self._connect().execute(
"SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
).fetchone()
if not row:
return set()
return []
latest_start = row[0]
rows = self._connect().execute(
"""SELECT sources FROM scans
WHERE started_at >= ? AND finished_at IS NOT NULL""",
"""SELECT fi.*, COALESCE(d.status, 'unreviewed') AS disposition
FROM flagged_items fi
JOIN scans s ON fi.scan_id = s.id
LEFT JOIN dispositions d ON d.item_id = fi.id
WHERE s.started_at >= ? AND s.finished_at IS NOT NULL
ORDER BY fi.cpr_count DESC""",
(latest_start - window_seconds,),
).fetchall()
result: set = set()
result = []
for r in rows:
try:
result.update(json.loads(r[0] or "[]"))
except Exception:
pass
d = dict(r)
d["attachments"] = json.loads(d.get("attachments") or "[]")
result.append(d)
return result
def lookup_data_subject(self, cpr: str) -> list[dict]:
@ -886,34 +698,6 @@ class ScanDB:
).fetchone()[0] or 0
return {"total": total, "by_reason": by_reason, "cpr_hits_deleted": cpr_deleted}
# ── Compliance audit log ──────────────────────────────────────────────────
def log_audit(self, action: str, detail: str = "",
actor: str = "", ip: str = "") -> None:
"""Write an immutable compliance audit record."""
c = self._connect()
c.execute(
"INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?,?,?,?,?)",
(time.time(), action, actor, detail, ip),
)
c.commit()
def get_audit_log(self, limit: int = 200,
action: str | None = None) -> list[dict]:
"""Return audit records, most recent first."""
c = self._connect()
if action:
rows = c.execute(
"SELECT * FROM audit_log WHERE action=? ORDER BY ts DESC LIMIT ?",
(action, limit),
).fetchall()
else:
rows = c.execute(
"SELECT * FROM audit_log ORDER BY ts DESC LIMIT ?",
(limit,),
).fetchall()
return [dict(r) for r in rows]
def delete_item_record(self, item_id: str, scan_id: int | None = None) -> None:
"""Remove a flagged item from the DB (after it has been deleted in M365)."""
c = self._connect()
@ -1162,15 +946,6 @@ class ScanDB:
_db: ScanDB | None = None
def log_audit_event(action: str, detail: str = "",
actor: str = "", ip: str = "") -> None:
"""Write an audit record to the shared DB. Silently no-ops if DB unavailable."""
try:
get_db().log_audit(action, detail, actor=actor, ip=ip)
except Exception:
pass
def get_db(path: Path = DB_PATH) -> ScanDB:
"""Return the module-level ScanDB singleton, creating it if needed."""
global _db
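
The session grouping described in `get_sessions` above can be sketched without a database. The function name `group_sessions` and the plain-dict scan records are assumptions for illustration; as in the hunk, each scan is compared against the first scan of the current group, not the previous scan.

```python
def group_sessions(scans: list[dict], window_seconds: int = 300) -> list[list[dict]]:
    """Group scans whose started_at lies within window_seconds of the
    first scan in the current group (hypothetical helper)."""
    groups: list[list[dict]] = []
    for scan in sorted(scans, key=lambda s: s["started_at"]):
        anchored = groups and scan["started_at"] - groups[-1][0]["started_at"] <= window_seconds
        if anchored:
            groups[-1].append(scan)   # same logical session
        else:
            groups.append([scan])     # start a new session
    return groups
```

Anchoring on the group's first scan keeps a long chain of closely spaced scans from merging into one unbounded session.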

View File

@ -146,7 +146,7 @@ _migrate_to_data_dir()
# ── Flask ─────────────────────────────────────────────────────────────────────
try:
from flask import Flask, Response, jsonify, redirect, render_template, request, session
from flask import Flask, Response, jsonify, render_template, request, session
except ImportError:
print("Flask required: pip install flask")
sys.exit(1)
@ -251,7 +251,7 @@ from app_config import (
from checkpoint import (
_checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _save_delta_tokens,
_cp_path, _DELTA_PATH,
_CHECKPOINT_PATH, _DELTA_PATH,
)
from sse import broadcast, _sse_queues, _sse_buffer
@ -260,8 +260,8 @@ import sse as _sse_mod # for _current_scan_id access at call time
from cpr_detector import (
_scan_bytes, _scan_bytes_timeout, _scan_text_direct, _html_esc, _get_pii_counts,
_make_thumb, _placeholder_svg,
_extract_exif, _extract_video_metadata, _extract_audio_metadata, _detect_photo_faces,
SUPPORTED_EXTS, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS,
_extract_exif, _detect_photo_faces,
SUPPORTED_EXTS, PHOTO_EXTS,
_EXIF_PII_TAGS,
)
# Inject runtime deps into cpr_detector
@ -285,16 +285,12 @@ _se.FILE_SCANNER_OK = FILE_SCANNER_OK
_se.CONNECTOR_OK = CONNECTOR_OK
_se.DB_OK = DB_OK
_se.PHOTO_EXTS = PHOTO_EXTS
_se.VIDEO_EXTS = VIDEO_EXTS
_se.AUDIO_EXTS = AUDIO_EXTS
_se.SUPPORTED_EXTS = SUPPORTED_EXTS
# cpr helpers
_se._scan_bytes = _scan_bytes
_se._scan_bytes_timeout = _scan_bytes_timeout
_se._detect_photo_faces = _detect_photo_faces
_se._extract_exif = _extract_exif
_se._extract_video_metadata = _extract_video_metadata
_se._extract_audio_metadata = _extract_audio_metadata
_se._make_thumb = _make_thumb
_se._placeholder_svg = _placeholder_svg
_se._check_special_category = _check_special_category
@ -317,11 +313,6 @@ app = Flask(__name__,
template_folder=_os.path.join(_BASE_DIR, "templates"),
static_folder=_os.path.join(_BASE_DIR, "static"))
# Static files must revalidate on every load (cheap 304s via ETag). Without
# this there is no Cache-Control header and browsers cache JS/CSS heuristically
# for days — after a self-update the backend is new but the UI stays stale.
app.config["SEND_FILE_MAX_AGE_DEFAULT"] = 0
# Session secret — derived from machine_id so it survives restarts without a separate file.
# machine_id is also the Fernet key (base64-encoded 32 bytes); we use its raw bytes as the secret.
try:
@ -377,72 +368,7 @@ def _sync_state():
# JavaScript served from static/app.js via Flask static file handling.
# ── Interface PIN auth ────────────────────────────────────────────────────────
_iface_pin_attempts: dict[str, list[float]] = {}
_IFACE_MAX_ATTEMPTS = 5
_IFACE_WINDOW_S = 300
def _iface_rate_limited(ip: str) -> bool:
now = time.time()
times = [t for t in _iface_pin_attempts.get(ip, []) if now - t < _IFACE_WINDOW_S]
_iface_pin_attempts[ip] = times
return len(times) >= _IFACE_MAX_ATTEMPTS
@app.before_request
def _require_interface_pin():
from app_config import get_interface_pin_hash
if not get_interface_pin_hash():
return # feature disabled — open access
path = request.path
# Always-exempt paths
if (path.startswith("/static/")
or path in ("/login", "/view", "/manual", "/favicon.ico")
or path == "/api/interface/pin/verify"
or path == "/api/viewer/pin/verify"):
return
# Authenticated sessions (interface or viewer) pass through
if session.get("interface_ok") or session.get("viewer_ok"):
return
if path.startswith("/api/"):
return jsonify({"error": "authentication required"}), 401
return redirect("/login")
@app.route("/login")
def login_page():
from app_config import get_interface_pin_hash
if not get_interface_pin_hash():
return redirect("/")
if session.get("interface_ok"):
return redirect("/")
return render_template("interface_login.html", LANG=LANG)
@app.route("/api/interface/pin/verify", methods=["POST"])
def interface_pin_verify():
from app_config import verify_interface_pin
ip = request.remote_addr or "unknown"
if _iface_rate_limited(ip):
return jsonify({"error": "Too many failed attempts. Try again later."}), 429
body = request.get_json(silent=True) or {}
pin = str(body.get("pin", "")).strip()
if not verify_interface_pin(pin):
_iface_pin_attempts.setdefault(ip, []).append(time.time())
return jsonify({"error": "Incorrect PIN"}), 401
_iface_pin_attempts.pop(ip, None)
session["interface_ok"] = True
return jsonify({"ok": True})
@app.route("/api/interface/logout", methods=["POST"])
def interface_logout():
session.pop("interface_ok", None)
return jsonify({"ok": True})
# ── Auth state ─────────────────────────────────────────────────────────────────
# ── Routes ────────────────────────────────────────────────────────────────────
@app.route("/")
@ -457,21 +383,17 @@ def viewer():
from app_config import validate_viewer_token, get_viewer_pin_hash
token = request.args.get("token", "").strip()
if token:
entry = validate_viewer_token(token)
if entry is None:
if validate_viewer_token(token) is None:
return render_template("viewer_denied.html"), 403
# Bind a session so the viewer doesn't need the token on every navigation
session["viewer_ok"] = True
session["viewer_scope"] = entry.get("scope", {})
session["viewer_ok"] = True
return render_template("index.html", app_version=APP_VERSION,
lang_json=json.dumps(LANG, ensure_ascii=False),
viewer_mode=True,
viewer_scope=json.dumps(entry.get("scope", {}), ensure_ascii=False))
viewer_mode=True)
if session.get("viewer_ok"):
return render_template("index.html", app_version=APP_VERSION,
lang_json=json.dumps(LANG, ensure_ascii=False),
viewer_mode=True,
viewer_scope=json.dumps(session.get("viewer_scope", {}), ensure_ascii=False))
viewer_mode=True)
# No token, no session — show PIN form if a PIN is configured, else deny
pin_hash = get_viewer_pin_hash()
if pin_hash:
@ -1577,11 +1499,10 @@ from routes.scheduler import bp as scheduler_bp
from routes.google_auth import bp as google_auth_bp
from routes.google_scan import bp as google_scan_bp
from routes.viewer import bp as viewer_bp
from routes.updates import bp as updates_bp
for _bp in [auth_bp, users_bp, scan_bp, sources_bp, profiles_bp,
email_bp, database_bp, export_bp, app_routes_bp, scheduler_bp,
google_auth_bp, google_scan_bp, viewer_bp, updates_bp]:
google_auth_bp, google_scan_bp, viewer_bp]:
app.register_blueprint(_bp)
# ── Entry point ───────────────────────────────────────────────────────────────
@ -1598,10 +1519,10 @@ Headless (scheduled) usage:
environment variables: M365_CLIENT_ID, M365_TENANT_ID, M365_CLIENT_SECRET
or a settings JSON: --settings /path/to/settings.json
Scan options are loaded from ~/.gdprscanner/settings.json (saved automatically
Scan options are loaded from ~/.gdpr_scanner_settings.json (saved automatically
after any interactive scan), or overridden in the --settings file.
SMTP config is loaded from ~/.gdprscanner/smtp.json (saved in the UI) or from
SMTP config is loaded from ~/.gdpr_scanner_smtp.json (saved in the UI) or from
an 'smtp' key in the --settings file.
Example cron (weekly, Mondays at 06:00):
@ -1636,7 +1557,7 @@ Example --settings file with SMTP:
parser.add_argument("--output", default=".",
help="Output directory for Excel export in headless mode (default: .)")
parser.add_argument("--settings", default=None,
help="Path to a JSON settings file (overrides ~/.gdprscanner/settings.json)")
help="Path to a JSON settings file (overrides ~/.gdpr_scanner_settings.json)")
parser.add_argument("--email-to", default=None,
help="Comma-separated recipient addresses — send Excel report by email (headless only)")
parser.add_argument("--retention-years", type=int, default=None,
@ -1644,7 +1565,7 @@ Example --settings file with SMTP:
parser.add_argument("--fiscal-year-end", default=None,
help="Fiscal year end as MM-DD for retention cutoff (e.g. 12-31 for Bogforingsloven). Omit for rolling window.")
parser.add_argument("--reset-db", action="store_true",
help="Reset the results database (~/.gdprscanner/scanner.db) — permanently deletes all scan history, "
help="Reset the results database (~/.gdpr_scanner.db) — permanently deletes all scan history, "
"dispositions, and deletion log. Prompts for confirmation unless --yes is also passed.")
parser.add_argument("--yes", action="store_true",
help="Skip confirmation prompts (use with --reset-db for scripted resets)")
@ -1848,7 +1769,7 @@ Example --settings file with SMTP:
(_SETTINGS_PATH, "Headless scan settings"),
(_ROLE_OVERRIDES_PATH, "Manual role overrides"),
(_FILE_SOURCES_PATH, "File source definitions"),
(_cp_path("m365"), "Scan checkpoint (resume state)"),
(_CHECKPOINT_PATH, "Scan checkpoint (resume state)"),
(_DELTA_PATH, "Delta scan tokens"),
(_LANG_OVERRIDE_FILE, "Language preference"),
(Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@ -1935,12 +1856,10 @@ Example --settings file with SMTP:
print(" ✖ m365_db not available — cannot reset")
_sys.exit(1)
# Also clear all checkpoints so the UI starts with no cached results
from pathlib import Path as _Path
for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
try: _cpf.unlink()
except Exception: pass
print(f" ✔ Checkpoints cleared")
# Also clear the JSON checkpoint so the UI starts with no cached results
_clear_checkpoint()
if not _CHECKPOINT_PATH.exists():
print(f" ✔ Checkpoint cleared")
# Clear delta tokens too — stale after a full DB reset
if _DELTA_PATH.exists():
@ -2149,7 +2068,7 @@ Example --settings file with SMTP:
email_to = getattr(args, "email_to", None)
if email_to:
recipients = [r.strip() for r in email_to.replace(";", ",").split(",") if r.strip()]
# SMTP config: --settings file takes priority, then saved ~/.gdprscanner/smtp.json
# SMTP config: --settings file takes priority, then saved ~/.gdpr_scanner_smtp.json
smtp_cfg = _load_smtp_config()
if cfg.get("smtp"):
smtp_cfg = {**smtp_cfg, **cfg["smtp"]}
@ -2266,33 +2185,14 @@ Example --settings file with SMTP:
# Find a free port — auto-increment from the requested port if in use.
import socket as _socket
def _can_bind(p: int, host: str) -> bool:
with _socket.socket(_socket.AF_INET, _socket.SOCK_STREAM) as s:
# Probe with SO_REUSEADDR, matching how Werkzeug binds.
# Without it, connections left in TIME_WAIT by a previous
# instance (e.g. the in-app update restart) make the port
# look occupied and the app silently moves to the next one.
s.setsockopt(_socket.SOL_SOCKET, _socket.SO_REUSEADDR, 1)
try:
s.bind((host, p))
return True
except OSError:
return False
def _find_free_port(start: int, host: str) -> int:
# Give the requested port a grace period — after a self-restart
# the previous process may not have released it yet.
deadline = time.time() + 10
while True:
if _can_bind(start, host):
return start
if time.time() >= deadline:
break
time.sleep(0.5)
for p in range(start + 1, start + 100):
if _can_bind(p, host):
return p
for p in range(start, start + 100):
with _socket.socket(_socket.AF_INET, _socket.SOCK_STREAM) as s:
try:
s.bind((host, p))
return p
except OSError:
continue
raise RuntimeError(f"No free port found in range {start}-{start + 99}")
actual_port = _find_free_port(args.port, args.host)
@ -2305,19 +2205,6 @@ Example --settings file with SMTP:
print(f"\n GDPRScanner\n ──────────────────────────────")
print(f" Open: http://{args.host}:{args.port}")
# Recover scans left unfinished by a crash / kill / mid-scan restart.
# Nothing is scanning at startup, so any scan with finished_at IS NULL is
# dead; finalising it makes its already-saved items visible again instead
# of stranding them (both get_session_items and get_open_items require a
# finished scan). Must run before the scheduler can start a new scan.
try:
if DB_OK:
_recovered = _get_db().finalize_orphan_scans()
if _recovered:
print(f" Recovered {_recovered} unfinished scan(s) from a prior restart")
except Exception as _orphan_err:
print(f" Orphan-scan recovery: failed ({_orphan_err})")
# Start in-process scheduler (#19)
try:
import scan_scheduler as _sched_mod
@ -2334,14 +2221,5 @@ Example --settings file with SMTP:
except Exception as _sched_err:
print(f" Scheduler: failed to start ({_sched_err})")
# Auto-update background thread (Settings → General → Software update)
try:
from routes.updates import start_auto_update_thread
from app_config import get_update_config as _get_upd_cfg
if start_auto_update_thread() and _get_upd_cfg().get("auto_update"):
print(" Auto-update: enabled (checked daily)")
except Exception as _upd_err:
print(f" Auto-update: failed to start ({_upd_err})")
print(f" Press Ctrl+C to stop\n")
app.run(host=args.host, port=args.port, debug=False, threaded=True)
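
The port-probing hunk above contrasts a plain bind test with one that sets `SO_REUSEADDR` before binding. A minimal standalone sketch of the second variant follows; the names `can_bind` and `find_free_port` and the `span` parameter are illustrative, and the 10-second grace period from the hunk is omitted for brevity.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Probe with SO_REUSEADDR so lingering TIME_WAIT connections
        # from a previous instance do not make the port look occupied.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def find_free_port(start: int, host: str = "127.0.0.1", span: int = 100) -> int:
    """Return the first bindable port in [start, start + span)."""
    for port in range(start, start + span):
        if can_bind(port, host):
            return port
    raise RuntimeError(f"No free port found in range {start}-{start + span - 1}")
```

The probe socket is closed before the server binds, so another process could still take the port in between; the sketch only narrows that window, it does not remove it.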

View File

@ -70,9 +70,6 @@ GMAIL_SCOPES = [
DRIVE_SCOPES = [
"https://www.googleapis.com/auth/drive.readonly",
]
DRIVE_WRITE_SCOPES = [
"https://www.googleapis.com/auth/drive",
]
ADMIN_SCOPES = [
"https://www.googleapis.com/auth/admin.directory.user.readonly",
]
@ -263,50 +260,6 @@ class GoogleConnector:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
yield from _drive_iter(service, user_email, max_files, max_file_mb)
def get_drive_start_token(self, user_email: str) -> str:
"""Return the current Changes API start page token for user's Drive."""
try:
creds = self._creds_for(user_email, DRIVE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
return _drive_get_start_page_token(service)
def get_drive_changes(
self,
user_email: str,
page_token: str,
max_files: int = 5000,
max_file_mb: float = 50.0,
) -> "tuple[list[tuple[dict, bytes]], str]":
"""Return (changed_files, new_page_token) since page_token."""
try:
creds = self._creds_for(user_email, DRIVE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
except HttpError as e:
raise GoogleError(f"Drive auth failed for {user_email}: {e}") from e
return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)
# ── Drive write-back (redaction) ──────────────────────────────────────────
def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
"""Return the mimeType of a Drive file."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _get_drive_file_mime(service, file_id)
def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
"""Download raw bytes of a non-Google-native Drive file by ID."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
return _download_drive_file_by_id(service, file_id)
def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
"""Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
creds = self._creds_for(user_email, DRIVE_WRITE_SCOPES)
service = build("drive", "v3", credentials=creds, cache_discovery=False)
_update_drive_file_content(service, file_id, content, mime_type)
# ── Persistence helpers ───────────────────────────────────────────────────────
@ -459,101 +412,6 @@ def _gmail_iter(
yield (att_meta, data)
def _download_drive_file(
    service,
    f: dict,
    user_email: str,
    max_bytes: int,
) -> "tuple[dict, bytes] | None":
    """Download one Drive file entry. Returns (meta, data) or None if skipped."""
    mime = f.get("mimeType", "")
    fid = f.get("id", "")
    fname = f.get("name", "")
    size = int(f.get("size", 0) or 0)
    meta = {
        "id": f"gdrive:{fid}",
        "name": fname,
        "_source": "gdrive",
        "_source_type": "gdrive",
        "_account": user_email,
        "_account_id": user_email,
        "_url": f.get("webViewLink", ""),
        "lastModifiedDateTime": f.get("modifiedTime", "")[:10],
        "size": size,
    }
    if mime in _EXPORT_MAP:
        export_mime, ext = _EXPORT_MAP[mime]
        try:
            req = service.files().export_media(fileId=fid, mimeType=export_mime)
            buf = io.BytesIO()
            dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
            done = False
            total = 0
            while not done:
                _, done = dl.next_chunk()
                total = buf.tell()
                if total > _MAX_EXPORT_BYTES:
                    break
            if total > _MAX_EXPORT_BYTES:
                return None
            meta["name"] = fname + ext
            meta["size"] = total
            data = buf.getvalue()
            del buf
            return (meta, data)
        except HttpError as e:
            if "exportSizeLimitExceeded" in str(e):
                print(
                    f"[gdrive] skip '{fname}' — file too large for Google export API"
                    f" (exportSizeLimitExceeded); fid={fid}",
                    flush=True,
                )
            return None
    else:
        if mime.startswith("application/vnd.google-apps."):
            return None
        if size == 0 or size > max_bytes:
            return None
        try:
            req = service.files().get_media(fileId=fid)
            buf = io.BytesIO()
            dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
            done = False
            while not done:
                _, done = dl.next_chunk()
            data = buf.getvalue()
            del buf
            return (meta, data)
        except HttpError:
            return None


def _get_drive_file_mime(service, file_id: str) -> str:
    """Return the mimeType of a Drive file."""
    info = service.files().get(fileId=file_id, fields="mimeType").execute()
    return info.get("mimeType", "")


def _download_drive_file_by_id(service, file_id: str) -> bytes:
    """Download raw bytes of a non-Google-native Drive file by ID."""
    req = service.files().get_media(fileId=file_id)
    buf = io.BytesIO()
    dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
    done = False
    while not done:
        _, done = dl.next_chunk()
    return buf.getvalue()


def _update_drive_file_content(service, file_id: str, content: bytes, mime_type: str) -> None:
    """Replace a Drive file's content in-place."""
    from googleapiclient.http import MediaInMemoryUpload
    media = MediaInMemoryUpload(content, mimetype=mime_type, resumable=False)
    service.files().update(fileId=file_id, media_body=media).execute()


def _drive_iter(
    service,
    user_email: str,
@ -581,77 +439,74 @@ def _drive_iter(
        for f in resp.get("files", []):
            fetched += 1
            result = _download_drive_file(service, f, user_email, max_bytes)
            if result:
                yield result
            mime = f.get("mimeType", "")
            fid = f.get("id", "")
            fname = f.get("name", "")
            size = int(f.get("size", 0) or 0)
            meta = {
                "id": f"gdrive:{fid}",
                "name": fname,
                "_source": "gdrive",
                "_source_type": "gdrive",
                "_account": user_email,
                "_account_id": user_email,
                "_url": f.get("webViewLink", ""),
                "lastModifiedDateTime": f.get("modifiedTime", "")[:10],
                "size": size,
            }
            if mime in _EXPORT_MAP:
                export_mime, ext = _EXPORT_MAP[mime]
                try:
                    req = service.files().export_media(fileId=fid, mimeType=export_mime)
                    buf = io.BytesIO()
                    dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
                    done = False
                    total = 0
                    while not done:
                        status, done = dl.next_chunk()
                        total = buf.tell()
                        if total > _MAX_EXPORT_BYTES:
                            break
                    if total > _MAX_EXPORT_BYTES:
                        continue
                    meta["name"] = fname + ext
                    meta["size"] = total
                    data = buf.getvalue()
                    del buf
                    yield (meta, data)
                except HttpError as e:
                    if "exportSizeLimitExceeded" in str(e):
                        print(
                            f"[gdrive] skip '{fname}' — file too large for Google export API"
                            f" (exportSizeLimitExceeded); fid={fid}",
                            flush=True,
                        )
                    continue
            else:
                if mime.startswith("application/vnd.google-apps."):
                    continue  # other native formats we can't export — skip
                if size == 0 or size > max_bytes:
                    continue
                try:
                    req = service.files().get_media(fileId=fid)
                    buf = io.BytesIO()
                    dl = MediaIoBaseDownload(buf, req, chunksize=4 * 1024 * 1024)
                    done = False
                    while not done:
                        _, done = dl.next_chunk()
                    data = buf.getvalue()
                    del buf
                    yield (meta, data)
                except HttpError:
                    continue
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
def _drive_get_start_page_token(service) -> str:
    """Return the current Changes API start page token for this Drive."""
    resp = service.changes().getStartPageToken().execute()
    return resp["startPageToken"]


def _drive_changes_collect(
    service,
    user_email: str,
    page_token: str,
    max_files: int,
    max_file_mb: float,
) -> "tuple[list[tuple[dict, bytes]], str]":
    """
    Collect Drive changes since page_token using the Changes API.

    Returns (list_of_(meta, data)_tuples, new_start_page_token).
    Skips removed/trashed files.
    Raises GoogleError on API failure so the caller can fall back to a full scan.
    """
    max_bytes = int(max_file_mb * 1024 * 1024)
    fields = (
        "nextPageToken,newStartPageToken,"
        "changes(removed,file(id,name,mimeType,size,webViewLink,modifiedTime,owners,parents))"
    )
    results: list = []
    new_token = page_token
    fetched = 0
    while fetched < max_files:
        params: dict = {
            "pageToken": page_token,
            "spaces": "drive",
            "fields": fields,
            "includeRemoved": True,
            "pageSize": min(1000, max_files - fetched),
        }
        try:
            resp = service.changes().list(**params).execute()
        except HttpError as e:
            raise GoogleError(f"Drive changes error for {user_email}: {e}") from e
        for change in resp.get("changes", []):
            if change.get("removed"):
                continue
            f = change.get("file")
            if not f:
                continue
            fetched += 1
            result = _download_drive_file(service, f, user_email, max_bytes)
            if result:
                results.append(result)
        if "newStartPageToken" in resp:
            new_token = resp["newStartPageToken"]
            break
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return results, new_token
# ── Personal Google account (OAuth device-code) connector ────────────────────
class PersonalGoogleConnector:
@ -766,50 +621,6 @@ class PersonalGoogleConnector:
            raise GoogleError(f"Drive auth failed: {e}") from e
        yield from _drive_iter(service, user_email, max_files, max_file_mb)

    def get_drive_start_token(self, user_email: str) -> str:
        """Return the current Changes API start page token for this Drive."""
        self._refresh_if_needed()
        try:
            service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
        except HttpError as e:
            raise GoogleError(f"Drive auth failed: {e}") from e
        return _drive_get_start_page_token(service)

    def get_drive_changes(
        self,
        user_email: str,
        page_token: str,
        max_files: int = 5000,
        max_file_mb: float = 50.0,
    ) -> "tuple[list[tuple[dict, bytes]], str]":
        """Return (changed_files, new_page_token) since page_token."""
        self._refresh_if_needed()
        try:
            service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
        except HttpError as e:
            raise GoogleError(f"Drive auth failed: {e}") from e
        return _drive_changes_collect(service, user_email, page_token, max_files, max_file_mb)

    # ── Drive write-back (redaction) ──────────────────────────────────────────
    def get_drive_file_mime(self, user_email: str, file_id: str) -> str:
        """Return the mimeType of a Drive file."""
        self._refresh_if_needed()
        service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
        return _get_drive_file_mime(service, file_id)

    def download_drive_file_by_id(self, user_email: str, file_id: str) -> bytes:
        """Download raw bytes of a non-Google-native Drive file by ID."""
        self._refresh_if_needed()
        service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
        return _download_drive_file_by_id(service, file_id)

    def update_drive_file(self, user_email: str, file_id: str, content: bytes, mime_type: str) -> None:
        """Replace Drive file content in-place. Requires drive (not drive.readonly) scope."""
        self._refresh_if_needed()
        service = build("drive", "v3", credentials=self._creds, cache_discovery=False)
        _update_drive_file_content(service, file_id, content, mime_type)

    @staticmethod
    def get_device_code_flow(client_id: str, client_secret: str) -> dict:
        """

View File

@ -103,13 +103,6 @@
"lbl_time": "Tid",
"lbl_space": "Mellemrum",
"lbl_loading": "Indlæser…",
"history_lbl": "Historik",
"history_items": "fund",
"history_btn_sessions": "Sessioner",
"history_btn_latest": "Åbne fund",
"history_picker_empty": "Ingen tidligere scanninger",
"history_delta_badge": "Delta",
"history_latest_badge": "Seneste",
"lbl_blurred": "Sløret",
"lbl_none": "Ingen",
"lbl_scanner": "Scanner",
@ -348,9 +341,8 @@
"m365_resuming": "Genoptager — springer allerede skannede elementer over…",
"m365_opt_delta": "Delta-scanning",
"m365_opt_delta_hint": "Kun ændrede elementer (efter første fulde scanning)",
"m365_delta_tokens_saved": "Tokens gemt for {n} kilde(r)",
"m365_delta_tokens_saved": "Tokens gemt",
"m365_delta_clear": "Ryd tokens",
"m365_delta_tokens_hint": "Gemte ændringstokens gør, at delta-scanninger kun henter elementer ændret siden sidste scanning. Ryd tokens tvinger næste scanning til at være en fuld scanning.",
"m365_delta_cleared": "Delta-tokens ryddet — næste scanning bliver fuld scanning.",
"m365_delta_mode": "Delta-tilstand — henter kun ændrede elementer…",
"m365_smtp_title": "✉ Send rapport",
@ -365,8 +357,6 @@
"m365_smtp_recipients": "Modtagere",
"m365_smtp_recipients_hint": "Adskil med komma eller semikolon",
"m365_smtp_save": "Gem",
"m365_smtp_auto_email_manual": "Send rapport efter manuel scanning",
"m365_smtp_prefer_smtp": "Send altid via SMTP (spring Microsoft Graph over)",
"m365_smtp_send": "Send nu",
"m365_smtp_saved": "Indstillinger gemt.",
"m365_smtp_sending": "Sender…",
@ -561,32 +551,15 @@
"m365_db_import_mode": "Tilstand:",
"m365_db_import_merge": "Sammenflet (sikker)",
"m365_db_import_replace": "Erstat (fuld gendannelse)",
"m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdprscanner/scanner.db først.",
"m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdprscanner/scanner.db.\\n\\nFortsæt?",
"m365_db_import_replace_warn": "⚠ Erstatningstilstand sletter alle eksisterende scanningsdata inden gendannelse. Sørg for at have en sikkerhedskopi af ~/.gdpr_scanner.db først.",
"m365_db_import_replace_confirm": "Erstatningstilstand sletter ALLE eksisterende scanningsdata og gendanner fra arkivet.\\n\\nSørg for at have en manuel sikkerhedskopi af ~/.gdpr_scanner.db.\\n\\nFortsæt?",
"m365_db_import_no_file": "Vælg venligst en ZIP-fil først.",
"m365_db_importing": "Importerer…",
"m365_db_imported": "Importeret",
"m365_db_import_run": "Importer",
"m365_opt_scan_photos": "Søg efter ansigter i billeder",
"m365_opt_scan_photos_hint": "Markerer billeder med registrerede ansigter som Art. 9 biometriske data. Langsommere — aktivér efter behov.",
"m365_opt_skip_gps": "Ignorer GPS i billeder",
"m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.",
"m365_opt_min_cpr": "Min. CPR-antal pr. fil",
"m365_opt_scan_emails": "Søg efter e-mailadresser",
"m365_opt_scan_emails_hint": "Flagger filer med e-mailadresser. Slået fra som standard — e-mailadresser er meget almindelige og kan give mange resultater.",
"m365_opt_scan_phones": "Søg efter telefonnumre",
"m365_opt_scan_phones_hint": "Flagger filer med danske telefonnumre (8 cifre). Nyttigt til at finde kontaktlister og forældrekorrespondance.",
"m365_badge_emails": "e-mail",
"m365_badge_phones": "tlf.",
"m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
"m365_opt_cpr_only": "Kun CPR-tilstand",
"m365_opt_cpr_only_hint": "Flagger kun filer med CPR-numre. Filer med kun e-mailadresser, telefonnumre, ansigter eller EXIF-metadata ignoreres.",
"m365_opt_ocr_lang": "OCR-sprog",
"m365_opt_ocr_lang_hint": "Tesseract-sprogpakke(r) der bruges ved scanning af scannede PDF'er og billeder. Sprogpakker skal være installeret på serveren (f.eks. tesseract-ocr-dan). Flere pakker: dan+eng.",
"m365_filter_photo_only": "📷 Billeder / biometrisk",
"m365_filter_all_roles": "Alle roller",
"m365_filter_staff": "Ansatte",
"m365_filter_student": "Elever",
"m365_badge_faces": "ansigter",
"a30_photo_items": "Billeder med registrerede ansigter (Art. 9 biometrisk)",
"a30_photo_note": "Fotografier af identificerbare personer er biometriske data i henhold til Art. 9 GDPR. Opbevaring kræver et dokumenteret retsgrundlag i henhold til Art. 9(2). For skolefotografier af elever under 15 år er forældrenes samtykke påkrævet (Databeskyttelsesloven §6). Se Datatilsynets vejledning om fotografering i skoler.",
@ -610,47 +583,16 @@
"m365_file_sources_empty": "Ingen filkilder konfigureret. Tilføj en lokal mappe eller netværksdeling nedenfor.",
"m365_file_sources_add": "Tilføj kilde",
"m365_fsrc_label": "Betegnelse",
"m365_fsrc_name": "Navn",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Sti",
"m365_fsrc_smb_detected": "SMB/CIFS-netværksdeling registreret",
"m365_fsrc_smb_host": "SMB-vært",
"m365_fsrc_smb_user": "Brugernavn",
"m365_fsrc_smb_pw": "Adgangskode",
"m365_fsrc_smb_pw_hint": "Adgangskoden gemmes i nøglekæden — aldrig i en fil.",
"m365_fsrc_pw_keychain_placeholder": "Gemt i OS-nøglering",
"m365_fsrc_add_btn": "Tilføj",
"m365_fsrc_saved": "Kilde gemt",
"m365_fsrc_saving": "Gemmer...",
"m365_fsrc_path_required": "Sti er påkrævet.",
"m365_fsrc_type_local": "Lokal mappe",
"m365_fsrc_type_smb": "Netværksdrev (SMB)",
"m365_fsrc_type_sftp": "SFTP-server",
"m365_fsrc_sftp_host": "SFTP-host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Brugernavn",
"m365_fsrc_sftp_remote_path": "Fjernsti",
"m365_fsrc_sftp_auth_password": "Adgangskode",
"m365_fsrc_sftp_auth_key": "SSH-nøgle",
"m365_fsrc_sftp_pw": "Adgangskode",
"m365_fsrc_sftp_pw_hint": "Adgangskoden gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_key_upload": "Privat nøglefil",
"m365_fsrc_sftp_key_btn": "Upload nøgle",
"m365_fsrc_sftp_key_uploaded": "Nøgle uploadet",
"m365_fsrc_sftp_passphrase": "Adgangssætning (hvis nøglen er krypteret)",
"m365_fsrc_sftp_passphrase_hint": "Adgangssætningen gemmes i OS-nøgleringe — aldrig i en fil.",
"m365_fsrc_sftp_not_installed": "paramiko er ikke installeret — kør: pip install paramiko",
"m365_fsrc_name_placeholder": "f.eks. Lærerfiler, NAS-arkiv",
"m365_fsrc_path_placeholder": "~/Dokumenter eller //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.skole.dk",
"m365_fsrc_smb_user_placeholder": "DOMÆNE\\brugernavn",
"m365_fsrc_smb_user_edit_placeholder": "DOMÆNE\\brugernavn eller brugernavn",
"m365_fsrc_sftp_host_placeholder": "sftp.skole.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Lad stå tomt hvis nøglen ikke er krypteret",
"m365_fsrc_sftp_host_required": "SFTP-host er påkrævet.",
"m365_fsrc_sftp_user_required": "SFTP-brugernavn er påkrævet.",
"m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starter filscanning",
"m365_src_group_files": "Filkilder",
@ -677,14 +619,6 @@
"m365_settings_tab_general": "Generelt",
"m365_settings_tab_email": "E-mailrapport",
"m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Revisionslog",
"m365_audit_title": "Compliance-revisionslog",
"m365_audit_col_time": "Tidspunkt",
"m365_audit_col_action": "Handling",
"m365_audit_col_detail": "Detalje",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Indlæser…",
"m365_audit_empty": "Ingen revisionsbegivenheder registreret endnu.",
"m365_settings_appearance": "Udseende",
"m365_settings_language": "Sprog",
"m365_settings_theme": "Tema",
@ -721,23 +655,7 @@
"m365_smtp_test": "Test",
"m365_smtp_testing": "Sender test-email…",
"m365_smtp_test_ok": "Test-email sendt",
"m365_smtp_test_ok_graph": "Test-email sendt via Microsoft Graph til",
"m365_smtp_test_ok_smtp": "Test-email sendt via SMTP til",
"m365_smtp_graph_also_failed": "(⚠ Graph mislykkedes også — Mail.Send ikke tildelt)",
"m365_smtp_test_fail": "Forbindelse mislykkedes",
"bulk_select_mode": "Vælg",
"bulk_select_all": "Vælg alle synlige",
"bulk_deselect_all": "Fravælg alle",
"bulk_apply": "Anvend",
"bulk_done": "Afslut",
"bulk_selected": "valgt",
"bulk_applied": "opdateret",
"disp_stats_total": "total",
"disp_stats_unreviewed": "ikke gennemgået",
"disp_stats_retain": "behold",
"disp_stats_delete": "slet",
"disp_stats_other": "andet",
"disp_stats_reviewed": "gennemgået",
"m365_fsrc_edit_btn": "Rediger",
"m365_fsrc_save_changes": "Gem ændringer",
"m365_settings_tab_scheduler": "Planlægger",
@ -755,8 +673,6 @@
"m365_sched_after_scan": "Efter scanning",
"m365_sched_auto_email": "Send rapport automatisk",
"m365_sched_auto_retention": "Håndhæv opbevaringspolitik",
"m365_sched_report_only": "Kun rapport",
"m365_sched_report_only_hint": "Send de seneste scanningsresultater uden at køre en ny scanning. Kræver scanningsresultater i databasen.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Kør nu",
"m365_sched_add": "+ Tilføj planlagt scanning",
@ -765,9 +681,6 @@
"m365_sched_editor_edit": "Rediger planlagt scanning",
"m365_sched_name_required": "Navn er påkrævet",
"m365_sched_no_runs": "Ingen planlagte kørsler endnu",
"m365_sched_no_jobs": "Ingen planlagte scanninger endnu.",
"m365_sched_running": "Kører...",
"m365_sched_disabled": "Deaktiveret",
"m365_sched_freq_daily": "Dagligt",
"m365_sched_freq_weekly": "Ugentligt",
"m365_sched_freq_monthly": "Månedligt",
@ -815,7 +728,9 @@
"role_staff": "Ansat",
"role_student": "Elev",
"role_other": "Anden",
"m365_settings_tab_security": "Sikkerhed",
"share_modal_title": "Del resultater",
"share_modal_desc": "Skrivebeskyttede links lader en DPO eller gennemganger se resultater og tilknytte dispositioner uden adgang til scanningskontroller eller legitimationsoplysninger.",
"share_new_link": "Nyt link",
@ -844,18 +759,7 @@
"share_create_error": "Kunne ikke oprette link:",
"share_revoke_confirm": "Tilbagekald dette link? Alle der bruger det, mister straks adgang.",
"share_revoke_error": "Kunne ikke tilbagekalde:",
"share_scope_lbl": "Omfang",
"share_scope_all": "Alle",
"share_scope_type_role": "Rolle",
"share_scope_type_user": "Bruger",
"share_date_from": "Emner fra",
"share_date_to": "Emner til og med",
"share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Brugerens e-mail",
"share_scope_user_placeholder": "alice@skole.dk",
"share_scope_user_invalid": "Angiv venligst en gyldig e-mailadresse for brugeromfanget.",
"share_scope_staff": "Ansatte",
"share_scope_student": "Elever",
"viewer_pin_group_title": "Seerens PIN",
"viewer_pin_desc": "En numerisk PIN (48 cifre), der lader alle åbne <code style=\"font-size:10px\">/view</code> i en browser for skrivebeskyttet adgang til resultater uden et token-link.",
"viewer_pin_clear": "Ryd PIN",
@ -865,44 +769,5 @@
"viewer_pin_saving": "Gemmer…",
"viewer_pin_saved": "PIN gemt",
"viewer_pin_clear_confirm": "Fjern seerens PIN? /view vil igen kræve et token-link.",
"viewer_pin_cleared": "PIN ryddet",
"interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "En numerisk PIN-kode (48 cifre), der skal indtastes, inden man får adgang til selve scanneren. Seere, der tilgår <code style=\"font-size:10px\">/view</code>, er ikke berørt.",
"interface_pin_clear": "Ryd PIN",
"interface_pin_is_set": "Interface-PIN er angivet",
"interface_pin_not_set_msg": "Ingen PIN angivet — grænsefladen er åben for alle på netværket",
"interface_pin_saved": "PIN gemt",
"interface_pin_clear_confirm": "Fjern interface-PIN? Scanneren vil herefter være tilgængelig for alle på netværket.",
"interface_pin_cleared": "PIN ryddet",
"interface_pin_login_desc": "Indtast interface-PIN for at fortsætte.",
"interface_pin_login_btn": "Fortsæt",
"interface_pin_err_incorrect": "Forkert PIN.",
"interface_pin_err_too_many": "For mange forsøg. Prøv igen om lidt.",
"interface_pin_err_network": "Netværksfejl. Prøv igen.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-forbedret navnegenkendelse",
"m365_ai_desc": "Brug Claude AI i stedet for spaCy til navn-, adresse- og organisationsgenkendelse. Betydeligt mere nøjagtig på dansk tekst — særligt dobbeltefternavne og fremmedsprogede navne. Kræver en Anthropic API-nøgle; faktureres pr. token.",
"m365_ai_enable": "Aktiver Claude NER",
"m365_ai_api_key_label": "Anthropic API-nøgle",
"m365_ai_show_key": "Vis",
"m365_ai_hide_key": "Skjul",
"m365_ai_key_set": "API-nøgle gemt",
"m365_ai_key_not_set": "Ingen API-nøgle gemt",
"m365_ai_test": "Test nøgle",
"m365_ai_testing": "Tester…",
"m365_ai_test_ok": "API-nøgle er gyldig",
"m365_ai_test_fail": "Test mislykkedes",
"m365_ai_saved": "Gemt",
"m365_ai_model_note": "Model: claude-haiku-4-5 · faktureres efter Anthropics token-priser · resultater caches pr. dokument.",
"m365_settings_updates": "Softwareopdatering",
"m365_update_idle": "Tjek om der findes en nyere version.",
"m365_update_auto": "Installér opdateringer automatisk (tjekkes dagligt — programmet genstarter selv)",
"m365_update_check": "Søg efter opdateringer",
"m365_update_install": "Installér opdatering",
"m365_update_checking": "Tjekker…",
"m365_update_uptodate": "Du kører den nyeste version.",
"m365_update_available": "Opdatering tilgængelig",
"m365_update_installing": "Installerer opdatering — programmet genstarter…",
"m365_update_failed": "Opdateringstjek mislykkedes",
"m365_update_scan_running": "Kan ikke opdatere, mens en scanning kører."
"viewer_pin_cleared": "PIN ryddet"
}
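
Note on the locale changes: the same keys are dropped from both translation files, and one string (`m365_delta_tokens_saved`) differs between the two sides of the diff in whether it carries an `{n}` placeholder. A small consistency check can catch keys or placeholders that drift apart between languages. The sketch below assumes each locale file is a flat JSON object of string values, as the hunks suggest; `locale_diff` is a hypothetical helper, not part of the repository.

```python
import re

# Matches simple named placeholders such as {n}.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def locale_diff(reference: dict, other: dict) -> dict:
    """Report keys missing from `other`, extra keys in it, and placeholder mismatches."""
    ref_keys, other_keys = set(reference), set(other)
    mismatched = sorted(
        key
        for key in ref_keys & other_keys
        if set(_PLACEHOLDER.findall(reference[key]))
        != set(_PLACEHOLDER.findall(other[key]))
    )
    return {
        "missing": sorted(ref_keys - other_keys),
        "extra": sorted(other_keys - ref_keys),
        "placeholders": mismatched,
    }


if __name__ == "__main__":
    da = {
        "m365_delta_tokens_saved": "Tokens gemt for {n} kilde(r)",
        "m365_delta_clear": "Ryd tokens",
    }
    de = {
        "m365_delta_tokens_saved": "Tokens gespeichert",
        "history_lbl": "Verlauf",
    }
    print(locale_diff(da, de))
```

In practice the two dictionaries would come from `json.load` on the locale files; the inline dictionaries here only reuse a few strings from the diff for illustration.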

View File

@ -164,13 +164,6 @@
"lbl_working": "Wird bearbeitet…",
"lbl_stopping": "Wird gestoppt…",
"lbl_loading": "Wird geladen…",
"history_lbl": "Verlauf",
"history_items": "Treffer",
"history_btn_sessions": "Sessionen",
"history_btn_latest": "Offene Einträge",
"history_picker_empty": "Keine früheren Scans",
"history_delta_badge": "Delta",
"history_latest_badge": "Aktuell",
"lbl_blurred": "Unscharf gemacht",
"lbl_none": "Keine",
"lbl_size": "Größe",
@ -348,9 +341,8 @@
"m365_resuming": "Fortsetzen — bereits gescannte Elemente werden übersprungen…",
"m365_opt_delta": "Delta-Scan",
"m365_opt_delta_hint": "Nur geänderte Elemente (nach erstem Vollscan)",
"m365_delta_tokens_saved": "Tokens für {n} Quelle(n) gespeichert",
"m365_delta_tokens_saved": "Tokens gespeichert",
"m365_delta_clear": "Tokens löschen",
"m365_delta_tokens_hint": "Gespeicherte Änderungstokens lassen Delta-Scans nur Elemente abrufen, die seit dem letzten Scan geändert wurden. Tokens löschen erzwingt beim nächsten Scan einen Vollscan.",
"m365_delta_cleared": "Delta-Tokens gelöscht — nächster Scan wird ein Vollscan.",
"m365_delta_mode": "Delta-Modus — nur geänderte Elemente werden abgerufen…",
"m365_smtp_title": "✉ Bericht senden",
@ -365,8 +357,6 @@
"m365_smtp_recipients": "Empfänger",
"m365_smtp_recipients_hint": "Komma- oder semikolongetrennt",
"m365_smtp_save": "Speichern",
"m365_smtp_auto_email_manual": "Bericht nach manueller Suche senden",
"m365_smtp_prefer_smtp": "Immer via SMTP senden (Microsoft Graph überspringen)",
"m365_smtp_send": "Jetzt senden",
"m365_smtp_saved": "Einstellungen gespeichert.",
"m365_smtp_sending": "Senden…",
@ -561,32 +551,15 @@
"m365_db_import_mode": "Modus:",
"m365_db_import_merge": "Zusammenführen (sicher)",
"m365_db_import_replace": "Ersetzen (vollständige Wiederherstellung)",
"m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdprscanner/scanner.db haben.",
"m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdprscanner/scanner.db haben.\\n\\nFortfahren?",
"m365_db_import_replace_warn": "⚠ Der Ersetzungsmodus löscht alle vorhandenen Scandaten vor der Wiederherstellung. Stellen Sie sicher, dass Sie zuerst eine Sicherungskopie von ~/.gdpr_scanner.db haben.",
"m365_db_import_replace_confirm": "Der Ersetzungsmodus löscht ALLE vorhandenen Scandaten und stellt aus dem Archiv wieder her.\\n\\nStellen Sie sicher, dass Sie eine manuelle Sicherungskopie von ~/.gdpr_scanner.db haben.\\n\\nFortfahren?",
"m365_db_import_no_file": "Bitte wählen Sie zuerst eine ZIP-Datei aus.",
"m365_db_importing": "Importiere…",
"m365_db_imported": "Importiert",
"m365_db_import_run": "Importieren",
"m365_opt_scan_photos": "Fotos nach Gesichtern durchsuchen",
"m365_opt_scan_photos_hint": "Markiert Bilder mit erkannten Gesichtern als biometrische Daten gem. Art. 9. Langsamer — bei Bedarf aktivieren.",
"m365_opt_skip_gps": "GPS in Bildern ignorieren",
"m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.",
"m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei",
"m365_opt_scan_emails": "E-Mail-Adressen scannen",
"m365_opt_scan_emails_hint": "Markiert Dateien mit E-Mail-Adressen. Standardmäßig deaktiviert — E-Mail-Adressen sind sehr häufig und können viele Treffer erzeugen.",
"m365_opt_scan_phones": "Telefonnummern scannen",
"m365_opt_scan_phones_hint": "Markiert Dateien mit dänischen Telefonnummern (8 Ziffern). Nützlich zum Auffinden von Kontaktlisten.",
"m365_badge_emails": "E-Mail",
"m365_badge_phones": "Tel.",
"m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
"m365_opt_cpr_only": "Nur-CPR-Modus",
"m365_opt_cpr_only_hint": "Markiert nur Dateien mit CPR-Nummern. Dateien mit nur E-Mail-Adressen, Telefonnummern, Gesichtern oder EXIF-Metadaten werden ignoriert.",
"m365_opt_ocr_lang": "OCR-Sprache",
"m365_opt_ocr_lang_hint": "Tesseract-Sprachpaket(e) für das Scannen von gescannten PDFs und Bildern. Pakete müssen auf dem Server installiert sein (z.B. tesseract-ocr-dan). Mehrere Pakete: dan+eng.",
"m365_filter_photo_only": "📷 Fotos / biometrisch",
"m365_filter_all_roles": "Alle Rollen",
"m365_filter_staff": "Personal",
"m365_filter_student": "Schüler",
"m365_badge_faces": "Gesichter",
"a30_photo_items": "Fotos mit erkannten Gesichtern (Art. 9 biometrisch)",
"a30_photo_note": "Fotografien identifizierbarer Personen sind biometrische Daten gemäß Art. 9 DSGVO. Die Aufbewahrung erfordert eine dokumentierte Rechtsgrundlage gemäß Art. 9(2). Für Schulfotos von Schülern unter 15 Jahren ist die elterliche Einwilligung erforderlich (Databeskyttelsesloven §6). Siehe Leitfaden des Datatilsynet zur Schulfotografie.",
@ -610,47 +583,16 @@
"m365_file_sources_empty": "Keine Dateiquellen konfiguriert. Fügen Sie unten einen lokalen Ordner oder eine Netzwerkfreigabe hinzu.",
"m365_file_sources_add": "Quelle hinzufügen",
"m365_fsrc_label": "Bezeichnung",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Pfad",
"m365_fsrc_smb_detected": "SMB/CIFS-Netzwerkfreigabe erkannt",
"m365_fsrc_smb_host": "SMB-Host",
"m365_fsrc_smb_user": "Benutzername",
"m365_fsrc_smb_pw": "Passwort",
"m365_fsrc_smb_pw_hint": "Das Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_pw_keychain_placeholder": "Im OS-Schlüsselbund gespeichert",
"m365_fsrc_add_btn": "Hinzufügen",
"m365_fsrc_saved": "Quelle gespeichert",
"m365_fsrc_saving": "Speichern...",
"m365_fsrc_path_required": "Pfad ist erforderlich.",
"m365_fsrc_type_local": "Lokaler Ordner",
"m365_fsrc_type_smb": "Netzwerkfreigabe (SMB)",
"m365_fsrc_type_sftp": "SFTP-Server",
"m365_fsrc_sftp_host": "SFTP-Host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Benutzername",
"m365_fsrc_sftp_remote_path": "Remote-Pfad",
"m365_fsrc_sftp_auth_password": "Passwort",
"m365_fsrc_sftp_auth_key": "SSH-Schlüssel",
"m365_fsrc_sftp_pw": "Passwort",
"m365_fsrc_sftp_pw_hint": "Passwort wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_key_upload": "Private Schlüsseldatei",
"m365_fsrc_sftp_key_btn": "Schlüssel hochladen",
"m365_fsrc_sftp_key_uploaded": "Schlüssel hochgeladen",
"m365_fsrc_sftp_passphrase": "Passphrase (wenn Schlüssel verschlüsselt ist)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase wird im OS-Schlüsselbund gespeichert — nie in einer Datei.",
"m365_fsrc_sftp_not_installed": "paramiko nicht installiert — ausführen: pip install paramiko",
"m365_fsrc_name_placeholder": "z.B. Lehrerdateien, NAS-Archiv",
"m365_fsrc_path_placeholder": "~/Dokumente oder //nas/freigaben",
"m365_fsrc_smb_host_placeholder": "nas.schule.de",
"m365_fsrc_smb_user_placeholder": "DOMÄNE\\Benutzername",
"m365_fsrc_smb_user_edit_placeholder": "DOMÄNE\\Benutzername oder Benutzername",
"m365_fsrc_sftp_host_placeholder": "sftp.schule.de",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leer lassen, wenn der Schlüssel nicht verschlüsselt ist",
"m365_fsrc_sftp_host_required": "SFTP-Host ist erforderlich.",
"m365_fsrc_sftp_user_required": "SFTP-Benutzername ist erforderlich.",
"m365_fsrc_scan_btn": "Scannen",
"m365_fsrc_scan_start": "Datei-Scan wird gestartet",
"m365_src_group_files": "Dateiquellen",
@ -677,14 +619,6 @@
"m365_settings_tab_general": "Allgemein",
"m365_settings_tab_email": "E-Mail-Bericht",
"m365_settings_tab_database": "Datenbank",
"m365_settings_tab_auditlog": "Prüfprotokoll",
"m365_audit_title": "Compliance-Prüfprotokoll",
"m365_audit_col_time": "Zeitpunkt",
"m365_audit_col_action": "Aktion",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Wird geladen…",
"m365_audit_empty": "Noch keine Prüfereignisse aufgezeichnet.",
"m365_settings_appearance": "Erscheinungsbild",
"m365_settings_language": "Sprache",
"m365_settings_theme": "Design",
@ -721,23 +655,7 @@
"m365_smtp_test": "Testen",
"m365_smtp_testing": "Test-E-Mail wird gesendet…",
"m365_smtp_test_ok": "Test-E-Mail gesendet",
"m365_smtp_test_ok_graph": "Test-E-Mail über Microsoft Graph gesendet an",
"m365_smtp_test_ok_smtp": "Test-E-Mail über SMTP gesendet an",
"m365_smtp_graph_also_failed": "(⚠ Graph fehlgeschlagen — Mail.Send nicht erteilt)",
"m365_smtp_test_fail": "Verbindung fehlgeschlagen",
"bulk_select_mode": "Auswählen",
"bulk_select_all": "Alle sichtbaren auswählen",
"bulk_deselect_all": "Alle abwählen",
"bulk_apply": "Anwenden",
"bulk_done": "Fertig",
"bulk_selected": "ausgewählt",
"bulk_applied": "aktualisiert",
"disp_stats_total": "gesamt",
"disp_stats_unreviewed": "nicht überprüft",
"disp_stats_retain": "behalten",
"disp_stats_delete": "löschen",
"disp_stats_other": "sonstige",
"disp_stats_reviewed": "überprüft",
"m365_fsrc_edit_btn": "Bearbeiten",
"m365_fsrc_save_changes": "Änderungen speichern",
"m365_settings_tab_scheduler": "Zeitplaner",
@ -755,8 +673,6 @@
"m365_sched_after_scan": "Nach dem Scan",
"m365_sched_auto_email": "Bericht automatisch senden",
"m365_sched_auto_retention": "Aufbewahrungsrichtlinie durchsetzen",
"m365_sched_report_only": "Nur Bericht",
"m365_sched_report_only_hint": "Letzte Scanergebnisse senden, ohne einen neuen Scan durchzuführen. Erfordert Scanergebnisse in der Datenbank.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Jetzt ausführen",
"m365_sched_add": "+ Geplante Suche hinzufügen",
@ -765,9 +681,6 @@
"m365_sched_editor_edit": "Geplante Suche bearbeiten",
"m365_sched_name_required": "Name ist erforderlich",
"m365_sched_no_runs": "Noch keine geplanten Läufe",
"m365_sched_no_jobs": "Noch keine geplanten Scans.",
"m365_sched_running": "Läuft...",
"m365_sched_disabled": "Deaktiviert",
"m365_sched_freq_daily": "Täglich",
"m365_sched_freq_weekly": "Wöchentlich",
"m365_sched_freq_monthly": "Monatlich",
@ -815,7 +728,9 @@
"role_staff": "Personal",
"role_student": "Schüler",
"role_other": "Andere",
"m365_settings_tab_security": "Sicherheit",
"share_modal_title": "Ergebnisse teilen",
"share_modal_desc": "Schreibgeschützte Links ermöglichen einem Datenschutzbeauftragten oder Prüfer, Ergebnisse einzusehen und Verwendungszwecke zuzuweisen, ohne Zugriff auf Scansteuerung oder Anmeldedaten.",
"share_new_link": "Neuer Link",
@ -844,20 +759,9 @@
"share_create_error": "Link konnte nicht erstellt werden:",
"share_revoke_confirm": "Diesen Link widerrufen? Alle Nutzer verlieren sofort den Zugriff.",
"share_revoke_error": "Widerrufen fehlgeschlagen:",
"share_scope_lbl": "Bereich",
"share_scope_all": "Alle",
"share_scope_type_role": "Rolle",
"share_scope_type_user": "Benutzer",
"share_date_from": "Elemente ab",
"share_date_to": "Elemente bis",
"share_scope_role_lbl": "Rolle",
"share_scope_user_lbl": "Benutzer-E-Mail",
"share_scope_user_placeholder": "alice@schule.de",
"share_scope_user_invalid": "Bitte gib eine gültige E-Mail-Adresse für den Benutzerbereich an.",
"share_scope_staff": "Mitarbeitende",
"share_scope_student": "Schüler",
"viewer_pin_group_title": "Betrachter-PIN",
"viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen ohne Token-Link.",
"viewer_pin_desc": "Eine numerische PIN (48 Stellen), die es jedem ermöglicht, <code style=\"font-size:10px\">/view</code> im Browser zu öffnen und schreibgeschützt auf Ergebnisse zuzugreifen \u2013 ohne Token-Link.",
"viewer_pin_clear": "PIN löschen",
"viewer_pin_is_set": "Betrachter-PIN ist festgelegt",
"viewer_pin_not_set_msg": "Keine PIN festgelegt — /view erfordert einen Token-Link",
@ -865,44 +769,5 @@
"viewer_pin_saving": "Wird gespeichert…",
"viewer_pin_saved": "PIN gespeichert",
"viewer_pin_clear_confirm": "Betrachter-PIN entfernen? /view erfordert dann wieder einen Token-Link.",
"viewer_pin_cleared": "PIN gelöscht",
"interface_pin_group_title": "Interface-PIN",
"interface_pin_desc": "Eine numerische PIN (4–8 Stellen), die eingegeben werden muss, bevor auf die Scanner-Oberfläche zugegriffen werden kann. Betrachter, die <code style=\"font-size:10px\">/view</code> aufrufen, sind nicht betroffen.",
"interface_pin_clear": "PIN löschen",
"interface_pin_is_set": "Interface-PIN ist gesetzt",
"interface_pin_not_set_msg": "Keine PIN gesetzt — Oberfläche ist für alle im Netzwerk offen",
"interface_pin_saved": "PIN gespeichert",
"interface_pin_clear_confirm": "Interface-PIN entfernen? Der Scanner ist dann für alle im Netzwerk zugänglich.",
"interface_pin_cleared": "PIN gelöscht",
"interface_pin_login_desc": "Interface-PIN eingeben, um fortzufahren.",
"interface_pin_login_btn": "Weiter",
"interface_pin_err_incorrect": "Falsche PIN.",
"interface_pin_err_too_many": "Zu viele Versuche. Bitte später erneut versuchen.",
"interface_pin_err_network": "Netzwerkfehler. Bitte erneut versuchen.",
"m365_settings_tab_ai": "KI / NER",
"m365_ai_title": "KI-gestützte Entitätserkennung",
"m365_ai_desc": "Claude KI statt spaCy für Name-, Adress- und Organisationserkennung verwenden. Deutlich genauer bei dänischen Texten — insbesondere bei Doppelnamen und fremdsprachigen Namen. Benötigt einen Anthropic-API-Schlüssel; Abrechnung per Token.",
"m365_ai_enable": "Claude NER aktivieren",
"m365_ai_api_key_label": "Anthropic-API-Schlüssel",
"m365_ai_show_key": "Anzeigen",
"m365_ai_hide_key": "Ausblenden",
"m365_ai_key_set": "API-Schlüssel gespeichert",
"m365_ai_key_not_set": "Kein API-Schlüssel gespeichert",
"m365_ai_test": "Schlüssel testen",
"m365_ai_testing": "Wird getestet…",
"m365_ai_test_ok": "API-Schlüssel gültig",
"m365_ai_test_fail": "Test fehlgeschlagen",
"m365_ai_saved": "Gespeichert",
"m365_ai_model_note": "Modell: claude-haiku-4-5 · Abrechnung nach Anthropic-Token-Tarifen · Ergebnisse werden pro Dokument gecacht.",
"m365_settings_updates": "Softwareaktualisierung",
"m365_update_idle": "Prüfen, ob eine neuere Version verfügbar ist.",
"m365_update_auto": "Updates automatisch installieren (tägliche Prüfung — die App startet sich selbst neu)",
"m365_update_check": "Nach Updates suchen",
"m365_update_install": "Update installieren",
"m365_update_checking": "Wird geprüft…",
"m365_update_uptodate": "Sie verwenden die neueste Version.",
"m365_update_available": "Update verfügbar",
"m365_update_installing": "Update wird installiert — die App startet neu…",
"m365_update_failed": "Updateprüfung fehlgeschlagen",
"m365_update_scan_running": "Update nicht möglich, während ein Scan läuft."
"viewer_pin_cleared": "PIN gelöscht"
}

View File

@ -103,13 +103,6 @@
"lbl_time": "Time",
"lbl_space": "Space",
"lbl_loading": "Loading…",
"history_lbl": "History",
"history_items": "items",
"history_btn_sessions": "Sessions",
"history_btn_latest": "Open items",
"history_picker_empty": "No past scans",
"history_delta_badge": "Delta",
"history_latest_badge": "Latest",
"lbl_blurred": "Blurred",
"lbl_none": "None",
"lbl_scanner": "Scanner",
@ -348,9 +341,8 @@
"m365_resuming": "Resuming — skipping already-scanned items…",
"m365_opt_delta": "Delta scan",
"m365_opt_delta_hint": "Changed items only (after first full scan)",
"m365_delta_tokens_saved": "Tokens saved for {n} source(s)",
"m365_delta_tokens_saved": "Tokens saved",
"m365_delta_clear": "Clear tokens",
"m365_delta_tokens_hint": "Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.",
"m365_delta_cleared": "Delta tokens cleared — next scan will be a full scan.",
"m365_delta_mode": "Delta mode — fetching changed items only…",
"m365_smtp_title": "✉ Email report",
@ -365,8 +357,6 @@
"m365_smtp_recipients": "Recipients",
"m365_smtp_recipients_hint": "Comma or semicolon separated",
"m365_smtp_save": "Save",
"m365_smtp_auto_email_manual": "Email report after manual scan",
"m365_smtp_prefer_smtp": "Always send via SMTP (skip Microsoft Graph)",
"m365_smtp_send": "Send now",
"m365_smtp_saved": "Settings saved.",
"m365_smtp_sending": "Sending…",
@ -561,32 +551,15 @@
"m365_db_import_mode": "Mode:",
"m365_db_import_merge": "Merge (safe)",
"m365_db_import_replace": "Replace (full restore)",
"m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.",
"m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\\n\\nProceed?",
"m365_db_import_replace_warn": "⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.",
"m365_db_import_replace_confirm": "Replace mode will erase ALL existing scan data and restore from the archive.\\n\\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\\n\\nProceed?",
"m365_db_import_no_file": "Please select a ZIP file first.",
"m365_db_importing": "Importing…",
"m365_db_imported": "Imported",
"m365_db_import_run": "Import",
"m365_opt_scan_photos": "Scan photos for faces",
"m365_opt_scan_photos_hint": "Flags images with detected faces as Art. 9 biometric data. Slower — opt in.",
"m365_opt_skip_gps": "Ignore GPS in images",
"m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.",
"m365_opt_min_cpr": "Min. CPR count per file",
"m365_opt_scan_emails": "Scan for email addresses",
"m365_opt_scan_emails_hint": "Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.",
"m365_opt_scan_phones": "Scan for phone numbers",
"m365_opt_scan_phones_hint": "Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.",
"m365_badge_emails": "email",
"m365_badge_phones": "phone",
"m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
"m365_opt_cpr_only": "CPR-only mode",
"m365_opt_cpr_only_hint": "Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, detected faces, or EXIF metadata are skipped.",
"m365_opt_ocr_lang": "OCR language",
"m365_opt_ocr_lang_hint": "Tesseract language pack(s) used when scanning scanned PDFs and images. Language packs must be installed on the server (e.g. tesseract-ocr-dan). Multiple packs: dan+eng.",
"m365_filter_photo_only": "📷 Photos / biometric",
"m365_filter_all_roles": "All roles",
"m365_filter_staff": "Staff",
"m365_filter_student": "Students",
"m365_badge_faces": "faces",
"a30_photo_items": "Photos with detected faces (Art. 9 biometric)",
"a30_photo_note": "Photographs of identifiable persons are biometric data under Art. 9 GDPR. Retention requires a documented legal basis under Art. 9(2). For school photographs of pupils under 15, parental consent is required (Databeskyttelsesloven §6). See Datatilsynet guidance on school photography.",
@ -610,47 +583,16 @@
"m365_file_sources_empty": "No file sources configured. Add a local folder or network share below.",
"m365_file_sources_add": "Add source",
"m365_fsrc_label": "Label",
"m365_fsrc_name": "Name",
"m365_fsrc_sftp_auth": "Auth",
"m365_fsrc_path": "Path",
"m365_fsrc_smb_detected": "SMB/CIFS network share detected",
"m365_fsrc_smb_host": "SMB host",
"m365_fsrc_smb_user": "Username",
"m365_fsrc_smb_pw": "Password",
"m365_fsrc_smb_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_pw_keychain_placeholder": "Stored in OS keychain",
"m365_fsrc_add_btn": "Add",
"m365_fsrc_saved": "Source saved",
"m365_fsrc_saving": "Saving...",
"m365_fsrc_path_required": "Path is required.",
"m365_fsrc_type_local": "Local folder",
"m365_fsrc_type_smb": "Network share (SMB)",
"m365_fsrc_type_sftp": "SFTP server",
"m365_fsrc_sftp_host": "SFTP host",
"m365_fsrc_sftp_port": "Port",
"m365_fsrc_sftp_user": "Username",
"m365_fsrc_sftp_remote_path": "Remote path",
"m365_fsrc_sftp_auth_password": "Password",
"m365_fsrc_sftp_auth_key": "SSH key",
"m365_fsrc_sftp_pw": "Password",
"m365_fsrc_sftp_pw_hint": "Password is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_key_upload": "Private key file",
"m365_fsrc_sftp_key_btn": "Upload key",
"m365_fsrc_sftp_key_uploaded": "Key uploaded",
"m365_fsrc_sftp_passphrase": "Passphrase (if key is encrypted)",
"m365_fsrc_sftp_passphrase_hint": "Passphrase is saved to the OS keychain — never stored in a file.",
"m365_fsrc_sftp_not_installed": "paramiko not installed — run: pip install paramiko",
"m365_fsrc_name_placeholder": "e.g. Teacher files, NAS archive",
"m365_fsrc_path_placeholder": "~/Documents or //nas/shares",
"m365_fsrc_smb_host_placeholder": "nas.school.dk",
"m365_fsrc_smb_user_placeholder": "DOMAIN\\username",
"m365_fsrc_smb_user_edit_placeholder": "DOMAIN\\username or username",
"m365_fsrc_sftp_host_placeholder": "sftp.school.dk",
"m365_fsrc_sftp_user_placeholder": "backup_user",
"m365_fsrc_sftp_path_placeholder": "/var/data",
"m365_fsrc_sftp_passphrase_placeholder": "Leave blank if key has no passphrase",
"m365_fsrc_sftp_host_required": "SFTP host is required.",
"m365_fsrc_sftp_user_required": "SFTP username is required.",
"m365_fsrc_scan_btn": "Scan",
"m365_fsrc_scan_start": "Starting file scan",
"m365_src_group_files": "File sources",
@ -677,14 +619,6 @@
"m365_settings_tab_general": "General",
"m365_settings_tab_email": "Email report",
"m365_settings_tab_database": "Database",
"m365_settings_tab_auditlog": "Audit Log",
"m365_audit_title": "Compliance Audit Log",
"m365_audit_col_time": "Time",
"m365_audit_col_action": "Action",
"m365_audit_col_detail": "Detail",
"m365_audit_col_ip": "IP",
"m365_audit_loading": "Loading…",
"m365_audit_empty": "No audit events recorded yet.",
"m365_settings_appearance": "Appearance",
"m365_settings_language": "Language",
"m365_settings_theme": "Theme",
@ -721,23 +655,7 @@
"m365_smtp_test": "Test",
"m365_smtp_testing": "Sending test email…",
"m365_smtp_test_ok": "Test email sent",
"m365_smtp_test_ok_graph": "Test email sent via Microsoft Graph to",
"m365_smtp_test_ok_smtp": "Test email sent via SMTP to",
"m365_smtp_graph_also_failed": "(⚠ Graph also failed — Mail.Send not granted)",
"m365_smtp_test_fail": "Connection failed",
"bulk_select_mode": "Select",
"bulk_select_all": "Select all visible",
"bulk_deselect_all": "Deselect all",
"bulk_apply": "Apply",
"bulk_done": "Done",
"bulk_selected": "selected",
"bulk_applied": "updated",
"disp_stats_total": "total",
"disp_stats_unreviewed": "unreviewed",
"disp_stats_retain": "retain",
"disp_stats_delete": "delete",
"disp_stats_other": "other",
"disp_stats_reviewed": "reviewed",
"m365_fsrc_edit_btn": "Edit",
"m365_fsrc_save_changes": "Save changes",
"m365_settings_tab_scheduler": "Scheduler",
@ -755,8 +673,6 @@
"m365_sched_after_scan": "After scan",
"m365_sched_auto_email": "Email report automatically",
"m365_sched_auto_retention": "Enforce retention policy",
"m365_sched_report_only": "Report only",
"m365_sched_report_only_hint": "Email the latest scan results without running a new scan. Requires scan results in the database.",
"m365_sched_status": "Status",
"m365_sched_run_now": "▶ Run now",
"m365_sched_add": "+ Add scheduled scan",
@ -765,9 +681,6 @@
"m365_sched_editor_edit": "Edit scheduled scan",
"m365_sched_name_required": "Name is required",
"m365_sched_no_runs": "No scheduled runs yet",
"m365_sched_no_jobs": "No scheduled scans yet.",
"m365_sched_running": "Running...",
"m365_sched_disabled": "Disabled",
"m365_sched_freq_daily": "Daily",
"m365_sched_freq_weekly": "Weekly",
"m365_sched_freq_monthly": "Monthly",
@ -815,7 +728,9 @@
"role_staff": "Staff",
"role_student": "Student",
"role_other": "Other",
"m365_settings_tab_security": "Security",
"share_modal_title": "Share results",
"share_modal_desc": "Read-only links let a DPO or reviewer browse results and tag dispositions without access to scan controls or credentials.",
"share_new_link": "New link",
@ -844,65 +759,15 @@
"share_create_error": "Failed to create link:",
"share_revoke_confirm": "Revoke this link? Anyone using it will immediately lose access.",
"share_revoke_error": "Failed to revoke:",
"share_scope_lbl": "Scope",
"share_scope_all": "All",
"share_scope_type_role": "Role",
"share_scope_type_user": "User",
"share_date_from": "Items from",
"share_date_to": "Items until",
"share_scope_role_lbl": "Role",
"share_scope_user_lbl": "User email",
"share_scope_user_placeholder": "alice@school.dk",
"share_scope_user_invalid": "Please enter a valid email address for the user scope.",
"share_scope_staff": "Staff",
"share_scope_student": "Students",
"viewer_pin_group_title": "Viewer PIN",
"viewer_pin_desc": "A numeric PIN (4–8 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.",
"viewer_pin_desc": "A numeric PIN (4\u20138 digits) that lets anyone open <code style=\"font-size:10px\">/view</code> in a browser for read-only access to results without a token URL.",
"viewer_pin_clear": "Clear PIN",
"viewer_pin_is_set": "Viewer PIN is set",
"viewer_pin_not_set_msg": "No PIN set — /view requires a token link",
"viewer_pin_format": "PIN must be 4–8 digits.",
"viewer_pin_saving": "Saving…",
"viewer_pin_not_set_msg": "No PIN set \u2014 /view requires a token link",
"viewer_pin_format": "PIN must be 4\u20138 digits.",
"viewer_pin_saving": "Saving\u2026",
"viewer_pin_saved": "PIN saved",
"viewer_pin_clear_confirm": "Remove the viewer PIN? /view will require a token link again.",
"viewer_pin_cleared": "PIN cleared",
"interface_pin_group_title": "Interface PIN",
"interface_pin_desc": "A numeric PIN (4–8 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style=\"font-size:10px\">/view</code> are not affected.",
"interface_pin_clear": "Clear PIN",
"interface_pin_is_set": "Interface PIN is set",
"interface_pin_not_set_msg": "No PIN set — interface is open to anyone on the network",
"interface_pin_saved": "PIN saved",
"interface_pin_clear_confirm": "Remove the interface PIN? The scanner will be accessible to anyone on the network.",
"interface_pin_cleared": "PIN cleared",
"interface_pin_login_desc": "Enter the interface PIN to continue.",
"interface_pin_login_btn": "Continue",
"interface_pin_err_incorrect": "Incorrect PIN.",
"interface_pin_err_too_many": "Too many attempts. Try again later.",
"interface_pin_err_network": "Network error. Please try again.",
"m365_settings_tab_ai": "AI / NER",
"m365_ai_title": "AI-Enhanced Named Entity Recognition",
"m365_ai_desc": "Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.",
"m365_ai_enable": "Enable Claude NER",
"m365_ai_api_key_label": "Anthropic API key",
"m365_ai_show_key": "Show",
"m365_ai_hide_key": "Hide",
"m365_ai_key_set": "API key saved",
"m365_ai_key_not_set": "No API key saved",
"m365_ai_test": "Test key",
"m365_ai_testing": "Testing…",
"m365_ai_test_ok": "API key valid",
"m365_ai_test_fail": "Test failed",
"m365_ai_saved": "Saved",
"m365_ai_model_note": "Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.",
"m365_settings_updates": "Software update",
"m365_update_idle": "Check whether a newer version is available.",
"m365_update_auto": "Install updates automatically (checked daily — the app restarts itself)",
"m365_update_check": "Check for updates",
"m365_update_install": "Install update",
"m365_update_checking": "Checking…",
"m365_update_uptodate": "You are running the latest version.",
"m365_update_available": "Update available",
"m365_update_installing": "Installing update — the app will restart…",
"m365_update_failed": "Update check failed",
"m365_update_scan_running": "Cannot update while a scan is running."
"viewer_pin_cleared": "PIN cleared"
}

View File

@ -39,11 +39,9 @@ except ImportError:
GRAPH_BASE = "https://graph.microsoft.com/v1.0"
# Delegated scopes — used when signing in as a specific user (device code flow)
# Files.ReadWrite.All is a superset of Files.Read.All; required for in-place
# OneDrive/SharePoint/Teams redaction (PUT /drives/{id}/items/{id}/content).
SCOPES = [
"Mail.Read",
"Files.ReadWrite.All",
"Files.Read.All",
"Sites.Read.All",
"Team.ReadBasic.All",
"ChannelMessage.Read.All",
@ -84,9 +82,8 @@ class M365PermissionError(M365Error):
f"to access this resource.\n"
f" Path: {path}\n"
f" Fix: the signed-in user must be a Global/Exchange Admin, OR an admin must "
f"grant Application permissions (Mail.Read, Files.ReadWrite.All, Sites.Read.All) "
f"in Azure → App registrations → API permissions → Grant admin consent.\n"
f" Note: Files.ReadWrite.All (not Files.Read.All) is required for file redaction."
f"grant Application permissions (Mail.Read, Files.Read.All, Sites.Read.All) "
f"in Azure → App registrations → API permissions → Grant admin consent."
)
@ -96,17 +93,6 @@ class M365DeltaTokenExpired(M365Error):
pass
class M365DriveNotFound(M365Error):
"""Raised when the Graph API returns 404 for a drive/root path.
Common causes: OneDrive licence not assigned, service plan disabled,
drive not yet provisioned (user has never signed in), or account
suspended/deleted. Not a scan error — callers should skip the user
and log at a lower severity.
"""
pass
class M365Connector:
def __init__(self, client_id: str, tenant_id: str, client_secret: str = ""):
if not MSAL_OK:
@ -439,8 +425,6 @@ class M365Connector:
except Exception:
msg = r.text[:200]
raise M365PermissionError(path, msg)
if r.status_code == 404:
raise M365DriveNotFound(f"404 Not Found: {path}")
r.raise_for_status()
return r.json()
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
@ -476,7 +460,7 @@ class M365Connector:
msg = r.text[:200]
raise M365PermissionError(path, msg)
r.raise_for_status()
return r.json() if r.content else {}
return r.json()
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
def _get_bytes(self, url: str, _retry: bool = True) -> bytes:
@ -552,8 +536,6 @@ class M365Connector:
r.raise_for_status()
return True # 204 No Content = success
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
def delete_message(self, user_id: str, message_id: str) -> bool:
"""Move an email to Deleted Items (soft delete)."""
base = "/me" if (not user_id or user_id == "me") else f"/users/{user_id}"
try:
@ -890,50 +872,6 @@ class M365Connector:
url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
return self._get_bytes(url)
def put_drive_item_content(self, drive_id: str, item_id: str, content: bytes,
user_id: str = "") -> None:
"""Replace file content via Graph. Tries drives/{drive_id} first; falls back
to users/{user_id}/drive when drive_id is absent, then /me/drive."""
if drive_id:
url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/content"
elif user_id and user_id != "me":
url = f"{GRAPH_BASE}/users/{user_id}/drive/items/{item_id}/content"
else:
url = f"{GRAPH_BASE}/me/drive/items/{item_id}/content"
for attempt in range(self._MAX_RETRIES):
try:
r = _requests.put(url, headers={**self._headers(),
"Content-Type": "application/octet-stream"},
data=content, timeout=self._TIMEOUT_BYTES)
except self._RETRYABLE_ERRORS:
if attempt == self._MAX_RETRIES - 1:
raise
self._backoff_sleep(attempt)
continue
if r.status_code == 429:
self._backoff_sleep(attempt, float(r.headers.get("Retry-After", 5)))
continue
if r.status_code in (503, 504):
if attempt < self._MAX_RETRIES - 1:
self._backoff_sleep(attempt)
continue
if r.status_code == 401 and attempt == 0:
self._token = None
if self.try_silent_auth():
self.put_drive_item_content(drive_id, item_id, content, user_id)
return
if r.status_code == 403:
try:
msg = r.json().get("error", {}).get("message", "")
except Exception:
msg = r.text[:200]
raise M365PermissionError(url, msg)
r.raise_for_status()
return
raise _requests.exceptions.RetryError(f"Gave up after {self._MAX_RETRIES} attempts: {url}")
# ── Teams ─────────────────────────────────────────────────────────────────
def list_all_teams(self) -> list:

View File

@ -13,11 +13,10 @@ pdfplumber>=0.11 # PDF text extraction
python-docx>=1.1 # Word document scanning
openpyxl>=3.1 # Excel scanning + export
# ── Image / video processing ─────────────────────────────────────────────────
# ── Image processing ──────────────────────────────────────────────────────────
Pillow>=10.0 # Image thumbnails + EXIF extraction (always-on)
opencv-python>=4.9 # Face detection (opt-in — Scan photos for faces)
numpy>=1.26 # Required by opencv-python
mutagen>=1.47 # Video metadata extraction (MP4/MOV/AVI — GPS, author, title)
# ── NER / PII detection ───────────────────────────────────────────────────────
# spaCy 3.7 supports Python 3.8–3.12. Do NOT upgrade past Python 3.12.
@ -37,16 +36,12 @@ pystray>=0.19 # System tray icon
# ── File system scanning (optional) ──────────────────────────────────────────
smbprotocol>=1.13 # SMB2/3 network share scanning without mounting
paramiko>=3.4 # SFTP scanning over SSH
keyring>=25.0 # OS keychain credential storage for SMB/SFTP passwords
keyring>=25.0 # OS keychain credential storage for SMB passwords
python-dotenv>=1.0 # .env file fallback for headless SMB credentials
# ── Scheduler (#19) ──────────────────────────────────────────────────────────
APScheduler>=3.10 # In-process scheduled scans
# ── AI NER (Claude) ──────────────────────────────────────────────────────────
anthropic>=0.40.0 # Claude API client for AI-enhanced NER
# ── Google Workspace scanning (#10) ──────────────────────────────────────────
google-auth>=2.0 # Service account + domain-wide delegation
google-auth-httplib2 # HTTP transport for google-auth

View File

@ -5,8 +5,6 @@ SSE routes must live in `gdpr_scanner.py`, not blueprints — blueprints can't s
M365 scan emits `scan_done`; Google emits `google_scan_done`; file scan emits `file_scan_done`. Never mix them up.
**`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` (scan.js) unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set with no matching `scan_done` to clear it — `file_scan_done` checks `!S._m365ScanRunning` before re-enabling the scan button, so the button stays disabled permanently after the scan completes.
## scan_progress source field
All three scan engines must include `"source": "m365"` / `"google"` / `"file"` in every `scan_progress` SSE event. Never remove this field — the frontend uses it to route progress to the correct segment.
@ -16,102 +14,6 @@ All three scan engines must include `"source": "m365"` / `"google"` / `"file"` i
## Circular import prohibition
`scan_engine.py` and `gdpr_scanner.py` must not import each other. `scan_engine` imports from `sse`, `checkpoint`, `app_config`, `cpr_detector`; `gdpr_scanner` imports scan functions from `scan_engine`.
## `_scan_bytes` injection
`scan_engine.py` declares stub versions of `_scan_bytes` / `_scan_bytes_timeout` at module level. `gdpr_scanner.py` replaces them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` pulls them from `gdpr_scanner` via `__getattr__`. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
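The injection pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the `engine` namespace stands in for `scan_engine.py`, and `real_scan_bytes` and `inject_scanners` are hypothetical placeholders for the `cpr_detector` implementation and the startup wiring in `gdpr_scanner.py`.

```python
from types import SimpleNamespace


def _stub_scan_bytes(data: bytes, filename: str) -> dict:
    """Module-level stub: fails loudly if used before injection."""
    raise RuntimeError("_scan_bytes has not been injected yet")


# Stand-in for the scan_engine module namespace (assumption for this sketch).
engine = SimpleNamespace(_scan_bytes=_stub_scan_bytes)


def scan_item(data: bytes, filename: str) -> dict:
    # Resolve the attribute at call time, so a later replacement is picked up.
    return engine._scan_bytes(data, filename)


def real_scan_bytes(data: bytes, filename: str) -> dict:
    """Hypothetical stand-in for the real detector implementation."""
    return {"filename": filename, "size": len(data)}


def inject_scanners() -> None:
    """What the startup code would do: swap the stub for the real function."""
    engine._scan_bytes = real_scan_bytes
```

Because `scan_item` looks the function up on the namespace on every call, neither side has to import the other at module load time, which is the point of the barrier.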
## M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler and broadcast as a red `scan_error`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped silently. **Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs.
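A rough sketch of how the status-to-exception mapping and the skip-on-404 behaviour fit together is shown below. The exception names come from the table above; `classify_status`, `scan_user_onedrive`, the `is_delta` flag, and the simplified constructors are assumptions made for illustration and do not mirror the real `_get()` signature.

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365PermissionError(M365Error):
    """403 Forbidden."""


class M365DeltaTokenExpired(M365Error):
    """410 Gone on a delta endpoint."""


class M365DriveNotFound(M365Error):
    """404 Not Found for a drive or root path."""


def classify_status(status_code: int, path: str, is_delta: bool = False) -> None:
    """Raise the specific exception for a Graph status code (hypothetical helper)."""
    if status_code == 403:
        raise M365PermissionError(f"403 Forbidden: {path}")
    if status_code == 410 and is_delta:
        raise M365DeltaTokenExpired(f"410 Gone: {path}")
    if status_code == 404:
        # Raise instead of returning a fallback value, so path bugs stay visible.
        raise M365DriveNotFound(f"404 Not Found: {path}")


def scan_user_onedrive(user_id: str, fetch_drive) -> str:
    """Skip users without a provisioned drive; let other errors propagate."""
    try:
        fetch_drive(user_id)
    except M365DriveNotFound:
        return "skipped"
    return "scanned"
```

Catching only `M365DriveNotFound` at the call site keeps the skip narrow: a 403 or an unexpected failure still reaches the generic error path.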
## Export — routes/export.py
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items.
- **Excel Summary sheet** — shows all scanned sources (even with 0 items). Per-source tabs only created for sources with items.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function; GPS sheet, External transfers sheet, and Art.30 tables all see only the filtered subset. Filenames get `_elever` / `_ansatte` suffix.
- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, logs a `"redacted"` disposition. Source types: `local` (DOCX/XLSX/CSV/TXT/PDF, written via temp+move), `onedrive`/`sharepoint`/`teams` (Graph download → redact → PUT, requires `Files.ReadWrite.All`), `gdrive` (Drive API, requires `drive` scope), `sftp` (paramiko read/write, item must still be in `state.flagged_items`), `smb` (smbprotocol `FILE_SUPERSEDE`). **Keep `_redactExts`/`_cloudRedactExts` in `results.js` and `_REDACT_EXTS`/`_GDRIVE_MIME_MAP`/`_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
- **PDF redaction** — `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` (physical removal). Falls back to reportlab overlay if PyMuPDF absent. Text pages use `find_cpr_char_bboxes`; scanned pages use OCR at 200 DPI + `find_cpr_image_bboxes`.
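For plain-text formats, the masking step mentioned for `POST /api/redact_item` might look roughly like the sketch below. The regular expression is a deliberately simplified assumption (six digits, optional hyphen, four digits); the real detection lives in `cpr_detector` and is likely stricter. `redact_cpr` is a hypothetical name.

```python
import re

# Simplified CPR shape for illustration only: DDMMYY, optional hyphen, 4 digits.
_CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")

# Mask string as quoted in the notes above.
_MASK = "██████-████"


def redact_cpr(text: str) -> str:
    """Replace every CPR-shaped token in `text` with the block mask."""
    return _CPR_RE.sub(_MASK, text)
```

This sketch does not validate dates or checksums, so it would also mask unrelated ten-digit numbers; it only shows the replace-in-place idea.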
## Preview — routes/database.py
`GET /api/preview/<item_id>?source_type=…&account_id=…` dispatches by `source_type`:
- **`local` / `smb`** — re-reads from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline.
- **`email`** — fetches M365 message body via Graph (requires `state.connector`).
- **`gmail`** — shows info card with "Open in Gmail" link (X-Frame-Options blocks embedding).
- **`gdrive`** — returns `https://drive.google.com/file/d/{id}/preview` iframe.
- **All other values** (M365 files) — calls Graph `/preview` POST; tries `drive_id`-based path first, then user-drive, then `/me/drive`.
**`_source_type` must be set in `google_scan.py`** — Gmail items need `meta["_source_type"] = "gmail"` and Drive items `"gdrive"` before `_broadcast_card`. Without it, cards fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
**`state.connector` guard** — only the `email` and M365 `else` branches require M365 auth. The `local`/`smb`/`gmail`/`gdrive` branches must not gate on `state.connector` — they work in Google-only deployments.
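The dispatch and the connector guard described above can be summarised in a small sketch. The backend labels and the helper names (`preview_backend`, `requires_connector`, `gdrive_preview_url`) are hypothetical; only the `source_type` values and the Drive preview URL pattern are taken from the notes.

```python
def preview_backend(source_type: str) -> str:
    """Map a source_type to the kind of preview that is produced."""
    if source_type in ("local", "smb"):
        return "disk"             # re-read from disk, render inline
    if source_type == "email":
        return "graph_message"    # M365 message body via Graph
    if source_type == "gmail":
        return "gmail_info_card"  # info card with an "Open in Gmail" link
    if source_type == "gdrive":
        return "gdrive_iframe"    # Drive preview iframe
    return "graph_preview"        # every other value: M365 files


# Only these two branches need an authenticated M365 connector.
_NEEDS_M365 = {"graph_message", "graph_preview"}


def requires_connector(source_type: str) -> bool:
    return preview_backend(source_type) in _NEEDS_M365


def gdrive_preview_url(file_id: str) -> str:
    return f"https://drive.google.com/file/d/{file_id}/preview"
```

The fall-through branch is why a Gmail or Drive item without `_source_type` ends up in the Graph path: any value the dispatcher does not recognise is treated as an M365 file.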
## Compliance audit log — gdpr_db.py + routes/
- **`audit_log` table** — created by `_DDL` (`CREATE TABLE IF NOT EXISTS`), auto-appears on next server start. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- **`log_audit_event(action, detail, actor, ip)`** — module-level helper; silently no-ops on any exception. Import: `from gdpr_db import log_audit_event as _audit`.
- **`GET /api/audit_log?limit=200&action=<filter>`** — in `routes/app_routes.py`. No auth gate.
- **Recorded events** — `profile_save/delete`, `token_create/revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear`, `source_add/update/delete`, `scheduler_job_save/delete`, `scan_start/stop`, `smtp_save`, `disposition`, `disposition_bulk`, `admin_pin_set/change`, `item_delete`, `item_redact`, `app_update`.
- **`actor` always empty** — no per-user login; field reserved for future use.
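A minimal sketch of the table and the never-raise helper, using the standard `sqlite3` module, is given below. The column names follow the schema listed above; the column types, the explicit `conn` parameter, and the `open_db` and `recent_events` helpers are assumptions for illustration (the real helper takes only `action, detail, actor, ip`).

```python
import sqlite3
import time

_DDL = """
CREATE TABLE IF NOT EXISTS audit_log (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    ts     REAL NOT NULL,          -- Unix timestamp as float
    action TEXT NOT NULL,
    actor  TEXT DEFAULT '',
    detail TEXT DEFAULT '',
    ip     TEXT DEFAULT ''
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(_DDL)  # idempotent thanks to IF NOT EXISTS
    return conn


def log_audit_event(conn, action, detail="", actor="", ip="") -> None:
    """Record an event; swallow any failure so auditing never breaks a request."""
    try:
        conn.execute(
            "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?, ?, ?, ?, ?)",
            (time.time(), action, actor, detail, ip),
        )
        conn.commit()
    except Exception:
        pass


def recent_events(conn, limit: int = 200, action: str = "") -> list:
    """Newest first, optionally filtered by action (mirrors ?limit=&action=)."""
    sql = "SELECT action, detail, ip FROM audit_log"
    params: list = []
    if action:
        sql += " WHERE action = ?"
        params.append(action)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

Swallowing every exception trades completeness of the log for availability, so a locked or missing database cannot fail the action being audited.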
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — Graph `sendMail` returns HTTP 202 with no body; `r.json()` on empty raises `JSONDecodeError`. Do not revert to unconditional `r.json()`.
- **Graph preferred over SMTP** — `smtp_test` and `send_report` try `_send_email_graph()` first; fall back to SMTP only if Graph raises. If Graph fails and no SMTP host saved, the Graph exception surfaces directly.
- **Auto-email after manual scan** — `_maybe_send_auto_email()` in `routes/scan.py` called from the `_run()` thread after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")`; no-ops if false, no flagged items, or no recipients.
- **Gmail vs Google Workspace** — auth error handlers check if SMTP username ends in `@gmail.com`/`@googlemail.com`; custom domains are treated as Google Workspace and error message points to the Workspace admin console.
- **Canonical SMTP config keys are `username` and `use_tls`** — all backend readers (`smtp_test`, `_send_report_email`, `_send_email_graph`) use these. The Settings → E-mailrapport tab (`scheduler.js`) historically saved `user`/`starttls`, which left `username` empty so `server.login()` was skipped and the server rejected the send. Frontend now sends the canonical keys, and `_load_smtp_config()` normalises legacy `user` → `username` / `starttls` → `use_tls` for already-saved configs. The send-report modal (`scan.js`) already used the canonical keys. Keep both UIs and the backend on `username`/`use_tls`.
- **Graph 202 ≠ delivered** — `_send_email_graph` returns on Graph's HTTP 202 (queued), and `smtp_test`/`send_report` treat that as success and never fall back to SMTP. A recipient on a domain Exchange Online considers an accepted/internal domain (e.g. a Google-hosted subdomain of the O365 domain) is silently dropped after the 202. There is no in-app fix for that routing; reaching such recipients requires SMTP (e.g. Google Workspace `smtp.gmail.com`/`smtp-relay.gmail.com`) or fixing Exchange Accepted Domains.
- **`prefer_smtp` config flag** — when truthy, `smtp_test`, `send_report`, and `_maybe_send_auto_email` (routes/scan.py) skip the Graph path entirely and send via SMTP. This is the in-app escape hatch for the Graph-202 routing trap above. The gate is `... and not smtp_cfg.get("prefer_smtp")` on each Graph branch — keep all three in sync. UI: `#st-smtpPreferSmtp` toggle (key `m365_smtp_prefer_smtp`), saved/loaded by `scheduler.js`.
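The key normalisation and the `prefer_smtp` gate described above could be sketched as follows. The function names `normalise_smtp_config` and `choose_transport` are hypothetical, and whether the real `_load_smtp_config()` keeps or drops the legacy keys is not stated, so this sketch simply leaves them in place.

```python
def normalise_smtp_config(cfg: dict) -> dict:
    """Return a copy with canonical `username` / `use_tls` keys filled in."""
    out = dict(cfg)
    # Legacy `user` is used only when no canonical username is present.
    if not out.get("username") and out.get("user"):
        out["username"] = out["user"]
    # Legacy `starttls` is used only when `use_tls` was never saved.
    if "use_tls" not in out and "starttls" in out:
        out["use_tls"] = bool(out["starttls"])
    return out


def choose_transport(smtp_cfg: dict, graph_available: bool) -> str:
    """First transport to try: Graph unless unavailable or prefer_smtp is set."""
    if graph_available and not smtp_cfg.get("prefer_smtp"):
        return "graph"
    return "smtp"
```

Only the initial choice is modelled here; the fallback to SMTP when a Graph call raises would sit around the actual send.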
## Scheduler — scan_scheduler.py + routes/scheduler.py
- **Job config keys** — `id`, `name`, `enabled`, `frequency` (daily/weekly/monthly), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`.
- **`_execute_scan(job_id)`** — acquires per-job lock (`_running_jobs` set), records DB run via `db.begin_schedule_run()`, runs M365 → file → Google pipeline, then emails and applies retention. DB run finalised in `finally`.
- **Report-only path** — when `report_only=True`, short-circuits before M365 auth check, populates `_m.flagged_items` from `db.get_session_items()` if empty, calls `_send_email_report()`. Does NOT acquire scan lock; fails with `RuntimeError("No scan results available")` if DB is also empty.
- **`_m.flagged_items` and `state.flagged_items` are the same object** — assigned at startup; in-place updates (`flagged_items[:] = ...`) propagate to both.
- **`scheduler_started` / `scheduler_done` SSE events** — separate from `scan_done` (M365). `scheduler_done` carries `flagged`, `scanned`, `emailed`, `job_name`.
- **Profile options merge into file sources** — scheduler unpacks `{**fs, **_fs_extra}` before calling `run_file_scan(fs)`. Do not pass `fs` directly — the file scan reads `source.get(...)` and silently falls back to defaults without the merge.
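
Two of the points above (the shared `flagged_items` list and the profile-options merge) are easy to get wrong, so here is a small self-contained sketch. The `run_file_scan` below is a hypothetical stand-in, and the option name `scan_photos` is invented for illustration; only the `{**fs, **_fs_extra}` merge and the `flagged_items[:] = ...` idiom come from the notes.

```python
# Two names bound to one list, as state.flagged_items and _m.flagged_items are.
flagged_items: list[dict] = []
alias = flagged_items

# Slice assignment mutates the shared list in place, so both names see it.
# A plain `flagged_items = [...]` would rebind one name and break the link.
flagged_items[:] = [{"id": "a1"}]


def run_file_scan(source: dict) -> dict:
    """Hypothetical stand-in that reads options with defaults via source.get()."""
    return {
        "path": source.get("path", ""),
        "scan_photos": source.get("scan_photos", False),  # invented option name
    }


fs = {"path": "/srv/share"}          # file source as stored
_fs_extra = {"scan_photos": True}    # options taken from the profile

merged = run_file_scan({**fs, **_fs_extra})  # profile options are honoured
unmerged = run_file_scan(fs)                 # silently falls back to defaults
```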
## Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered NER replacing spaCy. Activated via `config.json` keys `claude_ner` (bool) and `claude_api_key` (str, **Fernet-encrypted at rest** with an `enc:` prefix — same scheme as the SMTP password).
- **`ANTHROPIC_OK`** — module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths.
- **`_ner_claude(text, api_key)`** — calls `claude-haiku-4-5-20251001` in 8 000-char chunks. Thread-safe cache keyed by `hash(text)`, evicts oldest when > 2 000 entries.
- **Always read the key via `app_config.get_claude_api_key()`** — it decrypts and transparently handles legacy plaintext. Never read `config.json["claude_api_key"]` directly; `save_claude_config()` writes it encrypted.
- **`GET/POST /api/settings/claude`** — GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes key). POST accepts `{"enabled": bool, "api_key": "..."}` — omitting `api_key` leaves stored key unchanged.
- **`POST /api/settings/claude/test`** — minimal 8-token API call; returns `{"ok": true}` or `{"ok": false, "error": "..."}`.
- **Do not import `anthropic` at module level outside `document_scanner.py`** — `routes/app_routes.py` imports it locally inside the function body so the server starts without the package.
## Software update — routes/updates.py
- **Git-checkout only** — `_supported()` requires a `.git` dir and not `sys.frozen`. The frozen desktop build gets `{"supported": false}` and the UI hides the Settings group.
- **`POST /api/update/apply`** — stash-if-dirty → `merge --ff-only origin/<branch>` → pip install only if `requirements.txt` changed → audit `app_update` → `_schedule_restart()` re-execs the process via `os.execv` (same PID; works under systemd and `start_gdpr.sh`). Refuses with `code: "scan_running"` (409) while `state._scan_lock` or `state._google_scan_lock` is held.
- **`apply_update()` never restarts itself** — callers decide. Tests patch `_schedule_restart`; the auto-update thread calls `_restart_self()` directly.
- **Auto-update thread** — `start_auto_update_thread()` called from `gdpr_scanner.py` `__main__`. Hourly tick, applies at most once per 24 h when `config.json["auto_update"]` is true; skips (and retries next tick) while a scan runs.
- **`update_gdpr.sh`** — standalone CLI/cron equivalent of the same logic; keep stash/ff-only/requirements behaviour in sync.
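
The two gating decisions in this section (whether updating is supported at all, and whether an hourly tick should apply an update) can be sketched as pure functions. The names `update_supported` and `should_auto_apply` and their parameters are hypothetical; the real logic lives in `routes/updates.py` and may track the last-applied time differently.

```python
import sys
import time
from pathlib import Path
from typing import Optional

DAY_SECONDS = 24 * 60 * 60


def update_supported(repo_root: Path) -> bool:
    """True only for a git checkout that is not a frozen (packaged) build."""
    is_frozen = bool(getattr(sys, "frozen", False))
    return (repo_root / ".git").is_dir() and not is_frozen


def should_auto_apply(
    auto_update: bool,
    scan_running: bool,
    last_applied: Optional[float],
    now: Optional[float] = None,
) -> bool:
    """Decide on one hourly tick whether to apply an update now.

    Returns False while a scan runs, so the next tick simply tries again.
    """
    if not auto_update or scan_running:
        return False
    now = time.time() if now is None else now
    # Apply at most once per 24 hours.
    return last_applied is None or now - last_applied >= DAY_SECONDS
```

Keeping the decision free of side effects makes it straightforward to test without patching `_schedule_restart`.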
## Viewer mode — routes/viewer.py
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form → 403. Never skip this order.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}`, `{"user": [...], "display_name": "..."}`, or `{}` in `viewer_tokens.json`. Enforced server-side in `GET /api/db/flagged`. **Column name is `user_role`** — do not use `role`.
- **`session["viewer_scope"]`** — set at `/view` token validation. `GET /api/db/flagged` reads `session.get("viewer_scope", {})` — defaults to `{}` (unrestricted) for PIN-authenticated sessions.
- **`viewer_tokens.json` format** — `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Old bare-list format handled transparently. Do not write as bare list.
- **Rate-limit state** (`_pin_attempts` dict) — in-memory only, resets on server restart. Intentional.
- **User-scoped tokens** — `scope.user` always a list; legacy single-string coerced on read. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400.
- **Date-range scoping** — `valid_from`/`valid_to` (YYYY-MM-DD) in scope dict; filtered via lexicographic string comparison in `GET /api/db/flagged`. Server validates format and enforces `valid_from ≤ valid_to`.
- **`app.secret_key`** — derived from `machine_id` bytes so sessions survive restarts. Set once at startup; do not override.
- **Flask binds to `0.0.0.0`** — `gdpr_scanner.py`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs intentionally keep `127.0.0.1`.
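
The server-side scope enforcement described above (role, user list, and date range) can be condensed into a single predicate. This sketch mirrors the filtering rules in these notes, but `row_in_scope` and `_scope_users` are hypothetical names, not functions in `routes/viewer.py`.

```python
from typing import Any


def _scope_users(scope: dict[str, Any]) -> set[str]:
    """Return the lower-cased user filter; a legacy single string is coerced."""
    raw = scope.get("user", "")
    if isinstance(raw, list):
        return {e.lower() for e in raw if e}
    return {raw.lower()} if raw else set()


def row_in_scope(row: dict[str, Any], scope: dict[str, Any]) -> bool:
    """True if a flagged-item row is visible under the given viewer scope."""
    role = scope.get("role", "")
    users = _scope_users(scope)
    date_from = scope.get("valid_from", "")
    date_to = scope.get("valid_to", "")
    modified = row.get("modified") or ""

    # The column is user_role, not role.
    if role and row.get("user_role", "") != role:
        return False
    # File-scan rows have an empty account_id, so they never match a user scope.
    if users and (row.get("account_id") or "").lower() not in users:
        return False
    # YYYY-MM-DD strings sort chronologically, so plain string comparison works.
    if date_from and modified < date_from:
        return False
    if date_to and modified > date_to:
        return False
    return True
```

An empty scope (`{}`) passes every row, which matches the unrestricted default for PIN-authenticated sessions. One consequence of the string comparison is worth knowing: if `modified` carried a time component, a value such as `2024-05-01T10:00:00Z` would compare greater than `valid_to = "2024-05-01"` and be excluded.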
## Gotchas
- **`_load_settings()` return** — does NOT include `file_sources`. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to.

View File

@ -72,50 +72,6 @@ def get_lang_json():
return jsonify(state.LANG)
@bp.route("/api/audit_log")
def audit_log_list():
"""Return recent compliance audit log entries."""
try:
from gdpr_db import get_db as _get_db
limit = min(int(request.args.get("limit", 200)), 1000)
action = request.args.get("action") or None
return jsonify(_get_db().get_audit_log(limit=limit, action=action))
except Exception as e:
return jsonify({"error": str(e)}), 500
@bp.route("/api/settings/claude", methods=["GET", "POST"])
def claude_settings():
from app_config import get_claude_config, save_claude_config
if request.method == "GET":
return jsonify(get_claude_config())
data = request.get_json(silent=True) or {}
api_key = data.get("api_key") # None = keep existing key
if api_key == "":
api_key = None # empty string = don't change
save_claude_config(bool(data.get("enabled", False)), api_key)
return jsonify({"ok": True})
@bp.route("/api/settings/claude/test", methods=["POST"])
def claude_test():
from app_config import get_claude_api_key
api_key = get_claude_api_key()
if not api_key:
return jsonify({"ok": False, "error": "No API key saved"}), 400
try:
import anthropic
client = anthropic.Anthropic(api_key=api_key)
client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=8,
messages=[{"role": "user", "content": "Hi"}],
)
return jsonify({"ok": True})
except Exception as e:
return jsonify({"ok": False, "error": str(e)}), 400
@bp.route("/manual")
def manual():
"""Serve the user manual as a styled, printable HTML page.

View File

@ -11,12 +11,11 @@ from checkpoint import _clear_checkpoint, _DELTA_PATH
from cpr_detector import _extract_exif, _html_esc, _placeholder_svg
try:
from gdpr_db import get_db as _get_db, log_audit_event as _audit
from gdpr_db import get_db as _get_db
DB_OK = True
except ImportError:
DB_OK = False
def _get_db(*a, **kw): return None # type: ignore[misc]
def _audit(*a, **kw): pass # type: ignore[misc]
try:
import document_scanner as _ds # noqa: F401
@ -71,13 +70,6 @@ def db_scans():
return jsonify(_get_db().scans_list())
@bp.route("/api/db/sessions")
def db_sessions():
"""List scan sessions (grouped concurrent scans), newest first."""
if not DB_OK: return jsonify([])
return jsonify(_get_db().get_sessions())
@bp.route("/api/db/subject", methods=["POST"])
def db_subject_lookup():
"""Find all items containing a given CPR number.
@ -141,35 +133,9 @@ def db_set_disposition():
notes = data.get("notes", ""),
reviewed_by = data.get("reviewed_by", ""),
)
_audit("disposition",
f"item_id={item_id!r} status={data.get('status','')!r}",
ip=request.remote_addr or "")
return jsonify({"status": "saved"})
@bp.route("/api/db/disposition/bulk", methods=["POST"])
def db_set_disposition_bulk():
"""Set the same disposition on multiple items at once.
Body: {item_ids: [...], status, legal_basis?, notes?, reviewed_by?}
"""
if not DB_OK: return jsonify({"error": "database not available"}), 503
data = request.get_json() or {}
item_ids = data.get("item_ids", [])
status = data.get("status", "")
if not item_ids or not status:
return jsonify({"error": "item_ids and status required"}), 400
db = _get_db()
for iid in item_ids:
db.set_disposition(iid, status,
legal_basis=data.get("legal_basis", ""),
notes=data.get("notes", ""),
reviewed_by=data.get("reviewed_by", ""))
_audit("disposition_bulk",
f"count={len(item_ids)} status={status!r}",
ip=request.remote_addr or "")
return jsonify({"saved": len(item_ids)})
@bp.route("/api/db/disposition/<item_id>")
def db_get_disposition(item_id):
"""Get the current disposition for an item."""
@ -180,62 +146,15 @@ def db_get_disposition(item_id):
@bp.route("/api/db/flagged")
def db_flagged_items():
"""Return flagged items for the results grid.
With ?ref=N, returns the items from that specific past scan session (history
mode). Without ref, returns every item still awaiting action across all
scans (the default landing view) not just the latest session window.
"""Return flagged items from the most recent completed scan session.
Used by the read-only viewer to load results without an active SSE connection.
Respects viewer_scope.role stored in the session for scoped tokens.
"""
if not DB_OK: return jsonify([])
from flask import session as _session
scope = _session.get("viewer_scope", {})
role_filt = scope.get("role", "") if isinstance(scope, dict) else ""
date_from = scope.get("valid_from", "") if isinstance(scope, dict) else ""
date_to = scope.get("valid_to", "") if isinstance(scope, dict) else ""
# user may be a list of emails (current) or a legacy single string
raw_user = scope.get("user", "") if isinstance(scope, dict) else ""
if isinstance(raw_user, list):
user_filt = set(e.lower() for e in raw_user if e)
else:
user_filt = {raw_user.lower()} if raw_user else set()
ref_scan_id = request.args.get("ref", type=int)
if ref_scan_id:
# History mode — a specific past session was requested.
items = _get_db().get_session_items(ref_scan_id=ref_scan_id)
else:
# Default landing / viewer — show every item still awaiting action,
# across all scans, not just the latest session window.
items = _get_db().get_open_items()
items = _get_db().get_session_items()
# Normalise JSON-encoded columns the same way scan_engine does for SSE cards
import json as _json
out = []
for row in items:
if role_filt and row.get("user_role", "") != role_filt:
continue
if user_filt and (row.get("account_id", "") or "").lower() not in user_filt:
continue
if date_from and (row.get("modified") or "") < date_from:
continue
if date_to and (row.get("modified") or "") > date_to:
continue
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None)
out.append(row)
return jsonify(out)
@bp.route("/api/db/related/<item_id>")
def db_related_items(item_id):
"""Return flagged items from the same session sharing at least one CPR hash."""
if not DB_OK:
return jsonify([])
ref = request.args.get("ref", type=int)
import json as _json
out = []
for row in _get_db().get_related_items(item_id, ref_scan_id=ref):
row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
row.pop("exif_json", None)
@ -298,13 +217,10 @@ def admin_pin_set():
new_pin = data.get("new_pin", "").strip()
if not new_pin:
return jsonify({"error": "new_pin required"}), 400
had_pin = _admin_pin_is_set()
if had_pin:
if _admin_pin_is_set():
if not _verify_admin_pin(data.get("current_pin", "")):
return jsonify({"error": "incorrect_pin"}), 403
_set_admin_pin(new_pin)
_audit("admin_pin_change" if had_pin else "admin_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True})
@ -370,29 +286,6 @@ def db_import():
return jsonify({"error": str(e)}), 500
def _excerpt_page(excerpt: str, item_meta: dict) -> str:
"""Minimal HTML page showing a stored body excerpt as a preview fallback."""
import html as _html
subject = _html.escape(item_meta.get("name", ""))
modified = item_meta.get("modified", "")
account = _html.escape(item_meta.get("account_name", ""))
body = "<pre style='white-space:pre-wrap;font-family:sans-serif;margin:0'>" + _html.escape(excerpt) + "</pre>"
note = "<p style='font-size:11px;color:#888;margin-top:12px'>Stored excerpt — connect to reload the full message.</p>"
return (
"<!DOCTYPE html><html><head><meta charset='utf-8'>"
"<style>body{font-family:-apple-system,sans-serif;font-size:13px;"
"padding:12px 16px;background:#fff;color:#111;word-break:break-word}"
".hdr{border-bottom:1px solid #eee;margin-bottom:12px;padding-bottom:10px}"
".hdr-row{color:#555;font-size:12px;margin-bottom:3px}"
".hdr-row b{color:#111}</style></head><body>"
f"<div class='hdr'>"
+ (f"<div class='hdr-row'><b>From:</b> {account}</div>" if account else "")
+ (f"<div class='hdr-row'><b>Date:</b> {_html.escape(modified)}</div>" if modified else "")
+ (f"<div class='hdr-row'><b>Subject:</b> {subject}</div>" if subject else "")
+ f"</div>{body}{note}</body></html>"
)
@bp.route("/api/preview/<item_id>")
def get_preview(item_id):
"""Return a preview URL or HTML for a flagged item."""
@ -585,17 +478,14 @@ def get_preview(item_id):
except Exception as e:
return jsonify({"error": str(e)})
if not state.connector:
return jsonify({"error": "not authenticated"}), 401
item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), {})
drive_id = item_meta.get("drive_id", "")
try:
if source_type == "email":
excerpt = item_meta.get("body_excerpt", "")
if not state.connector:
if excerpt:
import html as _html
return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
return jsonify({"error": "not authenticated"}), 401
uid = account_id
try:
msg = state.connector._get(
@ -603,8 +493,6 @@ def get_preview(item_id):
{"$select": "subject,from,receivedDateTime,body"}
)
except Exception as e:
if excerpt:
return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
return jsonify({"error": f"Could not load email: {e}"})
sender = msg.get("from", {}).get("emailAddress", {})
@ -662,51 +550,8 @@ def get_preview(item_id):
</body></html>"""
return jsonify({"type": "html", "html": page})
elif source_type in ("gmail", "gdrive"):
item_url = item_meta.get("url", "")
name = item_meta.get("name", "")
if source_type == "gdrive" and item_url:
# Extract Drive file ID and use the embeddable /preview URL
import re as _re
m = _re.search(r"/file/d/([^/]+)", item_url)
if m:
fid = m.group(1)
return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"})
# Fallback: generic Drive embed
return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")})
# Gmail — not embeddable; show link card + stored body excerpt if available
icon = "✉️" if source_type == "gmail" else "☁️"
label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
excerpt = item_meta.get("body_excerpt", "")
link_html = (
f'<a href="{_html_esc(item_url)}" target="_blank" '
f'style="display:inline-block;margin-top:12px;padding:8px 16px;'
f'background:#3b7dd8;color:#fff;border-radius:6px;text-decoration:none;font-size:12px">'
f'{label}</a>'
) if item_url else ""
if excerpt and source_type == "gmail":
html_out = _excerpt_page(excerpt, item_meta)
if item_url:
# Inject the "Open in Gmail" link before </body>
html_out = html_out.replace(
"</body>",
f'<div style="margin-top:12px">{link_html}</div></body>'
)
else:
html_out = (
f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
f'<div style="font-size:40px">{icon}</div>'
f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
f'{link_html}'
f'</div>'
)
return jsonify({"type": "html", "html": html_out})
else:
# OneDrive / SharePoint / Teams — use Graph's embed preview API
if not state.connector:
return jsonify({"error": "not authenticated"}), 401
preview_url = None
errors = []

View File

@ -5,10 +5,6 @@ from __future__ import annotations
from flask import Blueprint, jsonify, request
from routes import state
from app_config import _load_smtp_config, _save_smtp_config
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
from routes.export import _build_excel_bytes
bp = Blueprint("email", __name__)
@ -123,7 +119,6 @@ def smtp_config_save():
if not data.get("password") and existing.get("password"):
data["password"] = existing["password"]
_save_smtp_config(data)
_audit("smtp_save", f"host={data.get('host','')!r}", ip=request.remote_addr or "")
return jsonify({"status": "saved"})
@ -148,15 +143,12 @@ def smtp_test():
"</body></html>"
)
# Try Graph API first — unless the user opted to always use SMTP. Graph
# returns 202 (queued) even for recipients Exchange later silently drops
# (e.g. a Google-hosted subdomain of the O365 domain), so SMTP is the only
# reliable path for those; prefer_smtp forces it.
prefer_smtp = bool(saved.get("prefer_smtp"))
if state.connector and state.connector.is_authenticated() and not prefer_smtp:
# Try Graph API first
if state.connector and state.connector.is_authenticated():
try:
_send_email_graph(subject, body_html, recipients)
return jsonify({"ok": True, "method": "graph", "recipients": recipients})
return jsonify({"ok": True,
"message": f"Test email sent via Microsoft Graph to {', '.join(recipients)}"})
except Exception as graph_err:
graph_error_str = str(graph_err)
else:
@ -172,12 +164,6 @@ def smtp_test():
use_tls = bool(saved.get("use_tls", True)) and not use_ssl
if not host:
if graph_error_str:
return jsonify({"error": (
f"Microsoft Graph email failed: {graph_error_str}\n\n"
"Make sure Mail.Send is added to your Azure app registration and admin consent has been granted:\n"
"Azure AD → App registrations → [your app] → API permissions → Add → Microsoft Graph → Mail.Send → Grant admin consent."
)}), 400
return jsonify({"error": "No SMTP host configured. To send via Microsoft 365 Graph (no SMTP needed), add Mail.Send to your Azure app registration."}), 400
try:
@ -201,8 +187,8 @@ def smtp_test():
if username and password:
server.login(username, password)
server.sendmail(from_addr, recipients, msg.as_string())
return jsonify({"ok": True, "method": "smtp", "recipients": recipients,
"graph_also_failed": bool(graph_error_str)})
suffix = " (⚠ Graph also failed — Mail.Send permission not granted)" if graph_error_str else ""
return jsonify({"ok": True, "message": f"Test email sent via SMTP to {', '.join(recipients)}{suffix}"})
except Exception as smtp_err:
err_str = str(smtp_err)
_h = host.lower()
@ -224,33 +210,11 @@ def smtp_test():
"(Users → Active users → [user] → Mail → Manage email apps → Authenticated SMTP), "
"or add Mail.Send to your Azure app to use Graph instead.")
elif (_personal_ms or _gmail_host) and _auth_err:
if _gmail_host:
_gws_account = "@gmail.com" not in username.lower() and "@googlemail.com" not in username.lower()
if _gws_account:
err_str = ("Google Workspace SMTP authentication failed.\n\n"
"Your account uses a custom domain via Google Workspace. "
"SMTP access is controlled by your organisation's Google Workspace admin, not your personal account settings.\n\n"
"Ask your Google Workspace admin to:\n"
" • Enable 2-Step Verification for your account (required for App Passwords)\n"
" • Allow users to manage their own App Passwords (Admin console → Security → 2-Step Verification)\n"
" • Or configure SMTP relay: Admin console → Apps → Google Workspace → Gmail → Routing → SMTP relay service\n\n"
"If App Passwords are available for your account, generate one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords "
"and use it instead of your normal password.")
else:
err_str = ("Gmail SMTP authentication failed.\n\n"
"Google requires an App Password for SMTP — your normal password will not work.\n\n"
"If you are already using an App Password, check:\n"
" • No spaces — the 16-character code must be entered without spaces\n"
" • The App Password has not been revoked — generate a new one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords\n"
" • The correct username (your full Gmail address, e.g. you@gmail.com)\n"
" • Port 587 with STARTTLS, or port 465 with SSL")
else:
url = "account.microsoft.com/security"
err_str = (f"Authentication failed — Microsoft blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url} → App passwords "
f"and use that instead of your normal password.")
provider = "Microsoft" if _personal_ms else "Google"
url = "account.microsoft.com/security" if _personal_ms else "myaccount.google.com → Security → 2-Step Verification"
err_str = (f"Authentication failed — {provider} blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url} → App passwords "
f"and use that instead of your normal password.")
elif graph_error_str:
err_str = f"SMTP: {err_str} | Graph also unavailable (Mail.Send not granted)"
return jsonify({"error": err_str}), 200
@ -289,8 +253,8 @@ def send_report():
"</body></html>"
)
# Try Graph API first — unless prefer_smtp is set (see smtp_test for why).
if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
# Try Graph API first
if state.connector and state.connector.is_authenticated():
try:
_send_email_graph(subject, body_html, recipients,
attachment_bytes=xl_bytes, attachment_name=fname)
@ -331,32 +295,9 @@ def send_report():
err = (f"{err}\n\nTip: Enable SMTP AUTH for this mailbox in the Microsoft 365 admin centre, "
"or connect to M365 first so the scanner can send via Microsoft Graph instead.")
elif (_personal_ms_2 or _gmail_2) and _auth_err_2:
if _gmail_2:
_uname2 = smtp_cfg.get("username", "").lower()
_gws2 = "@gmail.com" not in _uname2 and "@googlemail.com" not in _uname2
if _gws2:
err = ("Google Workspace SMTP authentication failed.\n\n"
"Your account uses a custom domain via Google Workspace. "
"SMTP access is controlled by your organisation's Google Workspace admin, not your personal account settings.\n\n"
"Ask your Google Workspace admin to:\n"
" • Enable 2-Step Verification for your account (required for App Passwords)\n"
" • Allow users to manage their own App Passwords (Admin console → Security → 2-Step Verification)\n"
" • Or configure SMTP relay: Admin console → Apps → Google Workspace → Gmail → Routing → SMTP relay service\n\n"
"If App Passwords are available for your account, generate one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords "
"and use it instead of your normal password.")
else:
err = ("Gmail SMTP authentication failed.\n\n"
"Google requires an App Password for SMTP — your normal password will not work.\n\n"
"If you are already using an App Password, check:\n"
" • No spaces — the 16-character code must be entered without spaces\n"
" • The App Password has not been revoked — generate a new one at "
"myaccount.google.com → Security → 2-Step Verification → App passwords\n"
" • The correct username (your full Gmail address, e.g. you@gmail.com)\n"
" • Port 587 with STARTTLS, or port 465 with SSL")
else:
url2 = "account.microsoft.com/security"
err = (f"Authentication failed — Microsoft blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url2} → App passwords "
f"and use that instead of your normal password.")
provider2 = "Microsoft" if _personal_ms_2 else "Google"
url2 = "account.microsoft.com/security" if _personal_ms_2 else "myaccount.google.com → Security → 2-Step Verification"
err = (f"Authentication failed — {provider2} blocks regular passwords for SMTP when MFA is enabled.\n\n"
f"Fix: create an App Password at {url2} → App passwords "
f"and use that instead of your normal password.")
return jsonify({"error": err}), 500

View File

@ -9,12 +9,11 @@ from routes import state
from app_config import _GUID_RE, _resolve_display_name
try:
from gdpr_db import get_db as _get_db, log_audit_event as _audit
from gdpr_db import get_db as _get_db
DB_OK = True
except ImportError:
DB_OK = False
def _get_db(*a, **kw): return None # type: ignore[misc]
def _audit(*a, **kw): pass # type: ignore[misc]
try:
from m365_connector import M365PermissionError
@ -25,10 +24,9 @@ bp = Blueprint("export", __name__)
logger = logging.getLogger(__name__)
def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
def _build_excel_bytes() -> tuple[bytes, str]:
"""Build the M365 scan Excel workbook and return (bytes, filename).
Raises on error. Used by export_excel() and send_report().
role: '' = all, 'student' = students only, 'staff' = staff + other."""
Raises on error. Used by export_excel() and send_report()."""
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
@ -45,7 +43,6 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
"gdrive": ("💾 Google Drive", "D5F5E3"),
"local": ("📁 Local", "E6F7E6"),
"smb": ("🌐 Network", "E0F0FA"),
"sftp": ("🔒 SFTP", "EDE9F7"),
}
COLS = [
("Name / Subject", 45),
@ -134,20 +131,11 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
ws.auto_filter.ref = f"A1:{get_column_letter(len(COLS))}1"
# Apply role filter — '' means all roles
if role == "student":
_items = [i for i in state.flagged_items if i.get("user_role") == "student"]
elif role == "staff":
_items = [i for i in state.flagged_items if i.get("user_role") != "student"]
else:
_items = list(state.flagged_items)
wb = Workbook()
ws_sum = wb.active
ws_sum.title = "Summary"
ws_sum.sheet_properties.tabColor = "1F3864"
_role_label = {"student": " — Elever", "staff": " — Ansatte"}.get(role, "")
ws_sum["A1"] = f"GDPRScanner — Export{_role_label}"
ws_sum["A1"] = "GDPRScanner — Export"
ws_sum["A1"].font = Font(name="Arial", bold=True, size=14, color=HEADER_FG)
ws_sum["A1"].fill = _fill(HEADER_BG)
ws_sum.merge_cells("A1:D1")
@ -158,8 +146,8 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
ws_sum["A2"] = "Generated:"
ws_sum["B2"] = _dt.datetime.now().strftime("%Y-%m-%d %H:%M")
ws_sum["A3"] = "Total flagged items:"
ws_sum["B3"] = len(_items)
gps_count = sum(1 for i in _items if (i.get("exif") or {}).get("gps"))
ws_sum["B3"] = len(state.flagged_items)
gps_count = sum(1 for i in state.flagged_items if (i.get("exif") or {}).get("gps"))
if gps_count:
ws_sum["A4"] = "Items with GPS data:"
ws_sum["B4"] = gps_count
@ -180,26 +168,14 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
ws_sum.column_dimensions["C"].width = 16
by_source: dict = {}
for item in _items:
for item in state.flagged_items:
by_source.setdefault(item.get("source_type", "other"), []).append(item)
# Determine which sources were actually scanned (even if they found nothing)
scanned_sources: set = set()
if DB_OK:
try:
_db_tmp = _get_db()
if _db_tmp:
scanned_sources = _db_tmp.get_session_sources()
except Exception:
pass
# Fall back: treat any source that has items as scanned
scanned_sources |= set(by_source.keys())
sum_row = 7
for src_key, (label, tab_bg) in SOURCE_MAP.items():
if src_key not in scanned_sources:
continue
items = by_source.get(src_key, [])
if not items:
continue
ws_sum.cell(row=sum_row, column=1, value=label).font = Font(name="Arial", size=10)
ws_sum.cell(row=sum_row, column=2, value=len(items)).font = Font(name="Arial", size=10)
ws_sum.cell(row=sum_row, column=3, value=sum(i.get("cpr_count", 0) for i in items)).font = Font(name="Arial", size=10)
@ -216,7 +192,7 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
_write_sheet(wb.create_sheet(title=clean_label), items, tab_bg)
# GPS items sheet
gps_items = [i for i in _items if (i.get("exif") or {}).get("gps")]
gps_items = [i for i in state.flagged_items if (i.get("exif") or {}).get("gps")]
if gps_items:
ws_gps = wb.create_sheet(title="GPS locations")
ws_gps.sheet_properties.tabColor = "1A7A6E"
@ -254,7 +230,7 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
ws_gps.auto_filter.ref = f"A1:{get_column_letter(len(GPS_COLS))}1"
# External transfers sheet
ext_items = [i for i in _items
ext_items = [i for i in state.flagged_items
if i.get("transfer_risk") in ("external-recipient", "external-share", "shared")]
if ext_items:
ws_ext = wb.create_sheet(title="External transfers")
@ -270,11 +246,8 @@ def _build_excel_bytes(role: str = "") -> tuple[bytes, str]:
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)
_role_suffix = {"student": "_elever", "staff": "_ansatte"}.get(role, "")
fname = f"m365_scan{_role_suffix}_{_dt.datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
fname = f"m365_scan_{_dt.datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
return buf.read(), fname
@bp.route("/api/export_excel")
def export_excel():
"""Export flagged items as an Excel workbook with per-source tabs."""
@ -290,9 +263,8 @@ def export_excel():
state.flagged_items[:] = db_items
except Exception:
pass
role = request.args.get("role", "")
try:
xl_bytes, fname = _build_excel_bytes(role=role)
xl_bytes, fname = _build_excel_bytes()
return Response(
xl_bytes,
mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@ -308,10 +280,9 @@ def export_excel():
# ── Article 30 report ─────────────────────────────────────────────────────────
def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
def _build_article30_docx() -> tuple[bytes, str]:
"""Generate a GDPR Article 30 Register of Processing Activities as .docx.
Returns (bytes, filename). Strings are translated using the active state.LANG dict.
role: '' = all, 'student' = students only, 'staff' = staff + other."""
Returns (bytes, filename). Strings are translated using the active state.LANG dict."""
try:
from docx import Document as _Document
from docx.shared import Pt, RGBColor, Inches, Cm
@ -331,10 +302,6 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
db = _get_db() if DB_OK else None
stats = db.get_stats() if db else {}
items = db.get_session_items() if db else list(state.flagged_items)
if role == "student":
items = [i for i in items if i.get("user_role") == "student"]
elif role == "staff":
items = [i for i in items if i.get("user_role") != "student"]
trend = db.get_trend(10) if db else []
overdue = db.get_overdue_items(5) if db else []
@ -378,8 +345,7 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
now_str = _dt.datetime.now().strftime("%Y-%m-%d %H:%M")
date_str = _dt.datetime.now().strftime("%Y-%m-%d")
_role_suffix = {"student": "_elever", "staff": "_ansatte"}.get(role, "")
fname = f"article30{_role_suffix}_{date_str}.docx"
fname = f"article30_{date_str}.docx"
# Aggregate by source
by_source: dict = {}
@ -387,15 +353,6 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
st = item.get("source_type", "other")
by_source.setdefault(st, []).append(item)
# Determine which sources were actually scanned (may be empty-hit)
scanned_sources: set = set()
if db:
try:
scanned_sources = db.get_session_sources()
except Exception:
pass
scanned_sources |= set(by_source.keys())
SOURCE_LABELS = {
"email": "Exchange (Outlook)",
"onedrive": "OneDrive",
@ -405,7 +362,6 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
"gdrive": "Google Drive",
"local": "Local files",
"smb": "Network / SMB",
"sftp": "SFTP",
}
# ── Colour palette ────────────────────────────────────────────────────────
@ -600,10 +556,10 @@ def _build_article30_docx(role: str = "") -> tuple[bytes, str]:
r = p.add_run(txt); r.bold = True
r.font.size = Pt(10); r.font.color.rgb = WHITE
for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb", "sftp"):
if src_key not in scanned_sources:
continue
for src_key in ("email", "onedrive", "sharepoint", "teams", "gmail", "gdrive", "local", "smb"):
src_items = by_source.get(src_key, [])
if not src_items:
continue
row = src_tbl.add_row().cells
n_ov = sum(1 for i in src_items if i.get("id") in overdue_ids)
n_cpr = sum(i.get("cpr_count", 0) for i in src_items)
@ -1144,8 +1100,7 @@ def export_article30():
if not state.flagged_items:
return jsonify({"error": "No results to export — run a scan first"}), 400
try:
role = request.args.get("role", "")
docx_bytes, fname = _build_article30_docx(role=role)
docx_bytes, fname = _build_article30_docx()
return Response(
docx_bytes,
mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -1159,7 +1114,6 @@ def export_article30():
return jsonify({"error": str(e)}), 500
@bp.route("/api/delete_item", methods=["POST"])
def delete_item():
"""Delete a single flagged item. Returns {ok, error}."""
if not state.connector:
@ -1192,9 +1146,6 @@ def delete_item():
reason="manual")
_db.delete_item_record(item_id)
except Exception: pass
_audit("item_delete",
f"id={item_id!r} name={item_meta.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True})
return jsonify({"ok": False, "error": "Delete returned unexpected result"})
except M365PermissionError:
@ -1205,502 +1156,6 @@ def delete_item():
return jsonify({"ok": False, "error": str(e)})
_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt", ".pdf"}
_M365_CLOUD_TYPES = {"onedrive", "sharepoint", "teams"}
_GDRIVE_MIME_MAP = {
".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
".pdf": "application/pdf",
}
_ALL_REDACTABLE_TYPES = {"local", "smb", "sftp", "gdrive"} | _M365_CLOUD_TYPES
@bp.route("/api/redact_item", methods=["POST"])
def redact_item():
"""Redact CPR numbers in-place in a local, SMB, SFTP, M365, or Google Drive file."""
from pathlib import Path as _Path
import tempfile as _tempfile
import shutil as _shutil
data = request.get_json() or {}
item_id = data.get("id", "")
if not item_id:
return jsonify({"ok": False, "error": "id required"}), 400
# Resolve item meta: in-memory first (active scan), then DB (history)
item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None)
if item_meta is None:
_db = _get_db() if DB_OK else None
if _db:
row = _db._connect().execute(
"SELECT * FROM flagged_items WHERE id=? LIMIT 1", (item_id,)
).fetchone()
item_meta = dict(row) if row else {}
else:
item_meta = {}
source_type = item_meta.get("source_type", "")
is_m365_cloud = source_type in _M365_CLOUD_TYPES
if source_type not in _ALL_REDACTABLE_TYPES:
return jsonify({"ok": False, "error": "Redaction is only supported for local, SMB, SFTP, M365, and Google Drive files"}), 400
# --- local path branch ---
if source_type == "local":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400
path = _Path(full_path).expanduser()
if not path.exists():
return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404
ext = path.suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
tmp_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp:
tmp_path = _Path(tmp.name)
if ext == ".docx":
results = scan_docx(path)
redacted = redact_docx(path, tmp_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(path)
redacted = redact_xlsx(path, tmp_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(path, tmp_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(path)
redacted = redact_pdf_secure(path, tmp_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
else: # .txt
text = path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
tmp_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_shutil.move(str(tmp_path), str(path))
tmp_path = None
except Exception as exc:
if tmp_path and tmp_path.exists():
try:
tmp_path.unlink()
except Exception:
pass
logger.exception("[redact] local file error")
return jsonify({"ok": False, "error": str(exc)}), 500
# --- M365 cloud branch (OneDrive / SharePoint / Teams) ---
elif is_m365_cloud:
conn = state.connector
if conn is None:
return jsonify({"ok": False, "error": "M365 not connected — cannot redact cloud files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _REDACT_EXTS - {".csv", ".txt"}:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} cloud files. Supported: DOCX, XLSX, PDF"}), 400
drive_id = item_meta.get("drive_id") or item_meta.get("_drive_id", "")
account_id = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = None
try:
# Download
if drive_id:
raw = conn.download_sharepoint_item(drive_id, item_id)
elif account_id and account_id != "me":
raw = conn.download_drive_item_for(account_id, item_id)
else:
raw = conn.download_drive_item(item_id)
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
# Upload redacted bytes back
redacted_bytes = out_path.read_bytes()
conn.put_drive_item_content(drive_id, item_id, redacted_bytes, user_id=account_id)
del redacted_bytes
except Exception as exc:
logger.exception("[redact] cloud file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for p in ("tmp_path", "out_path"):
_p = locals().get(p)
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- Google Drive branch ---
elif source_type == "gdrive":
gconn = state.google_connector
if gconn is None:
return jsonify({"ok": False, "error": "Google not connected — cannot redact Drive files"}), 400
name = item_meta.get("name", "")
ext = _Path(name).suffix.lower() if name else ""
if ext not in _GDRIVE_MIME_MAP:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} Drive files. Supported: DOCX, XLSX, PDF"}), 400
# item_id is "gdrive:{file_id}"
gfile_id = item_id[len("gdrive:"):] if item_id.startswith("gdrive:") else item_id
user_email = item_meta.get("account_id") or item_meta.get("_account_id", "")
tmp_path = out_path = None
try:
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
scan_pdf, redact_pdf_secure,
)
from google_connector import GoogleError as _GoogleError
# Refuse Google-native formats (Docs/Sheets exported as DOCX)
try:
mime = gconn.get_drive_file_mime(user_email, gfile_id)
except Exception as exc:
return jsonify({"ok": False, "error": f"Could not read Drive file info: {exc}"}), 500
if mime.startswith("application/vnd.google-apps."):
return jsonify({"ok": False, "error": (
"Cannot redact a Google Docs/Sheets/Slides file in-place. "
"Export it as DOCX/XLSX/PDF first, then redact the exported copy."
)}), 400
raw = gconn.download_drive_file_by_id(user_email, gfile_id)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
else: # .pdf
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — PyMuPDF and reportlab both unavailable. Install with: pip install pymupdf")
redacted_bytes = out_path.read_bytes()
gconn.update_drive_file(user_email, gfile_id, redacted_bytes, _GDRIVE_MIME_MAP[ext])
del redacted_bytes
except Exception as exc:
logger.exception("[redact] gdrive file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SFTP branch ---
elif source_type == "sftp":
full_path = item_meta.get("full_path", "")
source_uri = item_meta.get("account_name", "") # sftp://user@host/root_path
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SFTP redaction"}), 400
if not source_uri:
return jsonify({"ok": False, "error": "SFTP source info not in memory — rescan and redact in the same session"}), 400
ext = _Path(full_path).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse sftp://user@host/root to find matching source config
try:
from urllib.parse import urlparse as _urlparse
_u = _urlparse(source_uri)
_sftp_host = _u.hostname or ""
_sftp_user = _u.username or ""
except Exception:
_sftp_host = _sftp_user = ""
from app_config import _load_file_sources, _resolve_sftp_credentials
_sftp_source = next(
(s for s in _load_file_sources()
if s.get("source_type") == "sftp"
and s.get("sftp_host", "") == _sftp_host
and s.get("sftp_user", "") == _sftp_user),
None,
)
if _sftp_source is None:
return jsonify({"ok": False, "error": f"SFTP source config not found for {_sftp_host} — rescan to enable redaction"}), 400
_sftp_source = _resolve_sftp_credentials(_sftp_source)
tmp_path = out_path = None
try:
from sftp_connector import SFTPScanner as _SFTPScanner
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
_sftp = _SFTPScanner(
host=_sftp_source.get("sftp_host", ""),
root_path=_sftp_source.get("path", "/"),
username=_sftp_source.get("sftp_user", ""),
port=int(_sftp_source.get("sftp_port", 22)),
auth_type=_sftp_source.get("sftp_auth", "password"),
password=_sftp_source.get("sftp_password") or None,
key_path=_sftp_source.get("sftp_key_path") or None,
passphrase=_sftp_source.get("sftp_passphrase") or None,
)
raw = _sftp.read_file(full_path)
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_sftp.write_file(full_path, out_path.read_bytes())
except Exception as exc:
logger.exception("[redact] sftp file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- SMB branch ---
elif source_type == "smb":
full_path = item_meta.get("full_path", "")
if not full_path:
return jsonify({"ok": False, "error": "File path not available — rescan to enable SMB redaction"}), 400
ext = _Path(full_path.replace("\\", "/").split("/")[-1]).suffix.lower()
if ext not in _REDACT_EXTS:
return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT, PDF"}), 400
# Parse //host/share/... to find matching source config
_norm = full_path.replace("\\", "/").lstrip("/")
_parts = _norm.split("/", 2)
_smb_host_fp = _parts[0] if len(_parts) > 0 else ""
from app_config import _load_file_sources
from file_scanner import get_smb_password as _get_smb_pw
_smb_source = next(
(s for s in _load_file_sources()
if s.get("source_type", "smb") in ("smb", "")
and (s.get("smb_host", "") == _smb_host_fp
or s.get("path", "").replace("\\", "/").lstrip("/").split("/")[0] == _smb_host_fp)),
None,
)
if _smb_source is None:
return jsonify({"ok": False, "error": f"SMB source config not found for {_smb_host_fp}"}), 400
_smb_user = _smb_source.get("smb_user", "")
_smb_domain = _smb_source.get("smb_domain", "")
_smb_kc = _smb_source.get("keychain_key") or None
_smb_pw = _smb_source.get("smb_password") or _get_smb_pw(_smb_host_fp, _smb_user, _smb_kc) or ""
tmp_path = out_path = None
try:
from file_scanner import write_smb_file as _write_smb
from document_scanner import (
scan_docx, redact_docx,
scan_xlsx, redact_xlsx,
redact_csv,
scan_pdf, redact_pdf_secure,
find_pii_spans_in_text,
)
# Download current content
from file_scanner import _smb_read_file as _smb_read, SMB_OK as _SMB_OK
if not _SMB_OK:
raise RuntimeError("smbprotocol not installed — run: pip install smbprotocol")
import uuid as _uuid
from smbprotocol.connection import Connection as _SmbConn
from smbprotocol.session import Session as _SmbSession
from smbprotocol.tree import TreeConnect as _SmbTree
_norm2 = full_path.replace("\\", "/").lstrip("/")
_fp = _norm2.split("/", 2)
_fhost = _fp[0]; _fshare = _fp[1] if len(_fp) > 1 else ""
_frel = (_fp[2].replace("/", "\\")) if len(_fp) > 2 else ""
_smb_conn = _SmbConn(_uuid.uuid4(), _fhost, 445)
_smb_conn.connect(timeout=30)
try:
_smb_sess = _SmbSession(_smb_conn,
username=f"{_smb_domain}\\{_smb_user}" if _smb_domain else _smb_user,
password=_smb_pw, require_encryption=False)
_smb_sess.connect()
try:
_smb_tree = _SmbTree(_smb_sess, f"\\\\{_fhost}\\{_fshare}")
_smb_tree.connect()
try:
raw = _smb_read(_smb_tree, _frel)
finally:
_smb_tree.disconnect()
finally:
_smb_sess.disconnect()
finally:
_smb_conn.disconnect()
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(raw)
tmp_path = _Path(tmp.name)
del raw
with _tempfile.NamedTemporaryFile(suffix=ext, delete=False) as out:
out_path = _Path(out.name)
if ext == ".docx":
results = scan_docx(tmp_path)
redacted = redact_docx(tmp_path, out_path, results, use_ner=False)
elif ext == ".xlsx":
results = scan_xlsx(tmp_path)
redacted = redact_xlsx(tmp_path, out_path, results, use_ner=False)
elif ext == ".csv":
redacted = redact_csv(tmp_path, out_path, use_ner=False)
elif ext == ".pdf":
results = scan_pdf(tmp_path)
redacted = redact_pdf_secure(tmp_path, out_path, results,
force_ocr=False, lang="dan+eng",
dpi=200, poppler_path=None,
use_ner=False)
if redacted is False:
raise RuntimeError("PDF redaction failed — install PyMuPDF: pip install pymupdf")
else: # .txt
text = tmp_path.read_text(encoding="utf-8", errors="replace")
spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
chars = list(text)
for s, e, _ in sorted(spans, reverse=True):
chars[s:e] = [""] * (e - s)
out_path.write_text("".join(chars), encoding="utf-8")
redacted = len(spans)
_write_smb(full_path, out_path.read_bytes(), _smb_user, _smb_pw, _smb_domain)
except Exception as exc:
logger.exception("[redact] smb file error")
return jsonify({"ok": False, "error": str(exc)}), 500
finally:
for _p in (tmp_path, out_path):
if _p and _p.exists():
try:
_p.unlink()
except Exception:
pass
# --- shared: remove from grid + DB ---
state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id]
_db = _get_db() if DB_OK else None
if _db:
try:
_db.log_deletion(item_meta, reason="redacted")
_db.delete_item_record(item_id)
except Exception:
pass
_audit("item_redact",
f"id={item_id!r} name={item_meta.get('name','')!r} spans={redacted}",
ip=request.remote_addr or "")
logger.info("[redact] %s: %d CPR span(s) redacted", item_meta.get('name', item_id), redacted)
return jsonify({"ok": True, "redacted": redacted})
@bp.route("/api/delete_bulk", methods=["POST"])
def delete_bulk():
"""Delete multiple items matching criteria. Streams progress as SSE."""
@ -1758,11 +1213,10 @@ def delete_bulk():
except Exception: pass
return jsonify({
"ok": True,
"deleted": len(deleted_ids),
"deleted_ids": deleted_ids, # so the grid can mark exactly these
"failed": len(failed_items),
"errors": failed_items[:10], # cap error list
"ok": True,
"deleted": len(deleted_ids),
"failed": len(failed_items),
"errors": failed_items[:10], # cap error list
})
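
Note on the .txt branches of redact_item above: the local, SFTP and SMB paths each repeat the same span-replacement loop. A minimal sketch of that loop as one helper is given here. The names `find_cpr_spans` and `redact_spans` and the regular expression are hypothetical stand-ins for illustration; the real `find_pii_spans_in_text` in `document_scanner` is not shown in this diff and may match differently.

```python
from __future__ import annotations

import re

# Hypothetical CPR pattern (DDMMYY-XXXX or DDMMYYXXXX). This is an assumption,
# not the detector used by document_scanner.
_CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def find_cpr_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, label) tuples, the shape the diff's loop unpacks."""
    return [(m.start(), m.end(), "CPR") for m in _CPR_RE.finditer(text)]


def redact_spans(text: str, spans, mask: str = "") -> tuple[str, int]:
    """Replace every character inside each span with `mask`.

    With the default empty mask the matched characters are removed, which is
    what the loop in the diff does as written.
    """
    chars = list(text)
    # Walk the spans from the end of the text towards the start, as the diff does.
    for start, end, _label in sorted(spans, reverse=True):
        chars[start:end] = [mask] * (end - start)
    return "".join(chars), len(spans)
```

Calling `redact_spans(text, find_cpr_spans(text), mask="X")` would keep the text length unchanged, which may be preferable when fixed-width layouts matter.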

View File

@ -140,16 +140,6 @@ def _run_google_scan(options: dict):
max_file_mb = float(scan_opts.get("max_file_mb", 50.0))
scan_body = bool(scan_opts.get("scan_body", True))
scan_att = bool(scan_opts.get("scan_attachments", True))
delta_enabled = bool(scan_opts.get("delta", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
from checkpoint import (_load_delta_tokens, _save_delta_tokens,
_save_checkpoint, _load_checkpoint, _clear_checkpoint)
_drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
_new_drive_tokens: dict = {}
# Resolve users: explicit list → Admin SDK → fall back to SA email itself
_user_role_map: dict = {} # email → role
@ -198,45 +188,14 @@ def _run_google_scan(options: dict):
except Exception as e:
logger.error("[google_scan] begin_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted Google scan ────────────
import hashlib as _hl, json as _js
_gck_prefix = "google"
_gck_key = _hl.sha256(_js.dumps({
"emails": sorted(user_emails),
"sources": sorted(sources),
"older_than_days": scan_opts.get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
_gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
_g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
_google_flagged: list = [] # items found by this Google scan (for checkpoint)
_gck_resumed = len(_g_scanned_ids)
if _gck:
from scan_engine import _with_disposition as _wd_ck
_google_flagged = list(_gck.get("flagged", []))
flagged_items.extend(_google_flagged)
broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
for _card in _google_flagged:
broadcast("scan_file_flagged", _wd_ck(_card, _db))
_GCHECKPOINT_SAVE_EVERY = 25
_g_items_since_save = 0
total_flagged = 0
total_scanned = 0
t_start = _time.monotonic()
def _check_abort():
if _scan_abort.is_set():
# Emit google_scan_done (not scan_cancelled) so that the frontend
# google_scan_done handler can decide whether to close the SSE based
# on whether other scan types (M365, file) are still running.
# scan_cancelled would unconditionally close the SSE connection,
# dropping events from a concurrently running new scan.
broadcast("google_scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
"elapsed_seconds": round(_time.monotonic() - t_start, 1),
"cancelled": True,
})
from gdpr_scanner import _scan_abort as _sa
if _sa.is_set():
broadcast("scan_cancelled", {"completed": total_scanned})
return True
return False
@ -248,8 +207,6 @@ def _run_google_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -266,10 +223,8 @@ def _run_google_scan(options: dict):
"special_category": [],
"face_count": 0,
"exif": {},
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
flagged_items.append(card)
_google_flagged.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db))
total_flagged += 1
if _db and _db_scan_id:
@ -301,10 +256,6 @@ def _run_google_scan(options: dict):
):
if _check_abort():
return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", {
@ -316,33 +267,14 @@ def _run_google_scan(options: dict):
})
try:
meta["_account"] = _display_name
meta["_source_type"] = "gmail"
# Extract a plain-text excerpt before scanning (body is discarded after)
try:
import re as _re
_raw = data[:3000].decode("utf-8", errors="replace")
_plain = _re.sub(r"<[^>]+>", " ", _raw)
meta["_body_excerpt"] = " ".join(_plain.split())[:500]
except Exception:
meta["_body_excerpt"] = ""
result = _scan_bytes(data, meta.get("name", "msg.txt"), lang=ocr_lang)
result = _scan_bytes(data, meta.get("name", "msg.txt"))
except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue
cprs = result.get("cprs", [])
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
if cprs or (pii_counts and any(pii_counts.values())):
_broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e:
broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
except Exception as e:
@ -351,45 +283,14 @@ def _run_google_scan(options: dict):
# ── Google Drive ──────────────────────────────────────────────────────
if "gdrive" in sources:
try:
delta_key = f"gdrive:{user_email}"
saved_token = _drive_delta_tokens.get(delta_key) if delta_enabled else None
if delta_enabled and saved_token:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta)"})
try:
drive_items, new_token = conn.get_drive_changes(
user_email, saved_token,
max_files=max_files, max_file_mb=max_file_mb,
)
_new_drive_tokens[delta_key] = new_token
except Exception as delta_err:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
# Record start token BEFORE iterating so the next delta starts from here
try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception:
pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
else:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
# Record start token BEFORE iterating so the next delta starts from here
if delta_enabled:
try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception:
pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
for meta, data in drive_items:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
for meta, data in conn.iter_drive_files(
user_email,
max_files=max_files,
max_file_mb=max_file_mb,
):
if _check_abort():
return
_item_id = meta.get("id", "")
if _item_id in _g_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_file", {"file": meta.get("name", "")})
broadcast("scan_progress", {
@ -401,50 +302,27 @@ def _run_google_scan(options: dict):
})
try:
meta["_account"] = _display_name
meta["_source_type"] = "gdrive"
result = _scan_bytes(data, meta.get("name", "file"), lang=ocr_lang)
result = _scan_bytes(data, meta.get("name", "file"))
except Exception as e:
broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
_g_scanned_ids.add(_item_id)
continue
cprs = result.get("cprs", [])
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (not cpr_only and ((pii_counts and any(pii_counts.values())) or _em or _ph)):
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
if cprs or (pii_counts and any(pii_counts.values())):
_broadcast_card(meta, cprs, pii_counts)
_g_scanned_ids.add(_item_id)
_g_items_since_save += 1
if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
_save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
_g_items_since_save = 0
except GoogleError as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
except Exception as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
if delta_enabled and _new_drive_tokens:
try:
current_tokens = _load_delta_tokens()
_save_delta_tokens({**current_tokens, **_new_drive_tokens})
except Exception as e:
logger.warning("[gdrive delta] token save failed: %s", e)
if not _scan_abort.is_set():
_clear_checkpoint(prefix=_gck_prefix)
elapsed = _time.monotonic() - t_start
broadcast("google_scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
broadcast("scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
"elapsed_seconds": round(elapsed, 1),
"delta": delta_enabled and bool(_new_drive_tokens),
"delta_sources": len(_new_drive_tokens),
})
if _db and _db_scan_id:
try:
_db.finish_scan(_db_scan_id, total_scanned)
_db.end_scan(_db_scan_id, total_scanned, total_flagged)
except Exception:
pass
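
Note on the Google checkpoint key above: the key is the first 16 hex characters of a SHA-256 digest over a JSON payload with sorted lists and sorted keys, so the same selection yields the same key regardless of input order. The same expression also appears in the checkpoint-info route, so a shared helper might keep the two from drifting apart. The sketch below assumes a hypothetical helper name, `google_checkpoint_key`, which does not exist in the diff.

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0) -> str:
    """Derive a stable 16-character key from the scan selection.

    Hypothetical helper; it mirrors the inline expression in the diff.
    """
    payload = json.dumps(
        {
            "emails": sorted(emails),        # order-insensitive
            "sources": sorted(sources),      # order-insensitive
            "older_than_days": older_than_days,
        },
        sort_keys=True,                      # stable key order in the JSON text
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Any option that changes which items a scan visits probably belongs in the payload; otherwise a resumed scan could skip items that the new options would have included.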

View File

@ -4,10 +4,6 @@ Scan profiles
from __future__ import annotations
from flask import Blueprint, jsonify, request
from app_config import _profiles_load, _profile_save, _profile_delete, _profile_get
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("profiles", __name__)
@ -25,8 +21,6 @@ def profiles_save():
if not profile.get("name"):
return jsonify({"error": "name required"}), 400
saved = _profile_save(profile)
_audit("profile_save", f"name={profile.get('name')!r}",
ip=request.remote_addr or "")
return jsonify({"status": "saved", "profile": saved})
@ -38,8 +32,6 @@ def profiles_delete():
if not key:
return jsonify({"error": "name or id required"}), 400
ok = _profile_delete(key)
if ok:
_audit("profile_delete", f"key={key!r}", ip=request.remote_addr or "")
return jsonify({"status": "deleted" if ok else "not_found"})
@ -51,3 +43,5 @@ def profiles_get():
if not p:
return jsonify({"error": "not found"}), 404
return jsonify({"profile": p})

View File

@ -3,70 +3,18 @@ Scan stream, start/stop, checkpoint, settings, delta
"""
from __future__ import annotations
import threading
import logging
from flask import Blueprint, jsonify, request
from routes import state
from app_config import (
_save_settings, _load_settings,
_load_src_toggles, _save_src_toggles,
_load_smtp_config,
)
from checkpoint import (
_checkpoint_key, _load_checkpoint, _clear_checkpoint,
_load_delta_tokens, _DELTA_PATH, _cp_path,
_load_delta_tokens, _DELTA_PATH,
)
bp = Blueprint("scan", __name__)
_log = logging.getLogger(__name__)
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
def _maybe_send_auto_email():
"""Send the scan report email after a manual scan if auto_email_manual is enabled."""
try:
smtp_cfg = _load_smtp_config()
if not smtp_cfg.get("auto_email_manual"):
return
if not state.flagged_items:
return
recipients = smtp_cfg.get("recipients", [])
if isinstance(recipients, str):
recipients = [r.strip() for r in recipients.replace(";", ",").split(",") if r.strip()]
if not recipients:
return
from routes.export import _build_excel_bytes
from routes.email import _send_report_email, _send_email_graph
import datetime as _dt
xl_bytes, fname = _build_excel_bytes()
subject = f"GDPR Scanner — scan report {_dt.datetime.now().strftime('%Y-%m-%d')}"
body_html = (
"<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>"
"<h2 style='color:#1F3864'>☁️ GDPR Scanner — scan report</h2>"
f"<p>Please find the latest scan report attached ({fname}).</p>"
f"<p style='color:#888;font-size:12px'>Generated: {_dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}<br>"
f"Items flagged: {len(state.flagged_items)}</p>"
"</body></html>"
)
if state.connector and state.connector.is_authenticated() and not smtp_cfg.get("prefer_smtp"):
try:
_send_email_graph(subject, body_html, recipients,
attachment_bytes=xl_bytes, attachment_name=fname)
_log.info("[auto-email] report sent via Graph to %s", recipients)
return
except Exception as e:
_log.warning("[auto-email] Graph failed, trying SMTP: %s", e)
_send_report_email(xl_bytes, fname, smtp_cfg, recipients)
_log.info("[auto-email] report sent via SMTP to %s", recipients)
except Exception as e:
_log.error("[auto-email] failed: %s", e)
@bp.route("/api/scan/status")
@ -76,13 +24,9 @@ def scan_status():
acquired = state._scan_lock.acquire(blocking=False)
if acquired:
state._scan_lock.release()
g_acquired = state._google_scan_lock.acquire(blocking=False)
if g_acquired:
state._google_scan_lock.release()
return jsonify({
"running": not acquired, # M365 + file scan lock
"google_running": not g_acquired, # Google scan lock (separate)
"scan_id": _sse_mod._current_scan_id or None,
"running": not acquired,
"scan_id": _sse_mod._current_scan_id or None,
})
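
Note on scan_status above: the route tests whether a scan is running by attempting a non-blocking acquire and releasing immediately on success. A minimal sketch of that probe as a reusable function follows; `lock_is_held` is a hypothetical name, not part of the code base.

```python
import threading


def lock_is_held(lock) -> bool:
    """Return True if `lock` is currently held by some thread.

    Tries a non-blocking acquire; on success the lock was free, so it is
    released again at once and False is returned.
    """
    acquired = lock.acquire(blocking=False)
    if acquired:
        lock.release()
    return not acquired
```

The result is only a snapshot: another thread may take or release the lock right after the probe returns. For a plain `threading.Lock`, the built-in `locked()` method gives the same answer without touching the lock.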
@ -113,21 +57,15 @@ def scan_start():
from scan_engine import run_scan
try:
run_scan(options)
_maybe_send_auto_email()
finally:
state._scan_lock.release()
threading.Thread(target=_run, daemon=True).start()
_audit("scan_start",
f"sources={options.get('sources',[])} profile_id={profile_id!r}",
ip=request.remote_addr or "")
return jsonify({"status": "started"})
@bp.route("/api/scan/stop", methods=["POST"])
def scan_stop():
state._scan_abort.set()
state._google_scan_abort.set()
_audit("scan_stop", "", ip=request.remote_addr or "")
return jsonify({"status": "stopping"})
@ -135,80 +73,28 @@ def scan_stop():
def scan_checkpoint_info():
"""Return info about any saved checkpoint for the given scan options.
If check_only=true, just reports whether a scan is currently running."""
import hashlib, json as _json
options = request.get_json() or {}
if options.get("check_only"):
acquired = state._scan_lock.acquire(blocking=False)
if acquired:
state._scan_lock.release()
return jsonify({"running": not acquired})
engines = {}
# M365
if options.get("sources"):
key = _checkpoint_key(options)
cp = _load_checkpoint(key, prefix="m365")
if cp:
engines["m365"] = {
"exists": True,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
}
# Google
google_emails = options.get("googleUserEmails", [])
google_sources = options.get("googleSources", [])
if google_emails and google_sources:
gkey = hashlib.sha256(_json.dumps({
"emails": sorted(google_emails),
"sources": sorted(google_sources),
"older_than_days": options.get("options", {}).get("older_than_days", 0),
}, sort_keys=True).encode()).hexdigest()[:16]
cp = _load_checkpoint(gkey, prefix="google")
if cp:
engines["google"] = {
"exists": True,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
}
# File sources (one checkpoint per source ID)
for src_id in options.get("fileSources", []):
fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
if cp:
fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
fe["scanned_count"] += len(cp.get("scanned_ids", []))
fe["flagged_count"] += len(cp.get("flagged", []))
if not fe["started_at"]:
fe["started_at"] = cp.get("meta", {}).get("started_at")
if not engines:
key = _checkpoint_key(options)
cp = _load_checkpoint(key)
if not cp:
return jsonify({"exists": False})
started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
return jsonify({
"exists": True,
"scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
"flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
"started_at": min(started_ats) if started_ats else None,
"engines": engines,
"scanned_count": len(cp.get("scanned_ids", [])),
"flagged_count": len(cp.get("flagged", [])),
"started_at": cp.get("meta", {}).get("started_at"),
})
@bp.route("/api/scan/clear_checkpoint", methods=["POST"])
def scan_clear_checkpoint():
"""Discard all saved checkpoints so the next scan starts fresh."""
from pathlib import Path
data_dir = Path.home() / ".gdprscanner"
for f in data_dir.glob("checkpoint_*.json"):
try:
f.unlink()
except Exception:
pass
"""Discard any saved checkpoint so the next scan starts fresh."""
_clear_checkpoint()
return jsonify({"status": "cleared"})
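The Google branch of `scan_checkpoint_info` above derives its checkpoint key by hashing a canonical JSON form of the scan options. The following is a small sketch of that idea under the same field names; the helper name `google_checkpoint_key` is hypothetical, and the real `_checkpoint_key` helper may differ:

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0):
    """Derive a short key that ignores the order of emails and sources."""
    payload = json.dumps(
        {
            "emails": sorted(emails),
            "sources": sorted(sources),
            "older_than_days": older_than_days,
        },
        sort_keys=True,  # canonical key order, so equal options hash equally
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


key_a = google_checkpoint_key(["b@example.com", "a@example.com"], ["gmail", "drive"])
key_b = google_checkpoint_key(["a@example.com", "b@example.com"], ["drive", "gmail"])
key_c = google_checkpoint_key(["a@example.com", "b@example.com"], ["drive", "gmail"], 30)
```

Sorting the lists and the dictionary keys before hashing means a resumed scan finds its checkpoint even when the UI sends the same selection in a different order, while any change to the options produces a different key.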

@ -4,10 +4,6 @@ Scheduler API routes — multi-job CRUD, status, history, run-now.
from __future__ import annotations
from flask import Blueprint, jsonify, request
import sys, os, threading
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("scheduler", __name__)
@ -56,9 +52,6 @@ def scheduler_jobs_save():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_save",
f"id={job_id!r} name={jobs[i].get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": jobs[i]})
# New job
job = sm._new_job(data)
@ -68,9 +61,6 @@ def scheduler_jobs_save():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_save",
f"id={job.get('id','')!r} name={job.get('name','')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "job": job})
except Exception as e:
import traceback
@ -91,7 +81,6 @@ def scheduler_jobs_delete():
_sched().reload()
except Exception:
pass
_audit("scheduler_job_delete", f"id={job_id!r}", ip=request.remote_addr or "")
return jsonify({"ok": True})
except Exception as e:
import traceback

@ -3,15 +3,9 @@ File sources and file scan
"""
from __future__ import annotations
import threading
import uuid as _uuid
from pathlib import Path
from flask import Blueprint, jsonify, request
from routes import state
from app_config import _load_file_sources, _save_file_sources, _SFTP_KEYS_DIR
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
from app_config import _load_file_sources, _save_file_sources
try:
from file_scanner import store_smb_password, SMB_OK as _SMB_OK
@ -21,12 +15,6 @@ except ImportError:
_SMB_OK = False
def store_smb_password(*a, **kw): return False # type: ignore[misc]
try:
from sftp_connector import store_sftp_password, SFTP_OK as _SFTP_OK
except ImportError:
_SFTP_OK = False
def store_sftp_password(*a, **kw): return False # type: ignore[misc]
bp = Blueprint("sources", __name__)
@ -35,166 +23,70 @@ def file_sources_list():
"""Return all saved file source definitions."""
sources = _load_file_sources()
return jsonify({
"sources": sources,
"smb_available": _SMB_OK,
"sftp_available": _SFTP_OK,
"scanner_ok": _FILE_SCANNER_OK,
"sources": sources,
"smb_available": _SMB_OK,
"scanner_ok": _FILE_SCANNER_OK,
})
@bp.route("/api/file_sources/save", methods=["POST"])
def file_sources_save():
"""Add or update a file source. Assigns a UUID if id is missing."""
import uuid as _uuid
data = request.get_json() or {}
source_type = data.get("source_type", "")
# Validate required fields per source type
if source_type == "sftp":
if not data.get("sftp_host", "").strip():
return jsonify({"error": "sftp_host required"}), 400
if not data.get("sftp_user", "").strip():
return jsonify({"error": "sftp_user required"}), 400
if not data.get("path", "").strip():
data["path"] = "/"
else:
if not data.get("path", "").strip():
return jsonify({"error": "path required"}), 400
path = data.get("path", "").strip()
if not path:
return jsonify({"error": "path required"}), 400
sources = _load_file_sources()
uid = data.get("id") or ""
for i, s in enumerate(sources):
if s.get("id") == uid:
sources[i] = {**s, **data}
_save_file_sources(sources)
_audit("source_update",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": sources[i]})
data["id"] = data.get("id") or str(_uuid.uuid4())
sources.append(data)
_save_file_sources(sources)
_audit("source_add",
f"name={data.get('name','')!r} type={data.get('source_type','local')!r}",
ip=request.remote_addr or "")
return jsonify({"ok": True, "source": data})
@bp.route("/api/file_sources/delete", methods=["POST"])
def file_sources_delete():
"""Remove a file source by id. Also deletes any associated SFTP key file."""
"""Remove a file source by id."""
uid = (request.get_json() or {}).get("id", "")
if not uid:
return jsonify({"error": "id required"}), 400
sources = _load_file_sources()
deleted = next((s for s in sources if s.get("id") == uid), None)
sources = [s for s in sources if s.get("id") != uid]
sources = [s for s in _load_file_sources() if s.get("id") != uid]
_save_file_sources(sources)
if deleted:
_audit("source_delete",
f"name={deleted.get('name','')!r} type={deleted.get('source_type','local')!r}",
ip=request.remote_addr or "")
# Clean up key file if this was an SFTP key-auth source
if deleted and deleted.get("sftp_key_path"):
key_file = Path(deleted["sftp_key_path"])
if key_file.parent == _SFTP_KEYS_DIR and key_file.exists():
try:
key_file.unlink()
except OSError:
pass
return jsonify({"ok": True})
@bp.route("/api/file_sources/store_creds", methods=["POST"])
def file_sources_store_creds():
"""Store SMB or SFTP password/passphrase in the OS keychain."""
data = request.get_json() or {}
source_type = data.get("source_type", "smb")
password = data.get("password", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
host = data.get("sftp_host", "")
user = data.get("sftp_user", "")
if not user or not password:
return jsonify({"error": "sftp_user and password required"}), 400
key = data.get("keychain_key") or f"sftp:{user}@{host}"
ok = store_sftp_password(host, user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
else:
if not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503
smb_host = data.get("smb_host", "")
smb_user = data.get("smb_user", "")
if not smb_user or not password:
return jsonify({"error": "smb_user and password required"}), 400
key = data.get("keychain_key") or smb_user
ok = store_smb_password(smb_host, smb_user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
@bp.route("/api/file_sources/upload_key", methods=["POST"])
def file_sources_upload_key():
"""Accept an SSH private key file upload and store it in the SFTP keys directory.
Validates the file is a recognised private key format before saving.
Returns {"key_id": uuid, "key_path": absolute_path}.
"""
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
if "key_file" not in request.files:
return jsonify({"error": "key_file required"}), 400
file = request.files["key_file"]
raw = file.read(65536) # 64 KB is more than enough for any private key
# Validate before saving — try loading the key material with paramiko
import io
import paramiko
loaded = False
for cls in (paramiko.RSAKey, paramiko.Ed25519Key, paramiko.ECDSAKey, paramiko.DSSKey):
try:
cls.from_private_key(io.BytesIO(raw))
loaded = True
break
except (paramiko.ssh_exception.SSHException, Exception):
continue
if not loaded:
# Might be passphrase-protected — still accept it; validation will happen at connect time
if b"-----BEGIN" not in raw and b"OPENSSH PRIVATE KEY" not in raw:
return jsonify({"error": "File does not appear to be a private key"}), 400
key_id = str(_uuid.uuid4())
key_path = _SFTP_KEYS_DIR / key_id
key_path.write_bytes(raw)
key_path.chmod(0o600)
return jsonify({"ok": True, "key_id": key_id, "key_path": str(key_path)})
"""Store SMB password in the OS keychain."""
if not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503
data = request.get_json() or {}
smb_host = data.get("smb_host", "")
smb_user = data.get("smb_user", "")
password = data.get("password", "")
key = data.get("keychain_key") or smb_user
if not smb_user or not password:
return jsonify({"error": "smb_user and password required"}), 400
ok = store_smb_password(smb_host, smb_user, password, key)
if ok:
return jsonify({"ok": True, "keychain_key": key})
return jsonify({"error": "keyring not available — install: pip install keyring"}), 500
@bp.route("/api/file_scan/start", methods=["POST"])
def file_scan_start():
"""Start a file system scan for a single file source (local, SMB, or SFTP)."""
source = request.get_json() or {}
source_type = source.get("source_type", "")
if source_type == "sftp":
if not _SFTP_OK:
return jsonify({"error": "paramiko not installed — run: pip install paramiko"}), 503
elif not _FILE_SCANNER_OK:
"""Start a file system scan for a single file source."""
if not _FILE_SCANNER_OK:
return jsonify({"error": "file_scanner not available"}), 503
if not state._scan_lock.acquire(blocking=False):
return jsonify({"error": "scan already running"}), 409
source = request.get_json() or {}
state._scan_abort.clear()
def _run():
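The SFTP-aware variant of `file_sources_save` in this diff validates different required fields depending on `source_type`. A sketch of that rule as a pure function, so it can be checked without Flask; `validate_source` is a hypothetical name, and wrapping values in `str()` is an extra safeguard that the route itself does not apply:

```python
def validate_source(data):
    """Return (cleaned_copy, error); exactly one of the two is None."""
    data = dict(data)  # do not mutate the caller's payload
    if data.get("source_type", "") == "sftp":
        for field in ("sftp_host", "sftp_user"):
            if not str(data.get(field, "")).strip():
                return None, f"{field} required"
        # SFTP sources default to the remote root when no path is given.
        if not str(data.get("path", "")).strip():
            data["path"] = "/"
    elif not str(data.get("path", "")).strip():
        return None, "path required"
    return data, None


ok_sftp, err_sftp = validate_source(
    {"source_type": "sftp", "sftp_host": "files.example.com", "sftp_user": "scan"}
)
missing_host = validate_source({"source_type": "sftp", "sftp_user": "scan"})
missing_path = validate_source({"source_type": "local"})
```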

@ -1,216 +0,0 @@
"""
Software update routes: check origin for new commits, apply the update,
and an optional auto-update background thread.
Only available when running from a git checkout; the frozen desktop
build (PyInstaller) reports supported=False and the UI hides the group.
Applying an update fast-forwards to origin/<branch>, reinstalls
dependencies if requirements.txt changed, then re-execs the process so
the new code is loaded. Local edits are stashed (kept), never discarded.
"""
from __future__ import annotations
import os
import subprocess
import sys
import threading
import time
from pathlib import Path
from flask import Blueprint, jsonify, request
from routes import state
from app_config import get_update_config, save_update_config
bp = Blueprint("updates", __name__)
_REPO_DIR = Path(__file__).parent.parent
_GIT_TIMEOUT = 30
_AUTO_CHECK_INTERVAL = 24 * 3600 # auto-update checks once per day
_last_auto_check = [0.0]
def _supported() -> bool:
return (not getattr(sys, "frozen", False)) and (_REPO_DIR / ".git").exists()
def _git(*args: str, timeout: int = _GIT_TIMEOUT) -> subprocess.CompletedProcess:
return subprocess.run(
["git", *args], cwd=_REPO_DIR,
capture_output=True, text=True, timeout=timeout,
)
def _scan_running() -> bool:
return state._scan_lock.locked() or state._google_scan_lock.locked()
def check_for_update() -> dict:
"""Fetch origin and compare HEAD against the tracked branch."""
if not _supported():
return {"supported": False}
try:
branch = _git("rev-parse", "--abbrev-ref", "HEAD").stdout.strip() or "main"
fetch = _git("fetch", "origin", branch, timeout=60)
if fetch.returncode != 0:
return {"supported": True, "error": fetch.stderr.strip()[:300] or "git fetch failed"}
local = _git("rev-parse", "HEAD").stdout.strip()
remote = _git("rev-parse", f"origin/{branch}").stdout.strip()
except (subprocess.TimeoutExpired, OSError) as e:
return {"supported": True, "error": str(e)[:300]}
info = {
"supported": True, "branch": branch,
"current": local[:7], "latest": remote[:7],
"up_to_date": local == remote, "commits": [],
}
if local != remote:
lg = _git("log", "--oneline", f"HEAD..origin/{branch}")
info["commits"] = lg.stdout.strip().splitlines()[:20]
return info
def apply_update() -> dict:
"""Fast-forward to origin/<branch>; returns {"ok", "updated", ...}.
Does NOT restart the process; callers decide (the route schedules a
re-exec, the auto-update thread restarts directly).
"""
chk = check_for_update()
if not chk.get("supported"):
return {"ok": False, "code": "unsupported",
"error": "Updates require running from a git checkout."}
if chk.get("error"):
return {"ok": False, "code": "check_failed", "error": chk["error"]}
if chk.get("up_to_date"):
return {"ok": True, "updated": False, "current": chk["current"]}
if _scan_running():
return {"ok": False, "code": "scan_running",
"error": "Cannot update while a scan is running."}
branch = chk["branch"]
try:
if _git("diff-index", "--quiet", "HEAD", "--").returncode != 0:
_git("stash", "push", "-m",
"auto-stash before update " + time.strftime("%Y-%m-%d %H:%M:%S"))
reqs_changed = _git(
"diff", "--quiet", f"HEAD..origin/{branch}", "--", "requirements.txt"
).returncode != 0
merge = _git("merge", "--ff-only", f"origin/{branch}")
if merge.returncode != 0:
return {"ok": False, "code": "merge_failed",
"error": (merge.stderr.strip() or "git merge failed")[:300]}
if reqs_changed:
subprocess.run(
[sys.executable, "-m", "pip", "install", "-q", "-r",
str(_REPO_DIR / "requirements.txt")],
cwd=_REPO_DIR, capture_output=True, timeout=600,
)
except (subprocess.TimeoutExpired, OSError) as e:
return {"ok": False, "code": "apply_failed", "error": str(e)[:300]}
try:
from gdpr_db import log_audit_event as _audit
_audit("app_update", f"{chk['current']} -> {chk['latest']}",
ip=(request.remote_addr if request else ""))
except Exception:
pass
return {"ok": True, "updated": True,
"from": chk["current"], "to": chk["latest"]}
def _mark_fds_cloexec() -> None:
"""Mark every fd above stderr close-on-exec.
Werkzeug calls ``srv.socket.set_inheritable(True)`` unconditionally
(for its debug reloader), so without this the listening socket leaks
into the exec'd process: it sits on the port as a zombie listener no
one accepts from, the port probe sees the port as busy, and the new
server hops to port+1 while clients hang against the dead socket.
"""
try:
fds = [int(f) for f in os.listdir("/proc/self/fd")] # Linux
except (OSError, ValueError):
fds = list(range(3, 4096))
for fd in fds:
if fd > 2:
try:
os.set_inheritable(fd, False)
except OSError:
pass
def _restart_self() -> None:
"""Re-exec the current process so the updated code is loaded.
Keeps the same PID, so it works both under systemd and when launched
manually via start_gdpr.sh.
"""
_mark_fds_cloexec()
try:
os.execv(sys.executable, [sys.executable] + sys.argv)
except OSError:
# Last resort: exit and rely on a supervisor (systemd Restart=) to
# bring the app back up.
os._exit(0)
def _schedule_restart(delay: float = 1.5) -> None:
def _later():
time.sleep(delay)
_restart_self()
threading.Thread(target=_later, daemon=True, name="update-restart").start()
# ── Routes ────────────────────────────────────────────────────────────────────
@bp.route("/api/update/check")
def update_check():
return jsonify(check_for_update())
@bp.route("/api/update/apply", methods=["POST"])
def update_apply():
res = apply_update()
if res.get("updated"):
res["restarting"] = True
_schedule_restart()
return jsonify(res), (200 if res.get("ok") else 409)
@bp.route("/api/update/settings", methods=["GET", "POST"])
def update_settings():
if request.method == "GET":
return jsonify({"supported": _supported(), **get_update_config()})
data = request.get_json(silent=True) or {}
save_update_config(bool(data.get("auto_update", False)))
return jsonify({"ok": True})
# ── Auto-update background thread ─────────────────────────────────────────────
def _auto_update_loop() -> None:
while True:
time.sleep(3600)
try:
if not get_update_config().get("auto_update"):
continue
if time.time() - _last_auto_check[0] < _AUTO_CHECK_INTERVAL:
continue
_last_auto_check[0] = time.time()
if _scan_running():
_last_auto_check[0] = 0.0 # retry on the next hourly tick
continue
res = apply_update()
if res.get("updated"):
print(f" Auto-update: {res['from']} -> {res['to']} — restarting")
_restart_self()
except Exception:
pass
def start_auto_update_thread() -> bool:
"""Called once at startup from gdpr_scanner.py. No-op for frozen builds."""
if not _supported():
return False
threading.Thread(target=_auto_update_loop, daemon=True, name="auto-update").start()
return True
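The `_mark_fds_cloexec` docstring above describes a listening socket leaking into the re-exec'd process because it was marked inheritable. The sketch below shows the single-descriptor version of that fix with the standard library; `mark_not_inheritable` is a hypothetical helper, and the example assumes a POSIX system where a socket's `fileno()` is an ordinary file descriptor:

```python
import os
import socket


def mark_not_inheritable(fd: int) -> bool:
    """Clear the inheritable flag on one descriptor; False if fd is invalid."""
    try:
        os.set_inheritable(fd, False)
        return True
    except OSError:
        return False


# Simulate what the docstring says Werkzeug does to its listening socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.set_inheritable(True)
was_inheritable = listener.get_inheritable()

# After this call the descriptor is closed automatically on exec.
mark_not_inheritable(listener.fileno())
now_inheritable = listener.get_inheritable()
listener.close()
```

Python creates descriptors as non-inheritable by default, so the sweep in the diff matters only for descriptors that some library has explicitly flipped back.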

@ -14,15 +14,7 @@ from app_config import (
set_viewer_pin,
verify_viewer_pin,
clear_viewer_pin,
get_interface_pin_hash,
set_interface_pin,
verify_interface_pin,
clear_interface_pin,
)
try:
from gdpr_db import log_audit_event as _audit
except ImportError:
def _audit(*a, **kw): pass # type: ignore[misc]
bp = Blueprint("viewer", __name__)
@ -60,7 +52,6 @@ def list_tokens():
"token_hint": t["token"][:8] + "…",
"token": t["token"],
"label": t.get("label", ""),
"scope": t.get("scope", {}),
"created_at": t.get("created_at"),
"expires_at": t.get("expires_at"),
"last_used_at": t.get("last_used_at"),
@ -82,49 +73,7 @@ def create_token():
return jsonify({"error": "expires_days must be a positive integer"}), 400
except (TypeError, ValueError):
return jsonify({"error": "expires_days must be a positive integer"}), 400
raw_scope = body.get("scope", {})
if not isinstance(raw_scope, dict):
return jsonify({"error": "scope must be an object"}), 400
role = str(raw_scope.get("role", "")).strip()
# user may be a single email string (legacy) or a list of email strings
raw_user = raw_scope.get("user", "")
if isinstance(raw_user, str):
user_emails = [raw_user.strip().lower()] if raw_user.strip() else []
elif isinstance(raw_user, list):
user_emails = [str(e).strip().lower() for e in raw_user if str(e).strip()]
else:
user_emails = []
display_name = str(raw_scope.get("display_name", "")).strip()
if role and user_emails:
return jsonify({"error": "scope.role and scope.user are mutually exclusive"}), 400
if role not in ("", "student", "staff"):
return jsonify({"error": "scope.role must be '', 'student', or 'staff'"}), 400
if user_emails and not all("@" in e for e in user_emails):
return jsonify({"error": "scope.user entries must be valid email addresses"}), 400
valid_from = str(raw_scope.get("valid_from", "")).strip()
valid_to = str(raw_scope.get("valid_to", "")).strip()
from datetime import datetime as _dt
for _d, _lbl in ((valid_from, "valid_from"), (valid_to, "valid_to")):
if _d:
try:
_dt.strptime(_d, "%Y-%m-%d")
except ValueError:
return jsonify({"error": f"scope.{_lbl} must be YYYY-MM-DD"}), 400
if valid_from and valid_to and valid_from > valid_to:
return jsonify({"error": "scope.valid_from must be ≤ scope.valid_to"}), 400
if user_emails:
scope = {"user": user_emails, "display_name": display_name or user_emails[0]}
elif role:
scope = {"role": role}
else:
scope = {}
if valid_from:
scope["valid_from"] = valid_from
if valid_to:
scope["valid_to"] = valid_to
entry = create_viewer_token(label=label, expires_days=expires_days, scope=scope)
_audit("token_create", f"label={label!r} scope={scope}",
ip=request.remote_addr or "")
entry = create_viewer_token(label=label, expires_days=expires_days)
return jsonify(entry), 201
@ -135,7 +84,6 @@ def delete_token(token: str):
removed = revoke_viewer_token(token)
if not removed:
return jsonify({"error": "token not found"}), 404
_audit("token_revoke", f"token={token[:8]}...", ip=request.remote_addr or "")
return jsonify({"ok": True})
@ -169,13 +117,10 @@ def pin_set():
return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400
had_pin = bool(get_viewer_pin_hash())
if had_pin:
if get_viewer_pin_hash():
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
set_viewer_pin(new_pin)
_audit("viewer_pin_change" if had_pin else "viewer_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True})
@ -187,49 +132,6 @@ def pin_clear():
if not verify_viewer_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
clear_viewer_pin()
_audit("viewer_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True})
# ── Interface PIN management endpoints ───────────────────────────────────────
@bp.route("/api/interface/pin", methods=["GET"])
def interface_pin_status():
"""Return whether an interface PIN is currently set."""
return jsonify({"pin_set": bool(get_interface_pin_hash())})
@bp.route("/api/interface/pin", methods=["POST"])
def interface_pin_set():
"""Set or change the interface PIN.
Body: {pin: "...", current_pin: "..."}
current_pin required only when a PIN is already set.
"""
body = request.get_json(silent=True) or {}
new_pin = str(body.get("pin", "")).strip()
if not new_pin:
return jsonify({"error": "pin required"}), 400
if not new_pin.isdigit() or not (4 <= len(new_pin) <= 8):
return jsonify({"error": "PIN must be 4–8 digits"}), 400
had_ipin = bool(get_interface_pin_hash())
if had_ipin:
if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
set_interface_pin(new_pin)
_audit("interface_pin_change" if had_ipin else "interface_pin_set", "",
ip=request.remote_addr or "")
return jsonify({"ok": True})
@bp.route("/api/interface/pin", methods=["DELETE"])
def interface_pin_clear():
"""Remove the interface PIN. Requires current PIN if one is set."""
body = request.get_json(silent=True) or {}
if get_interface_pin_hash():
if not verify_interface_pin(str(body.get("current_pin", "")).strip()):
return jsonify({"error": "current PIN is incorrect"}), 403
clear_interface_pin()
_audit("interface_pin_clear", "", ip=request.remote_addr or "")
return jsonify({"ok": True})
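The scope handling in `create_token` above accepts a legacy single-email string or a list, treats role and user as mutually exclusive, and checks an optional validity window. A condensed sketch of those rules as a pure function follows; `normalize_scope` is a hypothetical name, `display_name` is left out for brevity, and the window is compared as parsed dates rather than as strings, which is a deliberate deviation from the route:

```python
from datetime import datetime


def normalize_scope(raw_scope):
    """Return (scope, error); exactly one of the two is None."""
    if not isinstance(raw_scope, dict):
        return None, "scope must be an object"
    role = str(raw_scope.get("role", "")).strip()
    raw_user = raw_scope.get("user", "")
    if isinstance(raw_user, str):
        emails = [raw_user.strip().lower()] if raw_user.strip() else []
    elif isinstance(raw_user, list):
        emails = [str(e).strip().lower() for e in raw_user if str(e).strip()]
    else:
        emails = []
    if role and emails:
        return None, "scope.role and scope.user are mutually exclusive"
    if role not in ("", "student", "staff"):
        return None, "scope.role must be '', 'student', or 'staff'"
    if emails and not all("@" in e for e in emails):
        return None, "scope.user entries must be valid email addresses"

    window, parsed = {}, {}
    for field in ("valid_from", "valid_to"):
        value = str(raw_scope.get(field, "")).strip()
        if value:
            try:
                parsed[field] = datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                return None, f"scope.{field} must be YYYY-MM-DD"
            window[field] = value
    if len(parsed) == 2 and parsed["valid_from"] > parsed["valid_to"]:
        return None, "scope.valid_from must not be after scope.valid_to"

    scope = {"user": emails} if emails else ({"role": role} if role else {})
    scope.update(window)
    return scope, None


staff_scope = normalize_scope({"role": "staff"})
user_scope = normalize_scope({"user": " Teacher@Example.com "})
conflict = normalize_scope({"role": "staff", "user": "a@example.com"})
bad_window = normalize_scope({"valid_from": "2024-05-01", "valid_to": "2024-04-01"})
bad_date = normalize_scope({"valid_from": "2024-13-01"})
```

Comparing parsed dates guards against inputs such as `2024-1-5`, which `strptime` accepts with `%m` and `%d` but which do not sort correctly as plain strings.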

@ -54,7 +54,6 @@ def _get_scan_meta():
try:
from m365_connector import (
M365Connector, M365Error, M365PermissionError, M365DeltaTokenExpired,
M365DriveNotFound,
MSAL_OK, REQUESTS_OK,
)
CONNECTOR_OK = True
@ -63,7 +62,6 @@ except ImportError:
M365Error = Exception
M365PermissionError = Exception
M365DeltaTokenExpired = Exception
M365DriveNotFound = Exception
MSAL_OK = False
REQUESTS_OK = False
CONNECTOR_OK = False
@ -75,12 +73,6 @@ except ImportError:
FileScanner = None # type: ignore[assignment,misc]
FILE_SCANNER_OK = False
try:
from sftp_connector import SFTPScanner, SFTP_OK as _SFTP_OK
except ImportError:
SFTPScanner = None # type: ignore[assignment,misc]
_SFTP_OK = False
try:
import document_scanner as ds
SCANNER_OK = True
@ -105,17 +97,13 @@ except ImportError:
# Stubs for standalone import — overwritten by gdpr_scanner.py injections
LANG: dict = {}
PHOTO_EXTS: set = set()
VIDEO_EXTS: set = set()
AUDIO_EXTS: set = set()
SUPPORTED_EXTS: set = set()
# cpr_detector helpers — injected by gdpr_scanner.py
def _scan_bytes(content, filename, poppler_path=None, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes_timeout(content, filename, timeout=60, lang="dan+eng"): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes(content, filename, poppler_path=None): return {"cprs": [], "dates": []} # type: ignore[misc]
def _scan_bytes_timeout(content, filename, timeout=60): return {"cprs": [], "dates": []} # type: ignore[misc]
def _detect_photo_faces(content, filename): return 0 # type: ignore[misc]
def _extract_exif(content, filename): return {} # type: ignore[misc]
def _extract_video_metadata(content, filename): return {} # type: ignore[misc]
def _extract_audio_metadata(content, filename): return {} # type: ignore[misc]
def _make_thumb(content, filename): return "" # type: ignore[misc]
def _placeholder_svg(ext, name): return "" # type: ignore[misc]
def _check_special_category(text, cprs): return [] # type: ignore[misc]
@ -125,8 +113,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
# checkpoint helpers — injected by gdpr_scanner.py
def _checkpoint_key(opts): return "" # type: ignore[misc]
def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
def _load_checkpoint(key, **kw): return None # type: ignore[misc]
def _clear_checkpoint(**kw): pass # type: ignore[misc]
def _load_checkpoint(key): return None # type: ignore[misc]
def _clear_checkpoint(): pass # type: ignore[misc]
def _load_delta_tokens(): return {} # type: ignore[misc]
def _save_delta_tokens(t): pass # type: ignore[misc]
@ -157,21 +145,18 @@ def _with_disposition(card: dict, db) -> dict:
def run_file_scan(source: dict):
"""Scan a single local, SMB, or SFTP file source for CPR numbers and PII.
"""Scan a single local or SMB file source for CPR numbers and PII.
Reuses _scan_bytes, _broadcast_card, _check_special_category,
_detect_photo_faces and all other existing scan helpers.
Args:
source: file source dict with keys:
source_type ("local"|"smb"|"sftp"), path, label,
smb_host, smb_user, smb_domain, keychain_key,
sftp_host, sftp_port, sftp_user, sftp_auth, sftp_key_path,
path, label, smb_host, smb_user, smb_domain, keychain_key,
scan_photos (bool), max_file_mb (int)
"""
# state vars accessed via _state module
source_kind = source.get("source_type", "")
path = source.get("path", "")
label = source.get("label") or path
smb_host = source.get("smb_host") or None
@ -179,20 +164,10 @@ def run_file_scan(source: dict):
smb_domain = source.get("smb_domain") or ""
keychain_key= source.get("keychain_key") or None
smb_password= source.get("smb_password") or None
scan_photos = bool(source.get("scan_photos", False))
skip_gps_images = bool(source.get("skip_gps_images", False))
min_cpr_count = max(1, int(source.get("min_cpr_count", 1)))
scan_emails = bool(source.get("scan_emails", False))
scan_phones = bool(source.get("scan_phones", False))
cpr_only = bool(source.get("cpr_only", False))
ocr_lang = str(source.get("ocr_lang", "dan+eng")) or "dan+eng"
max_mb = int(source.get("max_file_mb", 50))
scan_photos = bool(source.get("scan_photos", False))
max_mb = int(source.get("max_file_mb", 50))
if source_kind == "sftp":
if not _SFTP_OK:
broadcast("scan_error", {"file": label, "error": "paramiko not installed — run: pip install paramiko"})
return
elif not FILE_SCANNER_OK:
if not FILE_SCANNER_OK:
broadcast("scan_error", {"file": label, "error": "file_scanner.py not found"})
return
@ -203,61 +178,30 @@ def run_file_scan(source: dict):
_db_scan_id: int | None = None
if _db:
try:
_db_scan_id = _db.begin_scan({
"sources": [source.get("source_type", "local")],
"user_ids": [],
"options": source,
})
_db_scan_id = _db.begin_scan(
sources=[source.get("source_type", "local")],
user_count=0,
options=source,
)
except Exception as e:
logger.error("[db] start_scan failed: %s", e)
# ── Checkpoint: resume from a previous interrupted file scan ─────────────
_ck_prefix = f"file_{source.get('id', 'local')}"
_ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
_ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
_file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
_file_flagged: list = [] # items found by this file scan run (for checkpoint)
_ck_resumed = len(_file_scanned_ids)
if _ck:
_file_flagged = list(_ck.get("flagged", []))
for card in _file_flagged:
_state.flagged_items.append(card)
broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
for card in _file_flagged:
broadcast("scan_file_flagged", _with_disposition(card, _db))
_CHECKPOINT_SAVE_EVERY_FILE = 25
_file_items_since_save = 0
total_scanned = 0
total_flagged = 0
broadcast("scan_start", {"sources": [label]})
broadcast("scan_phase", {"phase": f"Files \u2014 {label}"})
try:
if source_kind == "sftp":
fs = SFTPScanner(
host=source.get("sftp_host", ""),
root_path=path,
username=source.get("sftp_user", ""),
port=int(source.get("sftp_port", 22)),
auth_type=source.get("sftp_auth", "password"),
password=source.get("sftp_password") or None,
key_path=source.get("sftp_key_path") or None,
passphrase=source.get("sftp_passphrase") or None,
keychain_key=keychain_key,
max_file_bytes=max_mb * 1_048_576,
label=label,
)
else:
fs = FileScanner(
path=path,
smb_host=smb_host,
smb_user=smb_user,
smb_password=smb_password,
smb_domain=smb_domain,
keychain_key=keychain_key,
max_file_bytes=max_mb * 1_048_576,
)
fs = FileScanner(
path=path,
smb_host=smb_host,
smb_user=smb_user,
smb_password=smb_password,
smb_domain=smb_domain,
keychain_key=keychain_key,
max_file_bytes=max_mb * 1_048_576,
)
def _progress(rel_path: str):
broadcast("scan_file", {"file": rel_path})
@ -266,10 +210,6 @@ def run_file_scan(source: dict):
if _state._scan_abort.is_set():
break
if rel_path in _file_scanned_ids:
total_scanned += 1
continue
total_scanned += 1
broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
@ -284,41 +224,26 @@ def run_file_scan(source: dict):
ext = Path(rel_path).suffix.lower()
# CPR scan — skip for images, video and audio (no text layer)
# CPR scan — skip for images (no text layer; EXIF/face detection handles them)
result: dict = {"cprs": [], "dates": []}
if ext not in PHOTO_EXTS and ext not in VIDEO_EXTS and ext not in AUDIO_EXTS:
if ext not in PHOTO_EXTS:
try:
result = _scan_bytes_timeout(content, rel_path, lang=ocr_lang)
result = _scan_bytes_timeout(content, rel_path)
except Exception as e:
broadcast("scan_error", {"file": rel_path, "error": str(e)})
continue
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
cprs = result.get("cprs", [])
# Photo / biometric scan + EXIF/video/audio metadata extraction
# Photo / biometric scan + EXIF extraction
_face_count = 0
_exif = {}
if ext in PHOTO_EXTS:
if scan_photos:
_face_count = _detect_photo_faces(content, rel_path)
_exif = _extract_exif(content, rel_path)
elif ext in VIDEO_EXTS:
_exif = _extract_video_metadata(content, rel_path)
elif ext in AUDIO_EXTS:
_exif = _extract_audio_metadata(content, rel_path)
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
if not (_cpr_qualifies and cprs) and (cpr_only or (not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii)):
if not cprs and _face_count == 0 and not _exif.get("has_pii"):
continue
# Build card metadata
@ -331,9 +256,9 @@ def run_file_scan(source: dict):
_sc = _check_special_category(_file_text, cprs)
if _face_count > 0 and "biometric" not in _sc:
_sc = sorted(_sc + ["biometric"])
if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
if _exif.get("gps") and "gps_location" not in _sc:
_sc = sorted(_sc + ["gps_location"])
if _exif_has_pii and "exif_pii" not in _sc:
if _exif.get("has_pii") and "exif_pii" not in _sc:
_sc = sorted(_sc + ["exif_pii"])
# Thumbnail for images
@ -354,8 +279,6 @@ def run_file_scan(source: dict):
"source": label,
"source_type": source_type,
"cpr_count": len(cprs),
"email_count": len(_distinct_emails),
"phone_count": len(_distinct_phones),
"url": "",
"size_kb": meta["size_kb"],
"modified": meta["modified"],
@ -376,7 +299,6 @@ def run_file_scan(source: dict):
}
_state.flagged_items.append(card)
_file_flagged.append(card)
total_flagged += 1
broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -386,19 +308,10 @@ def run_file_scan(source: dict):
except Exception as e:
logger.error("[db] save_item failed: %s", e)
_file_scanned_ids.add(rel_path)
_file_items_since_save += 1
if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
_save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
_file_items_since_save = 0
except Exception as e:
import traceback
broadcast("scan_error", {"file": label, "error": str(e)})
logger.error("[file_scan] error:\n%s", traceback.format_exc())
else:
if not _state._scan_abort.is_set():
_clear_checkpoint(prefix=_ck_prefix)
finally:
if _db and _db_scan_id:
try:
@ -476,12 +389,6 @@ def run_scan(options: dict):
max_emails = int(scan_opts.get("max_emails", 2000))
delta_enabled = bool(scan_opts.get("delta", False))
scan_photos = bool(scan_opts.get("scan_photos", False)) # biometric photo scan (#9)
skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1)))
ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
cpr_only = bool(scan_opts.get("cpr_only", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
# Delta token state — loaded once, updated per-source, saved on completion
delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@ -535,8 +442,6 @@ def run_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("webUrl", "") or item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -553,7 +458,6 @@ def run_scan(options: dict):
"special_category": item_meta.get("_special_category", []),
"face_count": item_meta.get("_face_count", 0),
"exif": item_meta.get("_exif", {}),
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
_state.flagged_items.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -853,10 +757,6 @@ def run_scan(options: dict):
work_items.append(("file", item, None))
except M365PermissionError:
broadcast("scan_error", {"file": f"OneDrive ({uname})", "error": _permission_msg("OneDrive", uname)})
except M365DriveNotFound:
# OneDrive not provisioned for this user (no licence, service plan
# disabled, or drive never initialised). Not a scan error — skip silently.
broadcast("scan_phase", {"phase": f"OneDrive ({uname}): not provisioned — skipped"})
except Exception as e:
broadcast("scan_error", {"file": f"OneDrive ({uname})", "error": str(e)})
else:
@ -1078,14 +978,6 @@ def run_scan(options: dict):
if _check_abort():
# Save checkpoint so scan can be resumed later
_save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
# Finalise the DB scan record so items found before the stop stay
# visible — this early return otherwise skips finish_scan below,
# stranding them (invisible to get_session_items / get_open_items).
if _db and _db_scan_id:
try:
_db.finish_scan(_db_scan_id, resumed_count + idx + 1)
except Exception as _e:
logger.error("[db] finish_scan (aborted) failed: %s", _e)
return
idx += 1
kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately
@ -1112,18 +1004,12 @@ def run_scan(options: dict):
# Scan body — use pre-extracted text (body HTML was stripped at
# collection time to keep work_items memory footprint small)
all_cprs = []
all_emails = []
all_phones = []
body_text = ""
all_cprs = []
body_text = ""
if scan_email_body:
body_text = meta.pop("_precomputed_body", "")
body_text = meta.pop("_precomputed_body", "")
body_result = _scan_text_direct(body_text)
all_cprs = list(body_result.get("cprs", []))
if scan_emails:
all_emails = list(body_result.get("emails", []))
if scan_phones:
all_phones = list(body_result.get("phones", []))
all_cprs = list(body_result.get("cprs", []))
# Scan attachments
uid = meta.get("_account_id", "me")
@ -1143,31 +1029,21 @@ def run_scan(options: dict):
try:
att_bytes = (conn.download_attachment_for(uid, msg_id, att["id"])
if uid != "me" else conn.download_attachment(msg_id, att["id"]))
att_result = _scan_bytes(att_bytes, att_name, lang=ocr_lang)
att_result = _scan_bytes(att_bytes, att_name)
att_cprs = att_result.get("cprs", [])
all_cprs.extend(att_cprs)
if scan_emails:
all_emails.extend(att_result.get("emails", []))
if scan_phones:
all_phones.extend(att_result.get("phones", []))
att_results.append({"name": att_name, "cpr_count": len(att_cprs)})
except Exception as att_err:
broadcast("scan_error", {"file": att_name, "error": str(att_err)})
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in all_emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in all_phones))
if all_cprs or (not cpr_only and (_distinct_emails or _distinct_phones)):
if all_cprs:
meta["_thumb"] = _placeholder_svg(".eml", subject)
meta["_thumb_is_jpeg"] = False
meta["_attachments"] = att_results
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_email_pii = _get_pii_counts(body_text) if scan_email_body else {}
meta["_transfer_risk"] = _check_transfer_risk(meta)
meta["_special_category"] = _check_special_category(
body_text if scan_email_body else "", all_cprs)
# Store a short excerpt so preview still works if Graph is unavailable
meta["_body_excerpt"] = body_text[:500].strip() if body_text else ""
_broadcast_card(meta, all_cprs, pii_counts=_email_pii)
del body_text # free email text — may be large for HTML-rich emails
@ -1192,37 +1068,19 @@ def run_scan(options: dict):
content = conn.download_drive_item_for(uid, item_id)
else:
content = conn.download_item(meta)
result = _scan_bytes(content, name)
cprs = result.get("cprs", [])
# CPR/email/phone scan — skip for video and audio (metadata-only; no text layer)
_media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS
result = {"cprs": [], "dates": [], "emails": [], "phones": []} if _media_only else _scan_bytes(content, name, lang=ocr_lang)
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─
# ── Biometric photo scan (#9) + EXIF (#18) ───────────────
_face_count = 0
_exif = {}
if ext in PHOTO_EXTS:
if scan_photos:
_face_count = _detect_photo_faces(content, name)
_exif = _extract_exif(content, name)
elif ext in VIDEO_EXTS:
_exif = _extract_video_metadata(content, name)
elif ext in AUDIO_EXTS:
_exif = _extract_audio_metadata(content, name)
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
# Flag item if CPRs/emails/phones found, faces detected, or EXIF PII found
if (_cpr_qualifies and cprs) or (not cpr_only and (_distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii)):
# Flag item if CPRs found, faces detected, or EXIF PII found
if cprs or _face_count > 0 or _exif.get("has_pii"):
# Make thumbnail
if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
thumb = _make_thumb(content, name)
@ -1251,15 +1109,13 @@ def run_scan(options: dict):
# the category even when no CPR is present in the file.
if _face_count > 0 and "biometric" not in _sc:
_sc = sorted(_sc + ["biometric"])
if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
if _exif.get("gps") and "gps_location" not in _sc:
_sc = sorted(_sc + ["gps_location"])
if _exif_has_pii and "exif_pii" not in _sc:
if _exif.get("has_pii") and "exif_pii" not in _sc:
_sc = sorted(_sc + ["exif_pii"])
meta["_special_category"] = _sc
meta["_face_count"] = _face_count
meta["_exif"] = _exif
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_broadcast_card(meta, cprs, pii_counts=_file_pii)
else:
del content # no hits — free raw bytes immediately

View File

@ -43,7 +43,6 @@ _DEFAULT_JOB: dict[str, Any] = {
"profile_id": "",
"auto_email": False,
"auto_retention": False,
"report_only": False,
"retention_years": None,
"fiscal_year_end": None,
}
@ -271,35 +270,6 @@ class ScanScheduler:
})
from routes import state
# ── Report-only path: skip scan, email latest DB results ──────────
if job_cfg.get("report_only"):
if not _m.flagged_items and _m.DB_OK:
try:
_db_inst = _m._get_db()
_db_rows = _db_inst.get_session_items() if _db_inst else []
if _db_rows:
_m.flagged_items[:] = _db_rows
except Exception:
pass
if not _m.flagged_items:
raise RuntimeError(
"No scan results available — run a scan first")
run["flagged"] = len(_m.flagged_items)
run["scanned"] = 0
run["status"] = "completed"
try:
self._send_email_report(job_cfg)
run["emailed"] = 1
except Exception as _re:
run["status"] = "failed"
run["error"] = f"Email failed: {_re}"
_m.broadcast("scheduler_done", {
"flagged": run["flagged"], "scanned": 0,
"emailed": run["emailed"], "job_name": job_cfg.get("name", ""),
})
return
# If connector not set, attempt to restore from saved config
if not state.connector or not state.connector.is_authenticated():
try:
@ -340,16 +310,6 @@ class ScanScheduler:
# Fire file scan for each file source in the profile
# file_sources may be IDs (strings) or full dicts — resolve either
_all_file_sources = {s["id"]: s for s in (_m._load_file_sources() or []) if isinstance(s, dict)}
# Merge per-scan options from the profile so the file scan honours
# cpr_only/ocr_lang/scan_photos/etc. (the browser does this in
# startScan(); the scheduler must mirror it).
_profile_opts = options.get("options", {}) or {}
_FS_OPT_KEYS = (
"scan_photos", "skip_gps_images", "min_cpr_count",
"scan_emails", "scan_phones", "cpr_only", "ocr_lang",
"max_file_mb",
)
_fs_extra = {k: _profile_opts[k] for k in _FS_OPT_KEYS if k in _profile_opts}
for fs in options.get("file_sources", []):
# Resolve string IDs to full source dicts
if isinstance(fs, str):
@ -357,7 +317,6 @@ class ScanScheduler:
if not isinstance(fs, dict) or not fs.get("path"):
logger.warning("[scheduler] skipping invalid file source: %r", fs)
continue
fs = {**fs, **_fs_extra}
try:
_m.run_file_scan(fs)
except Exception as _fse:
@ -473,7 +432,7 @@ class ScanScheduler:
logger.info("[scheduler] Profile '%s': sources=%s, users=%d",
p.get("name", pid), opts["sources"], len(opts.get("user_ids", [])))
_m.broadcast("scheduler_debug", {
"msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get('user_ids',[]))}"})
"msg": f"Using profile '{p.get('name',pid)}': sources={opts['sources']}, users={len(opts.get("user_ids",[]))}"})
return opts
logger.info("[scheduler] Profile '%s' not found — using saved settings", pid)
_m.broadcast("scheduler_debug", {"msg": f"Profile id '{pid}' not found — falling back to saved settings"})
@ -496,15 +455,11 @@ class ScanScheduler:
raise RuntimeError("No email recipients configured")
job_name = job_cfg.get("name", "scheduled scan")
subject = f"GDPR Scanner — {job_name} {datetime.now().strftime('%Y-%m-%d %H:%M')}"
if job_cfg.get("report_only"):
scan_line = f"Report on latest scan results. {len(_m.flagged_items)} item(s) flagged."
else:
scan_line = f"Scan completed. {len(_m.flagged_items)} item(s) flagged."
body = (
"<html><body style='font-family:Arial,sans-serif;color:#333;padding:24px'>"
"<h2 style='color:#1F3864'>&#128336; GDPR Scanner — scheduled scan report</h2>"
f"<p>Job: <strong>{job_name}</strong></p>"
f"<p>{scan_line}</p>"
f"<p>Scan completed. {len(_m.flagged_items)} item(s) flagged.</p>"
f"<p>Report attached: {fname}</p></body></html>")
from routes.email import _send_email_graph
from routes import state

View File

@ -1,292 +0,0 @@
"""
sftp_connector.py — SFTP file iterator for GDPR Scanner.
Provides SFTPScanner.iter_files() which yields (relative_path, bytes, metadata)
for files on an SFTP/SSH server, using the same interface as FileScanner so that
run_file_scan() in scan_engine.py works identically for all three source types.
Optional dependency:
paramiko>=3.4 SSH/SFTP client (pip install paramiko)
If paramiko is not installed, SFTP_OK is False and callers must check before use.
"""
from __future__ import annotations
import stat
import time
from pathlib import PurePosixPath
from typing import Iterator
from file_scanner import SKIP_DIRS, MAX_FILE_BYTES, _skip, _error, KEYCHAIN_SERVICE
# ── Optional dependency ───────────────────────────────────────────────────────
try:
import paramiko
SFTP_OK = True
except ImportError:
SFTP_OK = False
try:
import keyring as _keyring
_KEYRING_OK = True
except ImportError:
_KEYRING_OK = False
# ── Credential helpers ────────────────────────────────────────────────────────
def get_sftp_password(host: str, user: str, keychain_key: str | None = None) -> str | None:
"""Return SFTP password or key passphrase from OS keychain."""
if not _KEYRING_OK:
return None
account = keychain_key or f"sftp:{user}@{host}"
try:
return _keyring.get_password(KEYCHAIN_SERVICE, account) or None
except Exception:
return None
def store_sftp_password(host: str, user: str, password: str,
keychain_key: str | None = None) -> bool:
"""Store SFTP password or passphrase in the OS keychain. Returns True on success."""
if not _KEYRING_OK:
return False
account = keychain_key or f"sftp:{user}@{host}"
try:
_keyring.set_password(KEYCHAIN_SERVICE, account, password)
return True
except Exception:
return False
# ── SFTPScanner ───────────────────────────────────────────────────────────────
class SFTPScanner:
"""SFTP file iterator — identical iter_files() interface to FileScanner."""
def __init__(
self,
host: str,
root_path: str,
username: str,
port: int = 22,
auth_type: str = "password", # "password" | "key"
password: str | None = None,
key_path: str | None = None,
passphrase: str | None = None,
keychain_key: str | None = None,
max_file_bytes: int = MAX_FILE_BYTES,
label: str = "",
):
self.host = host
self.port = port
self.root_path = root_path.rstrip("/") or "/"
self.username = username
self.auth_type = auth_type
self.key_path = key_path
self.keychain_key = keychain_key
self.max_file_bytes = max_file_bytes
self.label = label or f"{username}@{host}"
# Resolve credentials from keychain if not provided directly
self._password = password
self._passphrase = passphrase
if not self._password and auth_type == "password":
self._password = get_sftp_password(host, username, keychain_key)
if not self._passphrase and auth_type == "key" and key_path:
self._passphrase = get_sftp_password(host, username, keychain_key)
@staticmethod
def sftp_available() -> bool:
return SFTP_OK
@property
def source_type(self) -> str:
return "sftp"
# ── Public ────────────────────────────────────────────────────────────────
def iter_files(
self,
extensions: set[str] | None = None,
progress_cb=None,
) -> Iterator[tuple[str, bytes | None, dict]]:
"""Yield (relative_path, content_bytes, metadata) for every scannable file.
Same contract as FileScanner.iter_files() — oversized and unreadable files
yield a sentinel with content=None and meta['skipped']=True.
"""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
exts = extensions or DEFAULT_EXTENSIONS
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
connect_kwargs: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
pkey = _load_pkey(self.key_path, self._passphrase)
connect_kwargs["pkey"] = pkey
else:
connect_kwargs["password"] = self._password or ""
# Disable agent and key lookup when using password so paramiko doesn't
# prompt interactively when the server advertises pubkey auth.
connect_kwargs["look_for_keys"] = False
connect_kwargs["allow_agent"] = False
ssh.connect(**connect_kwargs)
try:
sftp = ssh.open_sftp()
try:
yield from self._walk(sftp, self.root_path, exts, progress_cb)
finally:
sftp.close()
finally:
ssh.close()
def _ssh_connect(self):
"""Return a connected paramiko SSHClient. Caller must call .close()."""
if not SFTP_OK:
raise RuntimeError("paramiko not installed — run: pip install paramiko")
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
kw: dict = {
"hostname": self.host,
"port": self.port,
"username": self.username,
"timeout": 30,
}
if self.auth_type == "key" and self.key_path:
kw["pkey"] = _load_pkey(self.key_path, self._passphrase)
else:
kw["password"] = self._password or ""
kw["look_for_keys"] = False
kw["allow_agent"] = False
ssh.connect(**kw)
return ssh
def read_file(self, remote_path: str) -> bytes:
"""Download and return the raw bytes of a single remote file."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "rb") as fh:
return fh.read()
finally:
sftp.close()
finally:
ssh.close()
def write_file(self, remote_path: str, content: bytes) -> None:
"""Write content to remote_path on the SFTP server, overwriting if it exists."""
ssh = self._ssh_connect()
try:
sftp = ssh.open_sftp()
try:
with sftp.open(remote_path, "wb") as fh:
fh.write(content)
finally:
sftp.close()
finally:
ssh.close()
# ── Private walker ────────────────────────────────────────────────────────
def _walk(
self,
sftp,
directory: str,
exts: set[str],
progress_cb,
) -> Iterator[tuple[str, bytes | None, dict]]:
source_root = f"sftp://{self.username}@{self.host}{self.root_path}"
try:
entries = sftp.listdir_attr(directory)
except OSError as e:
rel = _rel(directory, self.root_path) or "."
yield _error(rel, str(e), "sftp", source_root)
return
for attr in entries:
name = attr.filename
if name.startswith("."):
continue
if name.lower() in SKIP_DIRS:
continue
full_remote = f"{directory}/{name}".replace("//", "/")
rel = _rel(full_remote, self.root_path)
if attr.st_mode is not None and stat.S_ISDIR(attr.st_mode):
yield from self._walk(sftp, full_remote, exts, progress_cb)
continue
ext = PurePosixPath(name).suffix.lower()
if ext not in exts:
continue
size = attr.st_size or 0
if size > self.max_file_bytes:
yield _skip(rel, size, "sftp", source_root)
continue
if progress_cb:
progress_cb(rel)
modified = (
time.strftime("%Y-%m-%d", time.gmtime(attr.st_mtime))
if attr.st_mtime else ""
)
meta = {
"size_kb": round(size / 1024, 1),
"modified": modified,
"source_type": "sftp",
"source_root": source_root,
"full_path": full_remote,
"skipped": False,
}
try:
with sftp.open(full_remote, "rb") as fh:
content = fh.read(self.max_file_bytes)
yield rel, content, meta
except OSError as e:
yield _error(rel, str(e), "sftp", source_root)
# ── Helpers ───────────────────────────────────────────────────────────────────
def _rel(full_path: str, root: str) -> str:
"""Return path relative to root, stripping leading slash."""
if full_path.startswith(root):
return full_path[len(root):].lstrip("/")
return full_path.lstrip("/")
def _load_pkey(key_path: str, passphrase: str | None):
"""Load a private key from disk, trying RSA → Ed25519 → ECDSA → DSS."""
for cls in (
paramiko.RSAKey,
paramiko.Ed25519Key,
paramiko.ECDSAKey,
paramiko.DSSKey,
):
try:
return cls.from_private_key_file(key_path, password=passphrase)
except paramiko.ssh_exception.SSHException:
continue
except FileNotFoundError:
raise
raise ValueError(f"Unrecognised private key format: {key_path}")

View File

@ -22,65 +22,8 @@ Never revert to `!!window._googleConnected` / `_fileSources.length > 0` — thos
`_PHASE_SOURCE_MAP` ordering matters — `Google Workspace` must appear before `Gmail` in the map. The email regex uses `/iu` flags — do not drop the `i`.
## Profile startup race conditions — profiles.js + users.js
`loadProfiles()` (fast, local file) resolves before `loadUsers()` (slow, Graph API). The user can select a profile before `S._allUsers` or the sources panel is populated.
- **`user_ids = "all"` must be deferred** — if `S._allUsers` is empty when `_applyProfile()` runs, set `window._pendingProfileAllUsers = true` instead of calling `.forEach()` on an empty array. `loadUsers()` checks this flag after populating `S._allUsers` and selects everyone. Do not remove this — reverting will silently leave all accounts unchecked whenever a profile is chosen on a fast machine before the user list loads.
- **Source checkboxes may not exist yet** — `_applyProfile()` calls `renderSourcesPanel()` first if `#sourcesPanel` contains no `input[data-source-id]` nodes. The same guard is used in `loadUsers()`. Without it, `querySelectorAll` returns nothing and the profile's source selection is discarded; the next `renderSourcesPanel()` call re-renders all sources as checked (their default).
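A minimal sketch of the deferred-selection pattern described above, not the app's actual code. `S` and `win` are hypothetical stand-ins for the shared state object and `window`, and `applyProfileUsers` / `onUsersLoaded` are illustrative names for the relevant parts of `_applyProfile()` and `loadUsers()`.

```javascript
// Hypothetical stand-ins for the shared state object and `window`.
const S = { _allUsers: [], selected: new Set() };
const win = { _pendingProfileAllUsers: false };

// Apply the user part of a profile. If the profile asks for "all" users but
// the user list has not loaded yet, record the intent instead of iterating
// over an empty array.
function applyProfileUsers(profile) {
  if (profile.user_ids === 'all') {
    if (S._allUsers.length === 0) {
      win._pendingProfileAllUsers = true; // resolved later in onUsersLoaded()
      return;
    }
    S._allUsers.forEach((u) => S.selected.add(u.id));
    return;
  }
  (profile.user_ids || []).forEach((id) => S.selected.add(id));
}

// Runs when the slow user fetch resolves; honours a pending "all" request.
function onUsersLoaded(users) {
  S._allUsers = users;
  if (win._pendingProfileAllUsers) {
    win._pendingProfileAllUsers = false;
    users.forEach((u) => S.selected.add(u.id));
  }
}
```

With this shape, selecting a profile before the user list arrives selects nothing immediately and everyone once `onUsersLoaded()` runs.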
## SSE teardown — scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. Close `S.es` only when all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so SSE is never closed and future scheduler events continue to arrive.
- **Two separate abort events** — `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort`.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — `scan_cancelled` unconditionally closes the SSE; `google_scan_done` checks whether other scans are still running before closing.
- **`scan_phase` replay sets running flags — handled by `sse_replay_done`** — the `scan_phase` handler sets running flags to `true` whenever all flags are `false` and a source keyword is found in the phase text. On page refresh this fires during SSE replay of a completed scan, temporarily making the scan appear running. The `sse_replay_done` handler retries `loadHistorySession(null)` if no scan is running and `S._historyRefScanId` is still `null` after replay. Do not remove either the flag-setting logic or the retry.
- **Google Drive uses a lazy generator, not `list()`** — `iter_drive_files()` is iterated directly so `_check_abort()` fires between items. Wrapping in `list()` blocks the thread for the entire enumeration.
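The teardown rule in the first two bullets can be condensed into a single handler, sketched here under assumptions: the `S` object is a reduced, hypothetical version of the real state, `FLAG_FOR_EVENT` and `onDone` are illustrative names, and the stream is a stub rather than a real `EventSource`. Each done event clears its own flag and closes the stream only when no other scan is running and the user started the scan.

```javascript
// Reduced, hypothetical state; the stream is a stub, not a real EventSource.
const S = {
  _m365ScanRunning: false,
  _googleScanRunning: false,
  _fileScanRunning: false,
  _userStartedScan: true,
  es: { closed: false, close() { this.closed = true; } },
};

// Which running flag each done event clears.
const FLAG_FOR_EVENT = {
  scan_done: '_m365ScanRunning',
  google_scan_done: '_googleScanRunning',
  file_scan_done: '_fileScanRunning',
};

function onDone(eventName) {
  S[FLAG_FOR_EVENT[eventName]] = false;
  const anyRunning =
    S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning;
  // Scheduler-triggered runs (_userStartedScan === false) keep the stream
  // open so later scheduler events still arrive.
  if (!anyRunning && S._userStartedScan && S.es) {
    S.es.close();
  }
}
```

Checking all three flags after clearing the event's own flag is equivalent to the per-event checks listed above.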
## Scan history browser — history.js + results.js
- **`S._historyRefScanId`** — `null` = live/SSE mode **or** the default open-items view; positive int = viewing a past session. Set by `loadHistorySession()`; cleared by `exitHistoryMode()`.
- **`loadHistorySession(null)` → `loadOpenItems()`** — passing `null` no longer resolves to the latest session. It now loads **all open (unactioned) items across every scan** via `GET /api/db/flagged` (no `ref`), leaves `_historyRefScanId` null, and shows no history banner. The "Open items" banner button (`onclick="loadHistorySession(null)"`, key `history_btn_latest`) therefore returns to this open-items view. Specific sessions are still loaded with a positive `ref`, which keeps the re-scan resolved-diff. Do not revert `null` to "resolve latest ref" — that reintroduces the "only the last scan is shown" complaint.
- **Auto-load on page load** — `_sseWatchdog()` in `results.js` calls `window.loadHistorySession?.(null)` whenever `/api/scan/status` reports neither `running` (M365 + file lock) nor `google_running` (Google lock) **and** nothing is shown yet (`!S._historyRefScanId && !S.flaggedData.length`). This is **not one-shot** — it retries on every 4s poll until a session is restored, because (a) the replay buffer is empty after a server restart so `sse_replay_done` never fires, and (b) a completed scan's replayed `scan_phase` can leave a running flag set that would otherwise block the load forever. Because both locks are confirmed free, the watchdog clears the stale `_m365/_google/_fileScanRunning` flags before calling. Do not revert to a one-shot `_initialStatusChecked` gate — that reintroduces the "blank grid after refresh/restart" bug. `/api/scan/status` **must** report `google_running` separately; `running` alone misses live Google scans. The `sse_replay_done` handler in `scan.js` still retries for the non-empty-buffer (no-restart) case.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Do not hide/show from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` so the outside-click handler works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation** — `invalidateHistoryCache()` clears `_sessions` and `_latestRefScanId`. All three `*_done` SSE handlers call `window.invalidateHistoryCache?.()`.
- **Re-scan diff** — items present in the previous session but absent from the current one are tagged `_resolved: true`, rendered with `.card-resolved` and a green ✓ badge, and NOT added to `S.flaggedData` (grid-only, cannot be bulk-selected or exported).
- **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid.
- **`renderGrid(files)` hides the landing cards** — whenever `files.length > 0` it hides `#emptyState` and `#lastScanSummary` and shows `#grid`. This is centralised here because the live `scan_file_flagged` handler (`scan.js`) shows the grid but does NOT clear those panels, so results would render *underneath* a still-visible landing/last-scan card until a manual refresh. Do not move this hiding back into individual callers — every render path (live SSE, `loadOpenItems`, history, filters) must clear the landing. The empty case (`files.length === 0`) is left untouched so callers still control the empty/landing state.
## Card user/group badge — results.js
- **`_accountPill(f)`** builds the account/role pill for both card layouts (list + grid). The **group badge is driven by `f.user_role`** (`student`/`staff`) alone, so it renders even with no display name — items from scans saved before `account_name` was persisted (DB migration 11) have only `user_role` + `account_id`. The user label resolves best-effort: `f.account_name` → `S._allUsers` match (by `id` or `email`) → email-style `account_id` → omit. Do not re-nest the role badge inside an `account_name` check (the old bug) — that hides the group badge for legacy items. Both layouts call `_accountPill(f)`; keep them sharing the one helper.
## CPR cross-referencing — results.js
- **`_loadRelated(f)`** — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview`.
- **`window._openRelated(id, itemData)`** — looks up `id` in `S.flaggedData` first, falls back to `itemData` from the API response for items not yet in the grid.
## Sources panel resize — log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()`. Clears inline height, reads `scrollHeight`, then restores a saved preference from `localStorage` (`gdpr_sources_h`) or pins to `scrollHeight`.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. Captures `scrollHeight` as hard max on `pointerdown`; saves to `localStorage` on release.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered** — `scrollHeight` will be 0.
## Viewer mode — viewer.js
- **`window.VIEWER_MODE`** — injected by Jinja2. `auth.js` adds `viewer-mode` class to `<body>`; all hide rules are CSS (`body.viewer-mode …`) except `delBtn` which is also guarded in JS.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. If `VIEWER_SCOPE.role` is set, `auth.js` pre-sets `#filterRole` and hides the dropdown.
- **Token onclick attributes** — Copy/Revoke buttons pass the token as a single-quoted JS string literal, never via `JSON.stringify` (which produces double-quoted strings that break `onclick="…"` attributes).
- **Share link base URL** — `_getShareBaseUrl()` uses `window.location.origin` whenever the page is served over HTTPS or from a non-localhost host (a reverse-proxied hostname or LAN IP is already routable, and rewriting it to `http://<LAN-IP>` would bypass the proxy's TLS). Only when browsing at `localhost`/`127.0.0.1` over HTTP does it fetch `/api/local_ip` (LAN IP via UDP probe to `8.8.8.8`) so copied links work from other machines. The result is cached in `_shareBaseUrl` so Copy buttons stay within the click gesture. Both `createShareLink` and `copyTokenLink` are `async`. Do not make it return bare `window.location.origin` unconditionally — that reintroduces unusable `127.0.0.1` links.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`. `switchSettingsTab('security')` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`.
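The share-link decision described in the list above can be sketched as a small pure check plus a cached async helper. This is an illustration under assumptions, not the code in `viewer.js`: `originIsRoutable`, `getShareBaseUrl`, and the injected `fetchLocalIp` callback are hypothetical names, and reusing the page's port for the LAN URL is a guess about how the real helper builds it.

```javascript
// True when the page origin can be handed to other machines as-is:
// anything over HTTPS, or any host that is not localhost / 127.0.0.1.
function originIsRoutable(loc) {
  const isLocalHost =
    loc.hostname === 'localhost' || loc.hostname === '127.0.0.1';
  return loc.protocol === 'https:' || !isLocalHost;
}

let cachedBase = null; // cached so later copies need no network round trip

// `loc` mirrors window.location; `fetchLocalIp` is an injected async
// function that resolves to the LAN IP string (assumed shape).
async function getShareBaseUrl(loc, fetchLocalIp) {
  if (cachedBase) return cachedBase;
  if (originIsRoutable(loc)) {
    cachedBase = loc.origin;
  } else {
    const ip = await fetchLocalIp();
    // Assumption: keep the page's port when swapping in the LAN IP.
    const port = loc.port ? ':' + loc.port : '';
    cachedBase = ip ? 'http://' + ip + port : loc.origin;
  }
  return cachedBase;
}
```

Keeping the routability check separate from the fetch makes the HTTPS and reverse-proxy cases easy to reason about without touching the network.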
## Gotchas
- **`navigator.clipboard` is `undefined` over plain HTTP** — the app is normally reached at `http://<LAN-IP>:5100`, a non-secure context where the Clipboard API does not exist, so calling `navigator.clipboard.writeText(...)` throws synchronously (a `.catch()` on it never runs). Always copy via `window._copyText(text, btn)` (defined in `viewer.js`) — it feature-detects the API and falls back to `document.execCommand('copy')`, then to a `prompt()`. Because `execCommand` needs a user gesture, don't `await` network calls between the click and the copy; `_getShareBaseUrl()` caches its result for this reason.
- **`scheduler.js` strings must use `t()`** — frequency labels, "Next", "Running...", "Disabled", empty-job text, and empty-history text all have translation keys. Do not hard-code English strings in `schedLoad()` or `schedRenderJobs()`.
- **Scheduler UI — `schedToggleReportOnly()`** — dims the Profile row, shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
- **Profile editor accounts** — default to unchecked. Only explicitly saved `user_ids` are checked.
- **Date presets** — stored as `years * 365` (integer days). Do not use `* 365.25`.
- **`copyTokenLink` is async** — called from `onclick` as fire-and-forget. Do not make it synchronous.
- **Escape scan-derived strings with `esc()`** — `results.js` defines `esc()` (escapes `& < > " '`). Every value that originates from scanned content (`f.name`, `f.account_name`, `f.folder`, `f.source`, `f.modified`, `label`, image `alt`, and the same fields on `item`/related rows) must pass through `esc()` before going into `innerHTML` or a `title=`/`alt=` attribute. These are attacker-influenceable (e.g. a file named with markup), so an unescaped interpolation is stored XSS — including in shared read-only viewer sessions. Numeric counts (`cpr_count`, `size_kb`) don't need it. When embedding an object in an `onclick` payload, also `.replace(/"/g,'&quot;')` the `JSON.stringify(...)`.
- **`copyTokenLink` is async** — called from `onclick` attributes as a fire-and-forget (the Promise is unhandled, which is fine). It `await`s `_getShareBaseUrl()` to get the machine's LAN IP before building the URL. Do not make it synchronous or revert to `window.location.origin` directly.
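
The `esc()` rule above can be illustrated with a minimal sketch. The `esc()` body here is an assumed re-implementation for illustration only (the real one lives in `results.js` and may differ), and `renderCardHtml()` and `onclickPayload()` are hypothetical helper names that do not exist in the codebase.

```javascript
// Assumed stand-in for esc() from results.js: escapes & < > " '.
// "&" is replaced first so entities produced by later steps are not double-escaped.
function esc(value) {
  return String(value == null ? '' : value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical renderer: every scan-derived string goes through esc(),
// both in element content and in the title= attribute.
// The numeric count is interpolated directly.
function renderCardHtml(f) {
  return '<div class="card" title="' + esc(f.name) + '">'
    + '<span class="name">' + esc(f.name) + '</span>'
    + '<span class="count">' + Number(f.cpr_count) + '</span>'
    + '</div>';
}

// Hypothetical helper: JSON embedded in an onclick="..." attribute must not
// contain raw double quotes, or it would terminate the attribute early.
function onclickPayload(obj) {
  return JSON.stringify(obj).replace(/"/g, '&quot;');
}
```

With this sketch, a file named `<img src=x onerror="a">` renders as inert text instead of creating an element, because the markup characters reach `innerHTML` only as entities.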

View File

@ -159,24 +159,6 @@ if (window.VIEWER_MODE) {
document.body.classList.add('viewer-mode');
document.getElementById('authScreen').style.display = 'none';
document.getElementById('scannerScreen').style.display = 'flex';
// If this token is role-scoped, lock the filter to that role and hide the dropdown.
const _scopeRole = (window.VIEWER_SCOPE || {}).role || '';
if (_scopeRole) {
const _fr = document.getElementById('filterRole');
if (_fr) { _fr.value = _scopeRole; _fr.style.display = 'none'; }
}
// If this token is user-scoped, show a locked identity badge and hide irrelevant filters.
const _scopeUserRaw = (window.VIEWER_SCOPE || {}).user;
if (_scopeUserRaw && (Array.isArray(_scopeUserRaw) ? _scopeUserRaw.length : _scopeUserRaw)) {
const _fr = document.getElementById('filterRole');
if (_fr) _fr.style.display = 'none';
const _badge = document.getElementById('viewerIdentityBadge');
if (_badge) {
_badge.textContent = (window.VIEWER_SCOPE || {}).display_name
|| (Array.isArray(_scopeUserRaw) ? _scopeUserRaw[0] : _scopeUserRaw);
_badge.style.display = '';
}
}
try { loadTrend(); } catch(e) {}
} else {
(async function() {

View File

@ -378,19 +378,6 @@ function getGoogleScanOptions() {
// ── File sources pane ─────────────────────────────────────────────────────────
function _srcIcon(s) {
if (s.source_type === 'sftp') return '\uD83D\uDD12';
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
return isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
}
function _srcSubtitle(s) {
if (s.source_type === 'sftp') {
return _esc((s.sftp_user||'')+'@'+(s.sftp_host||'')+(s.path||'/'));
}
return _esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'');
}
function srcFileRenderList() {
const list = document.getElementById('srcFileList');
if (!list) return;
@ -399,8 +386,9 @@ function srcFileRenderList() {
return;
}
list.innerHTML = S._fileSources.map(function(s) {
const icon = _srcIcon(s);
const sid = _esc(s.id||'');
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
const sid = _esc(s.id||'');
const slabel = _esc(s.label||s.path||'');
return '<div class="fsrc-row">'
+'<div class="fsrc-row-head">'
@ -410,47 +398,11 @@ function srcFileRenderList() {
+'<button class="btn-edit" onclick="srcFileEdit(\''+sid+'\')" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer">'+t('m365_fsrc_edit_btn','Edit')+'</button>'
+'<button class="btn-del" onclick="srcFileDelete(\''+sid+'\',\''+slabel+'\')">'+t('m365_profile_delete','Delete')+'</button>'
+'</div></div>'
+'<div class="fsrc-row-path">'+_srcSubtitle(s)+'</div>'
+'<div class="fsrc-row-path">'+_esc(s.path||'')+(s.smb_user?' \u00b7 \uD83D\uDC64 '+_esc(s.smb_user):'')+'</div>'
+'</div>';
}).join('');
}
function srcFileTypeSelect(type) {
document.getElementById('srcFileSourceType').value = type;
var pathRow = document.getElementById('srcFilePathRow');
var smbFields = document.getElementById('srcFileSmbFields');
var sftpFields= document.getElementById('srcFileSftpFields');
if (pathRow) pathRow.style.display = type === 'sftp' ? 'none' : '';
if (smbFields) smbFields.style.display = type === 'smb' ? 'flex' : 'none';
if (sftpFields)sftpFields.style.display= type === 'sftp' ? 'flex' : 'none';
['srcTypeLocal','srcTypeSmb','srcTypeSftp'].forEach(function(id) {
var btn = document.getElementById(id);
if (!btn) return;
var active = (id === 'srcType' + type.charAt(0).toUpperCase() + type.slice(1));
btn.style.background = active ? 'var(--accent)' : 'none';
btn.style.color = active ? '#fff' : 'var(--muted)';
});
}
function srcFileAutoNameSftp() {
var labelEl = document.getElementById('srcFileLabel');
if (labelEl && labelEl._userEdited) return;
var host = (document.getElementById('srcFileSftpHost')||{}).value || '';
if (labelEl && host) labelEl.value = host;
}
function srcFileSftpAuthSelect(authType) {
document.getElementById('srcFileSftpAuth').value = authType;
var pwFields = document.getElementById('srcSftpPwFields');
var keyFields = document.getElementById('srcSftpKeyFields');
var btnPw = document.getElementById('srcSftpAuthPw');
var btnKey = document.getElementById('srcSftpAuthKey');
if (pwFields) pwFields.style.display = authType === 'password' ? '' : 'none';
if (keyFields) keyFields.style.display = authType === 'key' ? 'flex' : 'none';
if (btnPw) { btnPw.style.background = authType==='password'?'var(--accent)':'none'; btnPw.style.color = authType==='password'?'#fff':'var(--muted)'; }
if (btnKey) { btnKey.style.background = authType==='key'?'var(--accent)':'none'; btnKey.style.color = authType==='key'?'#fff':'var(--muted)'; }
}
function srcFileDetectSmb() {
const p = document.getElementById('srcFilePath').value;
const isSmb = p.startsWith('//') || p.startsWith('\\\\');
@ -475,80 +427,30 @@ function srcFileAutoName() {
}
async function srcFileAdd() {
const label = document.getElementById('srcFileLabel').value.trim();
const sourceType = (document.getElementById('srcFileSourceType')||{}).value || 'local';
const stat = document.getElementById('srcFileStatus');
const editIdEl = document.getElementById('srcFileEditId');
const existingId = editIdEl ? editIdEl.value : '';
const label = document.getElementById('srcFileLabel').value.trim();
const path = document.getElementById('srcFilePath').value.trim();
const smbHost = document.getElementById('srcFileSmbHost').value.trim();
const smbUser = document.getElementById('srcFileSmbUser').value.trim();
const smbPw = document.getElementById('srcFileSmbPw').value;
const stat = document.getElementById('srcFileStatus');
if (!label) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_name_required','Name is required.'); document.getElementById('srcFileLabel').focus(); return; }
if (!path) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_path_required','Path is required.'); return; }
stat.style.color='var(--muted)'; stat.textContent=t('m365_fsrc_saving','Saving...');
var body = {label, source_type: sourceType};
if (existingId) body.id = existingId;
if (sourceType === 'sftp') {
const sftpHost = document.getElementById('srcFileSftpHost').value.trim();
const sftpUser = document.getElementById('srcFileSftpUser').value.trim();
const sftpPath = document.getElementById('srcFileSftpPath').value.trim() || '/';
const sftpPort = parseInt(document.getElementById('srcFileSftpPort').value) || 22;
const sftpAuth = document.getElementById('srcFileSftpAuth').value || 'password';
if (!sftpHost) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_host_required','SFTP host is required.'); return; }
if (!sftpUser) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_sftp_user_required','SFTP username is required.'); return; }
Object.assign(body, {sftp_host:sftpHost, sftp_port:sftpPort, sftp_user:sftpUser, sftp_auth:sftpAuth, path:sftpPath});
if (sftpAuth === 'password') {
const sftpPw = document.getElementById('srcFileSftpPw').value;
if (sftpPw) {
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:sftpPw})}); } catch(e){}
}
} else {
// Upload key file if one is selected
const keyFileEl = document.getElementById('srcFileSftpKeyFile');
const keyStatusEl = document.getElementById('srcFileSftpKeyStatus');
const keyPathEl = document.getElementById('srcFileSftpKeyPath');
if (keyFileEl && keyFileEl.files.length && !keyPathEl.value) {
try {
const fd = new FormData(); fd.append('key_file', keyFileEl.files[0]);
const kr = await fetch('/api/file_sources/upload_key',{method:'POST',body:fd});
const kd = await kr.json();
if (kd.error) { stat.style.color='var(--danger)'; stat.textContent=kd.error; return; }
keyPathEl.value = kd.key_path;
if (keyStatusEl) keyStatusEl.textContent = t('m365_fsrc_sftp_key_uploaded','Key uploaded');
} catch(e){ stat.style.color='var(--danger)'; stat.textContent=e.message; return; }
}
body.sftp_key_path = keyPathEl ? keyPathEl.value : '';
const passphrase = (document.getElementById('srcFileSftpPassphrase')||{}).value || '';
if (passphrase) {
const passphraseKey = sftpHost+':'+sftpUser+':passphrase';
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'sftp',sftp_host:sftpHost,sftp_user:sftpUser,password:passphrase,keychain_key:passphraseKey})}); } catch(e){}
body.keychain_key = passphraseKey;
}
}
} else {
const path = document.getElementById('srcFilePath').value.trim();
const smbHost = document.getElementById('srcFileSmbHost').value.trim();
const smbUser = document.getElementById('srcFileSmbUser').value.trim();
const smbPw = document.getElementById('srcFileSmbPw').value;
if (!path) { stat.style.color='var(--danger)'; stat.textContent=t('m365_fsrc_path_required','Path is required.'); return; }
Object.assign(body, {path, smb_host:smbHost, smb_user:smbUser});
if (smbPw && smbUser) {
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({source_type:'smb',smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
}
if (smbPw && smbUser) {
try { await fetch('/api/file_sources/store_creds',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({smb_host:smbHost,smb_user:smbUser,password:smbPw})}); } catch(e){}
}
try {
const editId = document.getElementById('srcFileEditId');
const existingId = editId ? editId.value : '';
const body = {label, path, smb_host:smbHost, smb_user:smbUser};
if (existingId) body.id = existingId;
const r = await fetch('/api/file_sources/save',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)});
const d = await r.json();
if (d.error) { stat.style.color='var(--danger)'; stat.textContent=d.error; return; }
// Reset form
['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw',
'srcFileSftpHost','srcFileSftpUser','srcFileSftpPw','srcFileSftpPassphrase','srcFileSftpKeyPath'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';if(el._userEdited!==undefined)el._userEdited=false;}});
var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value='22';
if (editIdEl) editIdEl.value='';
['srcFileLabel','srcFilePath','srcFileSmbHost','srcFileSmbUser','srcFileSmbPw'].forEach(function(id){const el=document.getElementById(id);if(el){el.value='';el._userEdited=false;}});
if (editId) editId.value='';
const addBtn=document.getElementById('srcFileAddBtn'); if(addBtn) addBtn.textContent=t('m365_fsrc_add_btn','Add');
srcFileTypeSelect('local');
document.getElementById('srcFileSmbFields').style.display='none';
stat.style.color='var(--accent)'; stat.textContent='\u2714 '+t('m365_fsrc_saved','Source saved');
await _loadFileSources();
srcFileRenderList();
@ -560,28 +462,20 @@ function srcFileEdit(id) {
const s = S._fileSources.find(function(x){return x.id===id;});
if (!s) return;
const labelEl = document.getElementById('srcFileLabel');
const pathEl = document.getElementById('srcFilePath');
const hostEl = document.getElementById('srcFileSmbHost');
const userEl = document.getElementById('srcFileSmbUser');
const pwEl = document.getElementById('srcFileSmbPw');
const editId = document.getElementById('srcFileEditId');
if (labelEl) { labelEl.value = s.label||''; labelEl._userEdited = true; }
if (pathEl) pathEl.value = s.path||'';
if (hostEl) hostEl.value = s.smb_host||'';
if (userEl) userEl.value = s.smb_user||'';
if (pwEl) pwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
if (editId) editId.value = id;
var sourceType = s.source_type || (((s.path||'').startsWith('//')||(s.path||'').startsWith('\\\\')) ? 'smb' : 'local');
srcFileTypeSelect(sourceType);
if (sourceType === 'sftp') {
var hostEl = document.getElementById('srcFileSftpHost'); if(hostEl) hostEl.value = s.sftp_host||'';
var portEl = document.getElementById('srcFileSftpPort'); if(portEl) portEl.value = s.sftp_port||22;
var userEl = document.getElementById('srcFileSftpUser'); if(userEl) userEl.value = s.sftp_user||'';
var pathEl = document.getElementById('srcFileSftpPath'); if(pathEl) pathEl.value = s.path||'/';
var authEl = document.getElementById('srcFileSftpAuth'); if(authEl) authEl.value = s.sftp_auth||'password';
srcFileSftpAuthSelect(s.sftp_auth||'password');
if (s.sftp_key_path) { var kp = document.getElementById('srcFileSftpKeyPath'); if(kp) kp.value=s.sftp_key_path; }
} else {
var pathEl2 = document.getElementById('srcFilePath'); if(pathEl2) pathEl2.value = s.path||'';
var smbHostEl = document.getElementById('srcFileSmbHost'); if(smbHostEl) smbHostEl.value = s.smb_host||'';
var smbUserEl = document.getElementById('srcFileSmbUser'); if(smbUserEl) smbUserEl.value = s.smb_user||'';
var smbPwEl = document.getElementById('srcFileSmbPw'); if(smbPwEl) smbPwEl.value = s.smb_user ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
}
const isSmb = (s.path||'').startsWith('//') || (s.path||'').startsWith('\\\\');
const smbFields = document.getElementById('srcFileSmbFields');
if (smbFields) smbFields.style.display = isSmb ? 'flex' : 'none';
const btn = document.getElementById('srcFileAddBtn');
if (btn) btn.textContent = t('m365_fsrc_save_changes','Save changes');
const stat = document.getElementById('srcFileStatus');
@ -653,7 +547,9 @@ function _renderFileSources() {
return;
}
list.innerHTML = S._fileSources.map(function(s) {
const icon = _srcIcon(s);
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
const userPart = s.smb_user ? ' \u00b7 \uD83D\uDC64 ' + _esc(s.smb_user) : '';
const sid = _esc(s.id || '');
const slabel = _esc(s.label || s.path || '');
return '<div class="fsrc-row">'
@ -663,7 +559,7 @@ function _renderFileSources() {
+ '<button class="btn-scan" onclick="fsrcScan(\'' + sid + '\')">&#9654; ' + t('m365_fsrc_scan_btn','Scan') + '</button>'
+ '<button class="btn-del" onclick="fsrcDelete(\'' + sid + '\',\'' + slabel + '\')">' + t('m365_profile_delete','Delete') + '</button>'
+ '</div></div>'
+ '<div class="fsrc-row-path">' + _srcSubtitle(s) + '</div>'
+ '<div class="fsrc-row-path">' + _esc(s.path || '') + userPart + '</div>'
+ '</div>';
}).join('');
}
@ -771,9 +667,6 @@ window.getGoogleScanOptions = getGoogleScanOptions;
window.srcFileRenderList = srcFileRenderList;
window.srcFileDetectSmb = srcFileDetectSmb;
window.srcFileAutoName = srcFileAutoName;
window.srcFileAutoNameSftp = srcFileAutoNameSftp;
window.srcFileTypeSelect = srcFileTypeSelect;
window.srcFileSftpAuthSelect = srcFileSftpAuthSelect;
window.srcFileAdd = srcFileAdd;
window.srcFileEdit = srcFileEdit;
window.srcFileDelete = srcFileDelete;

View File

@ -1,255 +0,0 @@
// ── Scan history browser ──────────────────────────────────────────────────────
// Lets the user load and browse results from any past scan session without
// running a new scan. Sessions are groups of concurrent M365 + Google + File
// scans (same 300-second window used by get_session_items on the server).
import { S } from './state.js';
const _SRC_LABELS = {
email: 'Outlook',
onedrive: 'OneDrive',
sharepoint: 'SharePoint',
teams: 'Teams',
gmail: 'Gmail',
gdrive: 'Google Drive',
local: 'Lokal',
smb: 'SMB',
};
let _sessions = null; // cached list; null = stale
let _latestRefScanId = null; // ref_scan_id of the newest session
// ── Session cache ─────────────────────────────────────────────────────────────
async function _fetchSessions() {
try {
const r = await fetch('/api/db/sessions');
_sessions = await r.json();
} catch(e) {
_sessions = [];
}
_latestRefScanId = _sessions.length ? _sessions[0].ref_scan_id : null;
return _sessions;
}
function invalidateHistoryCache() {
_sessions = null;
_latestRefScanId = null;
}
// ── Load a session into the results grid ──────────────────────────────────────
// Default landing view: every flagged item still awaiting action, across all
// scans (not just the latest session). Leaves S._historyRefScanId null (live
// mode) and shows no history banner — this is "now", not a past session.
async function loadOpenItems() {
// Bail if a scan is running — live SSE owns the grid then.
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
try {
const r = await fetch('/api/db/flagged');
const items = await r.json();
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
closeHistoryPicker();
if (!Array.isArray(items) || items.length === 0) {
S._historyRefScanId = null;
_setHistoryBanner(false);
window.loadLastScanSummary?.();
return;
}
S._historyRefScanId = null;
S.flaggedData = items;
S.filteredData = [];
const grid = document.getElementById('grid');
const emptyState = document.getElementById('emptyState');
const lastScan = document.getElementById('lastScanSummary');
if (emptyState) emptyState.style.display = 'none';
if (lastScan) lastScan.style.display = 'none';
if (grid) { grid.innerHTML = ''; grid.style.display = 'grid'; }
window.renderGrid(items);
try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(false);
} catch(e) {
console.error('[history] failed to load open items:', e);
}
}
async function loadHistorySession(refScanId) {
// refScanId: null → all open (unreviewed) items across every scan,
// positive int → a specific past session
if (refScanId === null) return loadOpenItems();
const resolvedRef = refScanId;
try {
const r = await fetch('/api/db/flagged?ref=' + resolvedRef);
const items = await r.json();
// Bail if a scan started while we were fetching flagged items
if (S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
closeHistoryPicker();
if (!Array.isArray(items) || items.length === 0) {
S._historyRefScanId = null;
_setHistoryBanner(false);
window.loadLastScanSummary?.();
return;
}
S._historyRefScanId = resolvedRef;
S.flaggedData = items;
S.filteredData = [];
const grid = document.getElementById('grid');
const emptyState = document.getElementById('emptyState');
const lastScan = document.getElementById('lastScanSummary');
if (emptyState) emptyState.style.display = 'none';
if (lastScan) lastScan.style.display = 'none';
if (grid) { grid.innerHTML = ''; grid.style.display = 'grid'; }
window.renderGrid(items);
try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(true, resolvedRef);
// ── Re-scan diff: append items from previous session no longer present ────
const allSessions = _sessions !== null ? _sessions : await _fetchSessions();
const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef);
if (idx !== -1 && idx + 1 < allSessions.length) {
const prevRef = allSessions[idx + 1].ref_scan_id;
try {
const pr = await fetch('/api/db/flagged?ref=' + prevRef);
const prevItems = await pr.json();
if (Array.isArray(prevItems) && prevItems.length) {
const currentIds = new Set(items.map(f => f.id));
const resolved = prevItems.filter(f => !currentIds.has(f.id));
if (resolved.length) {
const divider = document.createElement('div');
divider.className = 'resolved-divider';
divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present');
document.getElementById('grid')?.appendChild(divider);
resolved.forEach(f => { f._resolved = true; window.appendCard(f); });
_setHistoryBanner(true, resolvedRef, resolved.length);
}
}
} catch(e) {
console.warn('[history] diff failed:', e);
}
}
} catch(e) {
console.error('[history] failed to load session:', e);
}
}
// ── Banner ────────────────────────────────────────────────────────────────────
function _setHistoryBanner(visible, resolvedRef, resolvedCount) {
const banner = document.getElementById('historyBanner');
const bannerTxt = document.getElementById('historyBannerText');
const latestBtn = document.getElementById('historyLatestBtn');
if (!banner) return;
if (!visible) { banner.style.display = 'none'; return; }
const sess = (_sessions || []).find(s => s.ref_scan_id === resolvedRef);
let label = '';
if (sess) {
const date = new Date(sess.started_at * 1000).toLocaleDateString(undefined,
{day: 'numeric', month: 'short', year: 'numeric'});
const time = new Date(sess.started_at * 1000).toLocaleTimeString(undefined,
{hour: '2-digit', minute: '2-digit'});
const srcStr = (sess.sources || []).map(s => _SRC_LABELS[s] || s).join(' · ');
label = date + ' ' + time
+ (srcStr ? ' · ' + srcStr : '')
+ ' · ' + sess.flagged_count + ' ' + t('history_items', 'items');
if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved');
} else {
label = S.flaggedData.length + ' ' + t('history_items', 'items');
}
if (bannerTxt) bannerTxt.textContent = label;
if (latestBtn) latestBtn.style.display = (resolvedRef !== _latestRefScanId) ? '' : 'none';
banner.style.display = 'flex';
}
function exitHistoryMode() {
S._historyRefScanId = null;
const banner = document.getElementById('historyBanner');
if (banner) banner.style.display = 'none';
closeHistoryPicker();
}
// ── Session picker dropdown ───────────────────────────────────────────────────
async function openHistoryPicker() {
const drop = document.getElementById('historyDropdown');
if (!drop) return;
// Toggle
if (drop.style.display !== 'none') { drop.style.display = 'none'; return; }
drop.innerHTML = '<div style="padding:10px 12px;font-size:12px;color:var(--muted)">'
+ t('lbl_loading', 'Loading\u2026') + '</div>';
drop.style.display = '';
const sessions = _sessions !== null ? _sessions : await _fetchSessions();
if (!sessions.length) {
drop.innerHTML = '<div style="padding:12px;font-size:12px;color:var(--muted);text-align:center">'
+ t('history_picker_empty', 'No past scans') + '</div>';
return;
}
drop.innerHTML = '';
sessions.forEach((sess, i) => {
const date = new Date(sess.started_at * 1000).toLocaleDateString(undefined,
{day: 'numeric', month: 'short', year: 'numeric'});
const time = new Date(sess.started_at * 1000).toLocaleTimeString(undefined,
{hour: '2-digit', minute: '2-digit'});
const srcStr = (sess.sources || []).map(s => _SRC_LABELS[s] || s).join(' · ');
const isActive = sess.ref_scan_id === S._historyRefScanId;
const row = document.createElement('div');
row.style.cssText = 'padding:8px 12px;cursor:pointer'
+ (i < sessions.length - 1 ? ';border-bottom:1px solid var(--border)' : '')
+ (isActive ? ';background:var(--bg)' : '');
row.innerHTML =
'<div style="display:flex;align-items:center;gap:6px;margin-bottom:2px">' +
'<span style="font-size:12px;font-weight:500;color:var(--text)">' + date + '</span>' +
'<span style="font-size:10px;color:var(--muted)">' + time + '</span>' +
(sess.delta
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;font-weight:600">'
+ t('history_delta_badge', 'Delta') + '</span>'
: '') +
(i === 0
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;font-weight:600">'
+ t('history_latest_badge', 'Latest') + '</span>'
: '') +
'</div>' +
'<div style="font-size:10px;color:var(--muted)">' +
srcStr + ' &nbsp;\u00b7&nbsp; ' + sess.flagged_count + ' ' + t('history_items', 'items') +
'</div>';
row.addEventListener('mouseenter', () => { if (!isActive) row.style.background = 'var(--surface)'; });
row.addEventListener('mouseleave', () => { row.style.background = isActive ? 'var(--bg)' : ''; });
row.addEventListener('click', () => loadHistorySession(sess.ref_scan_id));
drop.appendChild(row);
});
}
function closeHistoryPicker() {
const drop = document.getElementById('historyDropdown');
if (drop) drop.style.display = 'none';
}
// Close picker when clicking outside its container
document.addEventListener('click', e => {
const wrap = document.getElementById('historyPickerBtn')?.closest('[data-history-wrap]');
if (wrap && !wrap.contains(e.target)) closeHistoryPicker();
}, true);
// ── Window exports ────────────────────────────────────────────────────────────
window.loadHistorySession = loadHistorySession;
window.openHistoryPicker = openHistoryPicker;
window.closeHistoryPicker = closeHistoryPicker;
window.exitHistoryMode = exitHistoryMode;
window.invalidateHistoryCache = invalidateHistoryCache;

View File

@ -161,9 +161,10 @@ function copyLog() {
document.querySelectorAll('#logPanel .log-line:not(#logLive)').forEach(function(d) {
lines.push(d.textContent);
});
const btn = document.querySelector('.log-copy-btn');
// _copyText (viewer.js) handles HTTP contexts where navigator.clipboard is undefined.
if (btn) window._copyText(lines.join('\n'), btn);
navigator.clipboard.writeText(lines.join('\n')).then(function() {
const btn = document.querySelector('.log-copy-btn');
if (btn) { btn.textContent = '✓ Copied'; setTimeout(function(){ btn.textContent = '⎘ Copy'; }, 1500); }
}).catch(function() {});
}
function _restoreLog() {

View File

@ -69,11 +69,6 @@ function _applyProfile(profile) {
// File sources may not be rendered yet (they load async), so store their IDs
// in S._pendingProfileSources for renderSourcesPanel() to apply after re-render.
const profileSources = profile.sources || [];
// Ensure at least M365 source checkboxes are present before reading the DOM.
// renderSourcesPanel() is idempotent and fast — safe to call here.
if (!document.querySelector('#sourcesPanel input[data-source-id]') && typeof renderSourcesPanel === 'function') {
renderSourcesPanel();
}
document.querySelectorAll('#sourcesPanel input[data-source-id]').forEach(function(cb) {
cb.checked = profileSources.includes(cb.dataset.sourceId);
});
@ -127,36 +122,6 @@ function _applyProfile(profile) {
if (el) el.checked = opts.scan_photos;
}
if (opts.skip_gps_images !== undefined) {
const el = document.getElementById('optSkipGps');
if (el) el.checked = opts.skip_gps_images;
}
if (opts.min_cpr_count !== undefined) {
const el = document.getElementById('optMinCpr');
if (el) el.value = opts.min_cpr_count;
}
if (opts.ocr_lang !== undefined) {
const el = document.getElementById('optOcrLang');
if (el) el.value = opts.ocr_lang;
}
if (opts.cpr_only !== undefined) {
const el = document.getElementById('optCprOnly');
if (el) el.checked = opts.cpr_only;
}
if (opts.scan_emails !== undefined) {
const el = document.getElementById('optScanEmails');
if (el) el.checked = opts.scan_emails;
}
if (opts.scan_phones !== undefined) {
const el = document.getElementById('optScanPhones');
if (el) el.checked = opts.scan_phones;
}
// ── Date filter ───────────────────────────────────────────────────────────
const days = opts.older_than_days;
if (days !== undefined) {
@ -206,13 +171,8 @@ function _applyProfile(profile) {
// ── User selection ────────────────────────────────────────────────────────
if (profile.user_ids === 'all') {
if (S._allUsers.length) {
S._allUsers.forEach(u => { u.selected = true; });
renderAccountList();
} else {
// Users not loaded yet — defer until loadUsers() resolves
window._pendingProfileAllUsers = true;
}
S._allUsers.forEach(u => { u.selected = true; });
if (S._allUsers.length) renderAccountList();
} else if (Array.isArray(profile.user_ids) && profile.user_ids.length) {
window._pendingProfileUserIds = profile.user_ids.map(u => u.id || u);
_applyPendingProfileUsers();
@ -375,8 +335,7 @@ function _openEditorForProfile(profile) {
: (u.platform || 'm365') === 'google' ? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#EAF3DE;color:#3B6D11;font-weight:500">GWS</span>'
: '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#E6F1FB;color:#185FA5;font-weight:500">M365</span>';
const roleBadge = u.userRole === 'student' ? t('role_student','Elev') : u.userRole === 'staff' ? t('role_staff','Ansat') : t('role_other','Anden');
const roleOverrideStyle = u.roleOverride ? 'color:var(--color-text-info);outline:1px solid var(--color-border-info);' : '';
return `<label class="pmgmt-acct-row" data-uid="${_esc(u.id)}" data-role="${_esc(u.userRole || 'other')}"><input type="checkbox" ${checked} data-uid="${_esc(u.id)}"><span style="flex:1;color:var(--color-text-primary);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">${_esc(u.displayName)}</span>${platBadge}<button type="button" class="pmgmt-role-badge" data-uid="${_esc(u.id)}" onclick="_pmgmtCycleRole(this.getAttribute('data-uid'),event)" style="font-size:9px;padding:1px 5px;border-radius:10px;background:#D3D1C7;border:none;cursor:pointer;${roleOverrideStyle}">${roleBadge}</button></label>`;
return `<label class="pmgmt-acct-row" data-uid="${_esc(u.id)}"><input type="checkbox" ${checked} data-uid="${_esc(u.id)}"><span style="flex:1;color:var(--color-text-primary);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">${_esc(u.displayName)}</span>${platBadge}<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#D3D1C7;color:#444441">${roleBadge}</span></label>`;
}).join('');
body.innerHTML = `
@ -435,12 +394,6 @@ function _openEditorForProfile(profile) {
<div class="pmgmt-opt-row"><span>${t('m365_opt_max_emails','Maks. e-mails pr. bruger')}</span><input type="number" id="peOptMaxEmails" value="${opts.max_emails || 2000}" min="10" max="50000" style="width:56px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_delta','Delta-scanning')}</span><label class="toggle"><input type="checkbox" id="peOptDelta" ${opts.delta ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_cpr_only','CPR-only mode')}</span><label class="toggle"><input type="checkbox" id="peOptCprOnly" ${opts.cpr_only ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_ocr_lang','OCR-sprog')}</span><select id="peOptOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px"><option value="dan+eng" ${(opts.ocr_lang||'dan+eng')==='dan+eng'?'selected':''}>dan+eng</option><option value="dan" ${opts.ocr_lang==='dan'?'selected':''}>dan</option><option value="eng" ${opts.ocr_lang==='eng'?'selected':''}>eng</option><option value="dan+eng+deu" ${opts.ocr_lang==='dan+eng+deu'?'selected':''}>dan+eng+deu</option><option value="dan+eng+swe" ${opts.ocr_lang==='dan+eng+swe'?'selected':''}>dan+eng+swe</option><option value="dan+eng+fra" ${opts.ocr_lang==='dan+eng+fra'?'selected':''}>dan+eng+fra</option></select></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_emails','Søg efter e-mailadresser')}</span><label class="toggle"><input type="checkbox" id="peOptEmails" ${opts.scan_emails ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_phones','Søg efter telefonnumre')}</span><label class="toggle"><input type="checkbox" id="peOptPhones" ${opts.scan_phones ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<hr style="border:none;border-top:1px solid var(--pmgmt-divider);margin:2px 0">
<div class="pmgmt-opt-row"><span>${t('m365_opt_retention','Opbevaringspolitik')}</span><label class="toggle"><input type="checkbox" id="peOptRetention" ${profile.retention_years ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div style="padding:7px 8px;background:var(--bg);border-radius:6px">
@ -550,26 +503,6 @@ function _pmgmtCloseEditor() {
closeProfileMgmt();
}
async function _pmgmtCycleRole(uid, event) {
event.stopPropagation();
if (typeof cycleUserRole !== 'function') return;
await cycleUserRole(uid);
// Refresh the badge inside the profile modal to reflect the new role
const u = S._allUsers.find(function(u){ return u.id === uid; });
if (!u) return;
const lbl = document.querySelector('#pmgmtAcctList label[data-uid="' + uid.replace(/"/g, '\\"') + '"]');
if (!lbl) return;
const badge = lbl.querySelector('.pmgmt-role-badge');
if (!badge) return;
const roleText = u.userRole === 'student' ? t('role_student','Elev')
: u.userRole === 'staff' ? t('role_staff','Ansat')
: t('role_other','Anden');
badge.textContent = roleText;
lbl.dataset.role = u.userRole || 'other';
badge.style.color = u.roleOverride ? 'var(--color-text-info)' : '';
badge.style.outline = u.roleOverride ? '1px solid var(--color-border-info)' : '';
}
function _pmgmtSelectAllAccounts(checked) {
document.querySelectorAll('#pmgmtAcctList label input[type=checkbox]').forEach(function(cb) {
if (cb.closest('label').style.display !== 'none') cb.checked = checked;
@ -608,9 +541,10 @@ function _pmgmtAddManual() {
function _pmgmtFilterAccounts(q) {
q = (q || '').toLowerCase();
document.querySelectorAll('#pmgmtAcctList label').forEach(function(lbl) {
var name = (lbl.querySelector('span') || {}).textContent || '';
var role = lbl.dataset.role || 'other';
var roleOk = !_pmgmtRoleActive || role === _pmgmtRoleActive;
var name = (lbl.querySelector('span') || {}).textContent || '';
var uid = lbl.querySelector('input')?.dataset?.uid || '';
var user = S._allUsers.find(u => u.id === uid);
var roleOk = !_pmgmtRoleActive || (user && user.userRole === _pmgmtRoleActive);
var nameOk = !q || name.toLowerCase().includes(q);
lbl.style.display = (roleOk && nameOk) ? '' : 'none';
});
@ -655,12 +589,6 @@ async function _pmgmtSaveFullEdit() {
max_emails: parseInt(document.getElementById('peOptMaxEmails')?.value) || 2000,
delta: document.getElementById('peOptDelta')?.checked ?? false,
scan_photos: document.getElementById('peOptPhotos')?.checked ?? false,
skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false,
min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1,
ocr_lang: document.getElementById('peOptOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('peOptCprOnly')?.checked ?? false,
scan_emails: document.getElementById('peOptEmails')?.checked ?? false,
scan_phones: document.getElementById('peOptPhones')?.checked ?? false,
},
retention_years: document.getElementById('peOptRetention')?.checked ? (parseInt(document.getElementById('peOptRetYears')?.value) || 5) : null,
fiscal_year_end: document.getElementById('peOptRetention')?.checked ? (document.getElementById('peOptFiscalYearEnd')?.value || '') : '',
@ -673,7 +601,6 @@ async function _pmgmtSaveFullEdit() {
const d = await r.json();
if (d.error) { alert(d.error); return; }
await loadProfiles();
_renderProfileMgmt();
window._pmgmtNewDraft = null;
log(t('m365_profile_saved','Profile saved') + ': ' + name);
// Show inline saved feedback without closing the modal
@ -687,10 +614,7 @@ async function _pmgmtSaveFullEdit() {
}
// Re-open the editor for the saved profile so it reflects the saved state
const saved = S._profiles.find(function(p) { return p.name === name; });
if (saved) {
window._pmgmtEditId = saved.id;
document.querySelectorAll('.pmgmt-row').forEach(r => r.classList.toggle('active', r.dataset.id === saved.id));
}
if (saved) { window._pmgmtEditId = saved.id; }
} catch(e) { alert('Save failed: ' + e.message); }
}
@ -774,7 +698,6 @@ window._peSetYear = _peSetYear;
window._renderEditorSources = _renderEditorSources;
window._pmgmtNewProfile = _pmgmtNewProfile;
window._pmgmtCloseEditor = _pmgmtCloseEditor;
window._pmgmtCycleRole = _pmgmtCycleRole;
window._pmgmtSelectAllAccounts = _pmgmtSelectAllAccounts;
window._pmgmtRoleFilter = _pmgmtRoleFilter;
window._pmgmtAddManual = _pmgmtAddManual;

View File

@ -1,18 +1,4 @@
import { S } from './state.js';
// Escape untrusted strings (filenames, account/display names, folders) before
// embedding them in innerHTML / title attributes. Scan-derived values can come
// from attacker-controlled content (e.g. a OneDrive file named with markup),
// so every such field must pass through esc() to prevent stored XSS.
function esc(s) {
return String(s == null ? '' : s)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;');
}
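Review note: a minimal sketch of why the card templates in this diff wrap scan-derived fields in esc(). The esc() body is copied from the hunk above so the snippet runs on its own; the `hostile` object and its filename are made-up test data, not something taken from the scanner.

```javascript
// Copy of the esc() helper from this diff, repeated so the sketch is standalone.
function esc(s) {
  return String(s == null ? '' : s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical scan result whose filename carries markup.
const hostile = { name: '"><img src=x onerror=alert(1)>.docx' };

// Unescaped interpolation: the filename closes the title attribute and injects a tag.
const unsafe = `<div class="card-name" title="${hostile.name}">${hostile.name}</div>`;

// Escaped interpolation: the whole filename stays inert text in both positions.
const safe = `<div class="card-name" title="${esc(hostile.name)}">${esc(hostile.name)}</div>`;

console.log('unsafe contains a live <img tag:', unsafe.includes('<img'));
console.log('safe contains a live <img tag:  ', safe.includes('<img'));
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.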
// ── Cards ─────────────────────────────────────────────────────────────────────
const SOURCE_BADGES = {
email: ['📧', 'badge-email', 'Outlook'],
@ -25,31 +11,6 @@ const SOURCE_BADGES = {
smb: ['🌐', 'badge-smb', 'Network'],
};
// Build the user/group pill for a card. The group (role) badge is driven by
// user_role alone so it shows even when no display name is available — e.g.
// items from earlier scans saved before account_name was persisted. For those
// the user label is resolved best-effort from the loaded user list (by id or
// email), falling back to an email-style account_id. Returns '' when there is
// neither a label nor a role to show.
function _accountPill(f) {
const roleBadge =
f.user_role === 'student' ? '<span class="role-badge">' + t('role_student', 'Elev') + '</span>' :
f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff', 'Ansat') + '</span>' : '';
let label = f.account_name || '';
if (!label && f.account_id) {
const aid = String(f.account_id);
const u = (S._allUsers || []).find(function(u) {
return u.id === f.account_id ||
(u.email && u.email.toLowerCase() === aid.toLowerCase());
});
if (u) label = u.displayName || '';
else if (aid.includes('@')) label = aid; // an email is already human-readable
}
if (!label && !roleBadge) return '';
const title = label || f.user_role || '';
return '<span class="account-pill" title="' + esc(title) + '">' + roleBadge + (label ? esc(label) : '') + '</span>';
}
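Review note: the label fallback inside _accountPill() can be read as a pure function, which makes the order of the fallbacks easier to check. This is a sketch only: `resolveAccountLabel` and the sample user list are hypothetical names introduced for illustration, and the real function reads the user list from S._allUsers instead of taking it as a parameter.

```javascript
// Hypothetical pure helper mirroring the label fallback in _accountPill():
// 1. stored account_name, 2. display name of a loaded user matched by id or
// email (case-insensitive), 3. the account_id itself when it looks like an email.
function resolveAccountLabel(f, users) {
  if (f.account_name) return f.account_name;
  if (!f.account_id) return '';
  const aid = String(f.account_id);
  const match = (users || []).find(u =>
    u.id === f.account_id ||
    (u.email && u.email.toLowerCase() === aid.toLowerCase()));
  if (match) return match.displayName || '';
  return aid.includes('@') ? aid : '';
}

// Made-up sample data for illustration.
const sampleUsers = [
  { id: 'u1', email: 'Anna@Example.dk', displayName: 'Anna Jensen' },
];

console.log(resolveAccountLabel({ account_id: 'anna@example.dk' }, sampleUsers));
console.log(resolveAccountLabel({ account_id: 'bob@example.dk' }, sampleUsers));
```

An opaque id with no matching user yields an empty label, which is the case where the pill falls back to showing the role badge alone.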
function appendCard(f) {
const search = document.getElementById('filterSearch').value.trim().toLowerCase();
const srcVal = document.getElementById('filterSource').value;
@ -63,57 +24,36 @@ function appendCard(f) {
: '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type);
const card = document.createElement('div');
card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + ((f._resolved || f._redacted || f._deleted) ? ' card-resolved' : '');
card.className = 'card' + (S.isListView ? ' list-view' : '');
card.dataset.id = f.id;
card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } };
card.onclick = () => openPreview(f);
const cb = document.createElement('input');
cb.type = 'checkbox';
cb.className = 'card-cb';
cb.checked = S._selectedIds.has(f.id);
cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); };
card.appendChild(cb);
const delBtn = (window.VIEWER_MODE || f._resolved || f._redacted || f._deleted) ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const _cloudRedactExts = new Set(['.docx', '.xlsx', '.pdf']);
const _m365Types = new Set(['onedrive', 'sharepoint', 'teams']);
const _fileExt = (f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase();
const _redactable = !window.VIEWER_MODE && !f._resolved && !f._redacted && !f._deleted && f.cpr_count > 0 && (
f.source_type === 'local' ? _redactExts.has(_fileExt) :
_m365Types.has(f.source_type) ? _cloudRedactExts.has(_fileExt) :
f.source_type === 'gdrive' ? _cloudRedactExts.has(_fileExt) :
(f.source_type === 'smb' || f.source_type === 'sftp') ? _redactExts.has(_fileExt) : false
);
const redactBtn = _redactable ? `<button class="card-redact-btn" title="${t('redact_btn','Redact CPR')}" onclick="event.stopPropagation();redactItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">✏</button>` : '';
const acctPill = _accountPill(f);
const delBtn = window.VIEWER_MODE ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
if (S.isListView) {
card.innerHTML = `
<div style="font-size:24px; flex-shrink:0">${icon}</div>
<div class="card-info list-info">
<div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}${f.folder ? ' · 📂 ' + esc(f.folder) : ''}</div>
<div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span> ${esc(f.source || '')}${acctPill ? ' · ' + acctPill : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<div class="card-name" title="${f.name}">${f.name}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}${f.folder ? ' · 📂 ' + f.folder : ''}</div>
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span> ${f.source || ''}${f.account_name ? ' · <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === 'student' ? '<span class="role-badge">' + t('role_student','Elev') + '</span>' : f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff','Ansat') + '</span>' : '') + f.account_name + '</span>' : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>
${f.email_count > 0 ? '<span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span> ' : ''}
${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f._deleted ? '<span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span> ' : ''}${f._redacted ? '<span class="resolved-badge">✏ ' + t('redact_badge', 'Redacted') + '</span> ' : ''}${f._resolved ? '<span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
${delBtn}${redactBtn}`;
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
${delBtn}`;
} else {
card.innerHTML = `
<div class="thumb-wrap"><img src="${src}" alt="${esc(f.name)}" loading="lazy"></div>
<div class="thumb-wrap"><img src="${src}" alt="${f.name}" loading="lazy"></div>
<div class="card-info">
<div class="card-name" title="${esc(f.name)}">${esc(f.name)}</div>
<div class="card-meta">${f.size_kb} KB · ${esc(f.modified || '')}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${esc(f.folder)}">📂 ${esc(f.folder)}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${esc(label)}</span>${acctPill ? ' ' + acctPill : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f._deleted ? ' <span class="resolved-badge" style="background:#3a1a1a;color:#ff9b9b">🗑 ' + t('delete_badge', 'Deleted') + '</span>' : ''}${f._redacted ? ' <span class="resolved-badge">✏ ' + t('redact_badge', 'Redacted') + '</span>' : ''}${f._resolved ? ' <span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
<div class="card-name" title="${f.name}">${f.name}</div>
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
</div>
${delBtn}${redactBtn}`;
${delBtn}`;
}
grid.appendChild(card);
}
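Review note: the `_redactable` expression in appendCard() combines state flags, the CPR count, the source type and the file extension in one ternary chain. The sketch below restates it as a standalone predicate so the combinations can be checked in isolation. `isRedactable`, the set names and the `viewerMode` parameter are hypothetical; the extension lists and source types are the ones visible in the hunk. Unlike the original, a name without a dot is given an empty extension explicitly; the result is the same for such names (not redactable).

```javascript
// Extension and source-type sets as they appear in the diff.
const FILE_EXTS = new Set(['.docx', '.xlsx', '.txt', '.csv', '.pdf']);
const CLOUD_EXTS = new Set(['.docx', '.xlsx', '.pdf']);
const FILE_TYPES = new Set(['local', 'smb', 'sftp']);
const CLOUD_TYPES = new Set(['onedrive', 'sharepoint', 'teams', 'gdrive']);

// Hypothetical predicate: should a card offer the redact button?
function isRedactable(f, viewerMode) {
  // Read-only viewers and already-handled items never get the button.
  if (viewerMode || f._resolved || f._redacted || f._deleted) return false;
  // Nothing to redact without at least one CPR hit.
  if (!(f.cpr_count > 0)) return false;
  const name = f.name || '';
  const dot = name.lastIndexOf('.');
  const ext = dot >= 0 ? name.substring(dot).toLowerCase() : '';
  if (FILE_TYPES.has(f.source_type)) return FILE_EXTS.has(ext);
  if (CLOUD_TYPES.has(f.source_type)) return CLOUD_EXTS.has(ext);
  return false; // e.g. email items
}

console.log(isRedactable({ name: 'list.CSV', source_type: 'smb', cpr_count: 3 }, false));
console.log(isRedactable({ name: 'list.csv', source_type: 'onedrive', cpr_count: 3 }, false));
```

Plain-text formats (.txt, .csv) are only redactable on file-system sources; cloud sources are limited to .docx, .xlsx and .pdf.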
@ -122,19 +62,6 @@ function renderGrid(files) {
const grid = document.getElementById('grid');
grid.innerHTML = '';
files.forEach(f => appendCard(f));
// Whenever results are rendered, the landing/last-scan cards must be hidden —
// the live scan_file_flagged path shows the grid but does not clear them, so
// results would otherwise appear underneath the still-visible landing page
// until a manual refresh. Centralised here so every render path is covered.
if (files && files.length) {
const es = document.getElementById('emptyState');
if (es) es.style.display = 'none';
const ls = document.getElementById('lastScanSummary');
if (ls) ls.style.display = 'none';
if (grid) grid.style.display = S.isListView ? 'block' : 'grid';
}
_updateBulkBar();
updateDispositionStats();
}
// ── Preview panel ─────────────────────────────────────────────────────────────
@ -155,30 +82,22 @@ async function openPreview(f) {
panel.classList.remove('hidden');
const _savedW = sessionStorage.getItem('gdpr_preview_width');
if (_savedW) panel.style.width = _savedW + 'px';
// Opening the panel narrows .grid-area and reflows the grid to fewer columns,
// moving the selected card to a new row. Defer the scroll by two frames so it
// runs against the settled layout, and centre the card so it stays visible.
if (cardEl) requestAnimationFrame(() => requestAnimationFrame(() =>
cardEl.scrollIntoView({ behavior: 'smooth', block: 'center' })));
title.textContent = f.name;
frame.style.display = 'none';
loading.style.display = 'flex';
loading.textContent = 'Loading preview…';
meta.innerHTML = [
f.account_name ? `<span style="font-weight:500">👤 ${esc(f.account_name)}</span>` : '',
f.source ? `<span>${esc(f.source)}</span>` : '',
f.account_name ? `<span style="font-weight:500">👤 ${f.account_name}</span>` : '',
f.source ? `<span>${f.source}</span>` : '',
f.size_kb ? `<span>${f.size_kb} KB</span>` : '',
f.modified ? `<span>${esc(f.modified)}</span>` : '',
f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '',
f.email_count ? `<span style="color:#7ec8f0">${f.email_count} ${t('m365_badge_emails','e-mail')}</span>` : '',
f.phone_count ? `<span style="color:#7eeac0">${f.phone_count} ${t('m365_badge_phones','tlf.')}</span>` : '',
f.modified ? `<span>${f.modified}</span>` : '',
f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '',
f.url ? `<button class="preview-open-btn" onclick="window.open('${f.url}','_blank')">${t("m365_preview_open","Open in M365 ↗")}</button>` : '',
].filter(Boolean).join('');
_previewItemId = f.id;
loadDisposition(f.id);
_loadRelated(f);
loadDisposition(f.id); // load disposition for this item (#6)
try {
const r = await fetch('/api/preview/' + encodeURIComponent(f.id)
@ -244,44 +163,6 @@ async function openPreview(f) {
}
}
// ── Related documents (CPR cross-reference) ───────────────────────────────────
async function _loadRelated(f) {
const el = document.getElementById('previewRelated');
if (!el) return;
if (!f.cpr_count) { el.style.display = 'none'; return; }
const ref = S._historyRefScanId ? `&ref=${S._historyRefScanId}` : '';
try {
const r = await fetch(`/api/db/related/${encodeURIComponent(f.id)}?${ref}`);
const items = await r.json();
if (f.id !== _previewItemId) return; // stale
if (!items.length) { el.style.display = 'none'; return; }
const rows = items.map(item => {
const shared = item.shared_cprs ?? '';
const badge = shared ? `<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--danger);color:#fff;font-weight:500;flex-shrink:0">${shared} CPR</span>` : '';
const src = item.source ? `<span style="color:var(--muted);font-size:10px;flex-shrink:0">${esc(item.source)}</span>` : '';
return `<div onclick="window._openRelated('${item.id.replace(/'/g,"\\'")}',${JSON.stringify(item).replace(/"/g,'&quot;')})"
style="display:flex;align-items:center;gap:6px;padding:4px 0;cursor:pointer;border-radius:4px"
onmouseover="this.style.background='var(--surface)'" onmouseout="this.style.background=''">
<span style="flex:1;font-size:11px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(item.name)}">${esc(item.name)}</span>
${src}${badge}
</div>`;
}).join('');
el.innerHTML = `<div style="font-size:10px;font-weight:600;color:var(--muted);margin-bottom:4px;text-transform:uppercase;letter-spacing:.04em">${t('m365_related_docs','Related documents')} <span style="font-weight:400">(${items.length})</span></div>${rows}`;
el.style.display = 'block';
} catch(e) {
el.style.display = 'none';
}
}
window._openRelated = function(id, itemData) {
const cached = (S.flaggedData || []).find(x => x.id === id);
openPreview(cached || itemData);
};
// ── Retention policy (#1) ────────────────────────────────────────────────────
function toggleRetentionPanel() {
@ -406,9 +287,9 @@ async function runSubjectLookup() {
_dsubItems = d.items;
resultsEl.innerHTML = d.items.map(item => `
<div class="dsub-result-row">
<div class="dsub-result-name" title="${esc(item.name)}">${esc(item.name)}</div>
<div class="dsub-result-meta">${esc(item.source_type || "")}</div>
<div class="dsub-result-meta">${esc(item.modified || "")}</div>
<div class="dsub-result-name" title="${item.name}">${item.name}</div>
<div class="dsub-result-meta">${item.source_type || ""}</div>
<div class="dsub-result-meta">${item.modified || ""}</div>
<div class="dsub-result-meta" style="color:var(--danger)">${item.cpr_count} CPR</div>
</div>
`).join("");
@ -436,13 +317,10 @@ async function deleteSubjectItems() {
document.getElementById("dsubDeleteBtn").style.display = "none";
document.getElementById("dsubResults").innerHTML = "";
_dsubItems = [];
// Keep the deleted items in the grid (marked, greyed, buttons hidden)
// until the next scan run — only those the server actually deleted.
const deletedSet = new Set(d.deleted_ids || ids);
const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
// Refresh grid
S.flaggedData = S.flaggedData.filter(f => !ids.includes(f.id));
S.filteredData = S.filteredData.filter(f => !ids.includes(f.id));
renderGrid();
updateStats();
} catch(e) {
statusEl.textContent = "Delete failed: " + e.message;
@ -489,7 +367,6 @@ async function saveDisposition() {
// Update cached value on the S.flaggedData item
const item = S.flaggedData.find(f => f.id === _dispositionItemId);
if (item) item.disposition = status;
updateDispositionStats();
// Refresh card badge if a disposition filter is active
const dispFilter = document.getElementById("filterDisposition")?.value;
if (dispFilter) applyFilters();
@ -498,133 +375,6 @@ async function saveDisposition() {
}
}
// ── Disposition stats ─────────────────────────────────────────────────────────
function updateDispositionStats() {
const el = document.getElementById('dispStats');
if (!el) return;
const data = S.flaggedData;
if (!data.length) { el.style.display = 'none'; return; }
let unreviewed = 0, retain = 0, del = 0, other = 0;
for (const f of data) {
const d = f.disposition || 'unreviewed';
if (d === 'unreviewed') unreviewed++;
else if (d.startsWith('retain')) retain++;
else if (d.startsWith('delete') || d === 'deleted') del++;
else other++;
}
const reviewed = data.length - unreviewed;
const pct = data.length ? Math.round(reviewed / data.length * 100) : 0;
el.style.display = 'flex';
el.innerHTML =
`<span>${data.length} ${t('disp_stats_total','total')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span class="${unreviewed ? 'disp-stat-warn' : 'disp-stat-ok'}">${unreviewed} ${t('disp_stats_unreviewed','unreviewed')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span>${retain} ${t('disp_stats_retain','retain')}</span>` +
`<span class="disp-stat-sep"></span>` +
`<span>${del} ${t('disp_stats_delete','delete')}</span>` +
(other ? `<span class="disp-stat-sep"></span><span>${other} ${t('disp_stats_other','other')}</span>` : '') +
`<span class="disp-stat-sep" style="margin-left:auto"></span>` +
`<span style="font-weight:600;color:var(--accent)">${pct}% ${t('disp_stats_reviewed','reviewed')}</span>`;
}
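Review note: the counting in updateDispositionStats() is easier to verify when separated from the DOM writes. A sketch under assumptions: `countDispositions` is a hypothetical name, and the disposition strings in the sample ('retain-legal', 'personal-use') are invented for illustration; only the prefix rules (`retain…`, `delete…`, `deleted`, default `unreviewed`) come from the diff.

```javascript
// Hypothetical pure version of the tally in updateDispositionStats().
function countDispositions(items) {
  const c = { total: items.length, unreviewed: 0, retain: 0, del: 0, other: 0, pct: 0 };
  for (const f of items) {
    const d = f.disposition || 'unreviewed'; // missing value counts as unreviewed
    if (d === 'unreviewed') c.unreviewed++;
    else if (d.startsWith('retain')) c.retain++;
    else if (d.startsWith('delete') || d === 'deleted') c.del++;
    else c.other++;
  }
  // Share of items that have any disposition other than "unreviewed".
  c.pct = c.total ? Math.round((c.total - c.unreviewed) / c.total * 100) : 0;
  return c;
}

// Made-up sample: one item per bucket.
const sampleItems = [
  {},
  { disposition: 'retain-legal' },
  { disposition: 'deleted' },
  { disposition: 'personal-use' },
];
console.log(countDispositions(sampleItems));
```

Since 'deleted' already starts with 'delete', the explicit `d === 'deleted'` comparison in the original is redundant but harmless.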
// ── Bulk disposition tagging ──────────────────────────────────────────────────
function toggleSelectMode() {
S._selectMode = !S._selectMode;
document.body.classList.toggle('select-mode', S._selectMode);
const btn = document.getElementById('selectModeBtn');
if (btn) {
btn.style.background = S._selectMode ? 'var(--accent)' : 'none';
btn.style.color = S._selectMode ? '#fff' : 'var(--muted)';
btn.style.borderColor = S._selectMode ? 'var(--accent)' : 'var(--border)';
}
if (!S._selectMode) {
S._selectedIds.clear();
_updateBulkBar();
} else {
closePreview();
}
// Re-render so card onclick handlers respect new mode
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
}
function toggleCardSelect(id, ev) {
if (ev) ev.stopPropagation();
if (S._selectedIds.has(id)) S._selectedIds.delete(id);
else S._selectedIds.add(id);
const cb = document.querySelector(`.card[data-id="${CSS.escape(id)}"] .card-cb`);
if (cb) cb.checked = S._selectedIds.has(id);
const card = document.querySelector(`.card[data-id="${CSS.escape(id)}"]`);
if (card) card.classList.toggle('card-selected-bulk', S._selectedIds.has(id));
_updateBulkBar();
}
function selectAllVisible() {
const allChecked = S.filteredData.every(f => S._selectedIds.has(f.id));
if (allChecked) {
S.filteredData.forEach(f => { S._selectedIds.delete(f.id); });
} else {
S.filteredData.forEach(f => { S._selectedIds.add(f.id); });
}
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
_updateBulkBar();
}
function _updateBulkBar() {
const bar = document.getElementById('bulkTagBar');
const cnt = document.getElementById('bulkTagCount');
const saEl = document.getElementById('bulkSelectAll');
if (!bar) return;
const n = S._selectedIds.size;
bar.style.display = (S._selectMode && n > 0) ? 'flex' : 'none';
if (cnt) cnt.textContent = n + ' ' + t('bulk_selected', 'selected');
if (saEl) {
const allVis = S.filteredData.length > 0 && S.filteredData.every(f => S._selectedIds.has(f.id));
saEl.textContent = allVis
? t('bulk_deselect_all', 'Deselect all')
: t('bulk_select_all', 'Select all visible');
}
}
async function applyBulkDisposition() {
const status = document.getElementById('bulkDispSelect')?.value;
if (!status || S._selectedIds.size === 0) return;
const ids = [...S._selectedIds];
const btn = document.getElementById('bulkTagApplyBtn');
const statusEl = document.getElementById('bulkTagStatus');
if (btn) btn.disabled = true;
if (statusEl) statusEl.textContent = '';
try {
const r = await fetch('/api/db/disposition/bulk', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({item_ids: ids, status}),
});
const d = await r.json();
if (d.error) throw new Error(d.error);
// Update in-memory items
for (const f of S.flaggedData) {
if (S._selectedIds.has(f.id)) f.disposition = status;
}
if (statusEl) {
statusEl.textContent = '✓ ' + d.saved + ' ' + t('bulk_applied', 'updated');
setTimeout(() => { if (statusEl) statusEl.textContent = ''; }, 2000);
}
S._selectedIds.clear();
_updateBulkBar();
// Refresh filter if disposition filter is active
const dispFilter = document.getElementById('filterDisposition')?.value;
if (dispFilter) applyFilters();
else renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateDispositionStats();
} catch(e) {
if (statusEl) statusEl.textContent = e.message;
} finally {
if (btn) btn.disabled = false;
}
}
function closePreview() {
const panel = document.getElementById('previewPanel');
panel.style.width = ''; // clear inline width so CSS .hidden { width:0 } takes effect
@ -649,13 +399,9 @@ async function deleteItem(f, cardEl) {
});
const d = await r.json();
if (d.ok) {
// Keep the deleted item in the grid (marked, greyed, action buttons
// hidden) until the next scan run, so the operator can see what was
// handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
S.flaggedData = S.flaggedData.filter(x => x.id !== f.id);
S.filteredData = S.filteredData.filter(x => x.id !== f.id);
if (cardEl) cardEl.remove();
updateStats();
log(t('m365_log_deleted', 'Deleted:') + ' ' + f.name, 'ok');
if (_previewItemId === f.id) closePreview();
@ -667,36 +413,6 @@ async function deleteItem(f, cardEl) {
}
}
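Review note: this diff contains two ways of handling a successful delete: filtering the item out of S.flaggedData and S.filteredData, or keeping it and setting a `_deleted` flag so the card stays visible but greyed. The second approach can be sketched as one shared helper. `markHandled` is a hypothetical name, and the sample arrays stand in for the two state lists; it assumes, as Array.prototype.filter guarantees, that the filtered list holds the same object references as the full list.

```javascript
// Hypothetical helper: flag every item whose id is in `ids`, across several
// lists that may share the same objects. Returns how many distinct items
// were newly flagged.
function markHandled(lists, ids, flag) {
  const idSet = new Set(ids);
  let marked = 0;
  for (const list of lists) {
    for (const item of list) {
      if (idSet.has(item.id) && !item[flag]) {
        item[flag] = true;
        marked++;
      }
    }
  }
  return marked;
}

// Made-up stand-ins for S.flaggedData and S.filteredData.
const flaggedSample = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const filteredSample = flaggedSample.filter(x => x.id !== 'c');

const newlyMarked = markHandled([flaggedSample, filteredSample], ['a', 'c'], '_deleted');
console.log(newlyMarked, flaggedSample.map(x => Boolean(x._deleted)));
```

Because both lists point at the same objects, the second pass finds the flag already set and the item is counted once; the list lengths are unchanged, so the grid can be re-rendered from the same data.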
async function redactItem(f, cardEl) {
if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. This cannot be undone.'))) return;
if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; }
try {
const r = await fetch('/api/redact_item', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({id: f.id, source_type: f.source_type})
});
const d = await r.json();
if (d.ok) {
// Keep the redacted item in the grid (marked, greyed, action buttons
// hidden) until the next scan run, so the operator can see what was
// handled. The grid is rebuilt on the next scan, clearing these.
const _mark = (x) => { if (x.id === f.id) x._redacted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok');
if (_previewItemId === f.id) closePreview();
} else {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err');
}
} catch(e) {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err');
}
}
// ── Bulk delete modal ─────────────────────────────────────────────────────────
function openBulkDelete() {
@ -720,7 +436,6 @@ function _bdFilters() {
function _bdMatches() {
const f = _bdFilters();
return S.flaggedData.filter(x => {
if (x._deleted || x._redacted) return false; // already handled this session
if (f.source_type && x.source_type !== f.source_type) return false;
if (x.cpr_count < f.min_cpr) return false;
if (f.older_than_date && x.modified > f.older_than_date) return false;
@ -773,34 +488,25 @@ function _ensureSSE() {
function _sseWatchdog() {
fetch('/api/scan/status').then(function(r) { return r.json(); }).then(function(status) {
var anyRunning = status.running || status.google_running;
if (anyRunning) {
if (status.running) {
// A scan is in progress — make sure SSE is connected and progress UI is visible
_ensureSSE();
if (status.running && !S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = true;
document.getElementById('stopBtn').style.display = 'inline-block';
// status.running reflects the M365 + file lock; treat as an M365 reconnect
// /api/scan/status checks the M365 lock — if running=true it's an M365 scan
S._m365ScanRunning = true; _renderProgressSegments();
document.getElementById('progressFile').textContent = t('m365_sse_reconnecting', 'Reconnecting to running scan…');
log(t('m365_sse_reconnecting', 'Reconnecting to running scan…'));
}
} else if (!S._historyRefScanId && !(S.flaggedData && S.flaggedData.length)) {
// No scan of any kind is running (authoritative, both locks free) and
// nothing is shown yet — restore the last saved session from the DB.
// Retried on every poll, not one-shot: the initial attempt can be blocked
// by running flags that SSE replay of a *completed* scan set but never
// cleared, and sse_replay_done only fires for a non-empty buffer (so it
// never retries after a server restart clears the replay buffer).
// Both locks are confirmed free, so clear any stale flags first.
S._m365ScanRunning = false;
S._googleScanRunning = false;
S._fileScanRunning = false;
window.loadHistorySession?.(null);
}
_initialStatusChecked = true;
// Keep polling even when idle — the SSE connection may have died and we
// need to detect the next scheduled scan (SSE is only opened on demand).
if (!_initialStatusChecked) {
_initialStatusChecked = true;
if (!status.running) loadLastScanSummary();
}
// When no scan is running, we still keep polling — the SSE connection
// may have died and we need to detect the *next* scheduled scan.
// The SSE itself is only opened/reopened when a scan is detected.
}).catch(function(err) {
// Status endpoint unavailable — server might be restarting
console.warn('[SSE] status poll failed:', err);
@ -935,12 +641,9 @@ async function executeBulkDelete() {
});
const d = await r.json();
if (d.ok) {
// Keep the deleted items in the grid (marked, greyed, buttons hidden)
// until the next scan run — only those the server actually deleted.
const deletedSet = new Set(d.deleted_ids || matches.map(x => x.id));
const _mark = (x) => { if (deletedSet.has(x.id)) x._deleted = true; };
S.flaggedData.forEach(_mark);
S.filteredData.forEach(_mark);
const deletedSet = new Set(matches.map(x => x.id));
S.flaggedData = S.flaggedData.filter(x => !deletedSet.has(x.id));
S.filteredData = S.filteredData.filter(x => !deletedSet.has(x.id));
renderGrid(S.filteredData.length ? S.filteredData : S.flaggedData);
updateStats();
prog.innerHTML = `<span style="color:var(--ok,#4c4)">✓ ${d.deleted} ${t('m365_bulk_deleted', 'deleted')}</span>` +
@ -966,7 +669,6 @@ function applyFilters() {
const dispVal = document.getElementById('filterDisposition')?.value || '';
const transferVal = document.getElementById('filterTransfer')?.value || '';
const specialVal = document.getElementById('filterSpecial')?.value || '';
const roleVal = document.getElementById('filterRole')?.value || '';
S.filteredData = S.flaggedData.filter(f => {
if (search && !f.name.toLowerCase().includes(search)) return false;
if (srcVal && f.source_type !== srcVal) return false;
@ -974,8 +676,6 @@ function applyFilters() {
if (transferVal && (f.transfer_risk || '') !== transferVal) return false;
if (specialVal === '1' && !(f.special_category && f.special_category.length)) return false;
if (specialVal === 'photo' && !(f.face_count > 0)) return false;
if (roleVal === 'student' && f.user_role !== 'student') return false;
if (roleVal === 'staff' && f.user_role === 'student') return false;
return true;
});
const grid = document.getElementById('grid');
@ -1021,8 +721,7 @@ async function exportExcel() {
return;
}
// Browser / localhost fallback: fetch as blob and trigger download
const _roleParam = document.getElementById('filterRole')?.value || '';
const r = await fetch('/api/export_excel' + (_roleParam ? '?role=' + encodeURIComponent(_roleParam) : ''));
const r = await fetch('/api/export_excel');
if (!r.ok) {
const err = await r.json().catch(() => ({error: 'Export failed'}));
log('Export error: ' + (err.error || r.status), 'err');
@ -1063,8 +762,7 @@ async function exportArticle30() {
const btn = document.getElementById('exportA30Btn');
if (btn) { btn.disabled = true; btn.textContent = '⏳'; }
try {
const _roleParam30 = document.getElementById('filterRole')?.value || '';
const r = await fetch('/api/export_article30' + (_roleParam30 ? '?role=' + encodeURIComponent(_roleParam30) : ''));
const r = await fetch('/api/export_article30');
if (!r.ok) {
const err = await r.json().catch(() => ({error: 'Export failed'}));
log('Article 30 export error: ' + (err.error || r.status), 'err');
@ -1098,8 +796,6 @@ function clearFilters() {
if (ft) ft.value = '';
const fs = document.getElementById('filterSpecial');
if (fs) fs.value = '';
const fr = document.getElementById('filterRole');
if (fr) fr.value = '';
applyFilters();
}
@ -1165,7 +861,6 @@ window.loadDisposition = loadDisposition;
window.saveDisposition = saveDisposition;
window.closePreview = closePreview;
window.deleteItem = deleteItem;
window.redactItem = redactItem;
window.openBulkDelete = openBulkDelete;
window.closeBulkDelete = closeBulkDelete;
window._bdFilters = _bdFilters;
@ -1177,10 +872,6 @@ window._autoConnectSSEIfRunning = _autoConnectSSEIfRunning;
window._loadViewerResults = _loadViewerResults;
window.executeBulkDelete = executeBulkDelete;
window.applyFilters = applyFilters;
window.toggleSelectMode = toggleSelectMode;
window.toggleCardSelect = toggleCardSelect;
window.selectAllVisible = selectAllVisible;
window.applyBulkDisposition = applyBulkDisposition;
window.exportExcel = exportExcel;
window.exportArticle30 = exportArticle30;
window.clearFilters = clearFilters;
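
The `executeBulkDelete` hunk above contains two variants of the post-delete grid update: one marks the affected rows with `_deleted` so they stay visible, the other filters them out of the list. A minimal sketch of both, assuming plain arrays of `{id}` objects. The helper names `markDeleted` and `removeDeleted` are hypothetical and do not appear in the source.

```javascript
// Hypothetical helpers illustrating the two variants; not the project's code.

// Variant 1: keep deleted items in the list, flagged with _deleted = true.
// Falls back to the requested ids when the server response has no deleted_ids.
function markDeleted(items, requested, deletedIds) {
  const deletedSet = new Set(deletedIds || requested.map(x => x.id));
  items.forEach(x => { if (deletedSet.has(x.id)) x._deleted = true; });
  return items;
}

// Variant 2: drop every requested item from the list.
function removeDeleted(items, requested) {
  const deletedSet = new Set(requested.map(x => x.id));
  return items.filter(x => !deletedSet.has(x.id));
}
```

With the first variant, a renderer can grey out rows where `_deleted` is set. With the second, the rows simply disappear on the next render.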

View File

@ -67,7 +67,7 @@ async function doImportDB() {
}
if (mode === 'replace') {
if (!confirm(t('m365_db_import_replace_confirm',
'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdprscanner/scanner.db.\n\nProceed?'))) return;
'Replace mode will erase ALL existing scan data and restore from the archive.\n\nMake sure you have a manual backup of ~/.gdpr_scanner.db.\n\nProceed?'))) return;
}
btn.disabled = true;
stat.style.color = 'var(--muted)';
@ -125,12 +125,6 @@ function buildScanPayload() {
max_emails: parseInt(document.getElementById('optMaxEmails').value) || 200,
delta: document.getElementById('optDelta') ? document.getElementById('optDelta').checked : false,
scan_photos: document.getElementById('optScanPhotos') ? document.getElementById('optScanPhotos').checked : false,
skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false,
min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1,
ocr_lang: document.getElementById('optOcrLang')?.value || 'dan+eng',
cpr_only: document.getElementById('optCprOnly') ? document.getElementById('optCprOnly').checked : false,
scan_emails: document.getElementById('optScanEmails') ? document.getElementById('optScanEmails').checked : false,
scan_phones: document.getElementById('optScanPhones') ? document.getElementById('optScanPhones').checked : false,
retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false,
retention_years: parseInt(document.getElementById('optRetentionYears')?.value) || 5,
fiscal_year_end: document.getElementById('optFiscalYearEnd')?.value || '',
@ -138,39 +132,26 @@ function buildScanPayload() {
return { sources, fileSources, allSources, googleSources, user_ids, options };
}
async function checkCheckpoint(onNoCheckpoint) {
async function checkCheckpoint() {
const payload = buildScanPayload();
const banner = document.getElementById('resumeBanner');
const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
if (!hasSources) {
if (banner) banner.style.display = 'none';
onNoCheckpoint?.(); return;
}
// M365 sources without users — scan button will handle the alert
if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
if (banner) banner.style.display = 'none';
onNoCheckpoint?.(); return;
}
// Collect Google user emails for server-side checkpoint key computation
const googleUserEmails = payload.googleSources.length > 0
? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
: [];
if (!payload.sources.length && !payload.fileSources.length) return;
if (payload.sources.length && !payload.user_ids.length) return;
try {
const r = await fetch('/api/scan/checkpoint', {
method: 'POST', headers: {'Content-Type':'application/json'},
body: JSON.stringify({...payload, googleUserEmails})
body: JSON.stringify(payload)
});
const d = await r.json();
const banner = document.getElementById('resumeBanner');
if (d.exists) {
const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
document.getElementById('resumeBannerText').textContent =
t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
if (banner) banner.style.display = 'flex';
banner.style.display = 'flex';
} else {
if (banner) banner.style.display = 'none';
onNoCheckpoint?.();
banner.style.display = 'none';
}
} catch(e) { onNoCheckpoint?.(); }
} catch(e) { /* ignore */ }
}
async function clearCheckpointAndScan() {
@ -188,7 +169,8 @@ async function checkDeltaStatus() {
const row = document.getElementById('deltaStatusRow');
const txt = document.getElementById('deltaStatusText');
if (d.exists) {
txt.textContent = t('m365_delta_tokens_saved', 'Tokens saved for {n} source(s)').replace('{n}', d.count);
const src = d.count === 1 ? '1 source' : `${d.count} sources`;
txt.textContent = t('m365_delta_tokens_saved', `Tokens saved for ${src}`);
row.style.display = 'flex';
row.style.alignItems = 'center';
} else {
@ -336,16 +318,16 @@ function _attachScanListeners(source) {
var fill = document.getElementById('progressFill_' + src);
if (fill) fill.style.width = pct + '%';
document.getElementById('progressFile').textContent = d.file || '';
var statsEl = document.getElementById('progressStats');
var etaEl = document.getElementById('progressEta');
// Only update stats/ETA from M365 (has meaningful totals and ETA)
if (src === 'm365') {
// M365 sends index + total + ETA — show exact counter
if (statsEl && d.total) statsEl.textContent = (d.index || 0) + ' / ' + d.total;
if (etaEl && d.eta !== undefined) etaEl.textContent = d.eta ? ('ETA ' + d.eta) : '';
} else if (!S._m365ScanRunning) {
// Google / file: no total known upfront — show running count once M365 is done
if (statsEl && d.scanned !== undefined) statsEl.textContent = d.scanned + ' scanned';
if (etaEl) etaEl.textContent = '';
var statsEl = document.getElementById('progressStats');
if (statsEl && d.total) {
statsEl.textContent = (d.index || 0) + ' / ' + d.total;
}
var etaEl = document.getElementById('progressEta');
if (etaEl && d.eta !== undefined) {
etaEl.textContent = d.eta ? ('ETA ' + d.eta) : '';
}
}
});
source.addEventListener('scan_file', function(e) {
@ -381,24 +363,17 @@ function _attachScanListeners(source) {
source.addEventListener('scan_done', function(e) {
var d = JSON.parse(e.data);
console.log('[SSE] scan_done:', d);
// Only close SSE if the user started this scan via the Scan button.
// For scheduled scans, keep the SSE connection alive so future
// scheduler events are still received.
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
S._srcPct.m365 = 100;
S._m365ScanRunning = false;
_renderProgressSegments();
var _anyRunning = S._googleScanRunning || S._fileScanRunning;
// Clear M365 counter/ETA so Google/file progress can take over the display
if (_anyRunning) {
var _se = document.getElementById('progressStats');
var _ee = document.getElementById('progressEta');
if (_se) _se.textContent = '';
if (_ee) _ee.textContent = '';
}
// Only close SSE once all concurrent scans have finished.
// Closing early would drop google_scan_done / file_scan_done events and
// leave the UI stuck in scanning state.
if (S._userStartedScan && !_anyRunning) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
if (!_anyRunning) setLogLive('');
document.getElementById('scanBtn').disabled = _anyRunning;
document.getElementById('stopBtn').style.display = _anyRunning ? 'inline-block' : 'none';
@ -422,7 +397,6 @@ function _attachScanListeners(source) {
if (d.delta) checkDeltaStatus();
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
source.addEventListener('google_scan_done', function(e) {
var d = JSON.parse(e.data);
@ -431,10 +405,6 @@ function _attachScanListeners(source) {
S._googleScanRunning = false;
_renderProgressSegments();
if (!S._m365ScanRunning && !S._fileScanRunning) {
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
setLogLive('');
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
@ -451,7 +421,6 @@ function _attachScanListeners(source) {
log('Google scan complete \u2014 ' + d.flagged_count + ' flagged of ' + d.total_scanned, 'ok');
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
source.addEventListener('file_scan_done', function(e) {
var d = JSON.parse(e.data);
@ -460,10 +429,6 @@ function _attachScanListeners(source) {
S._fileScanRunning = false;
_renderProgressSegments();
if (!S._m365ScanRunning && !S._googleScanRunning) {
if (S._userStartedScan) {
S._userStartedScan = false;
if (S.es) { S.es.close(); S.es = null; }
}
setLogLive('');
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
@ -477,21 +442,14 @@ function _attachScanListeners(source) {
applyFilters();
}
}
log('Bestandsscan fuldf\u00f8rt \u2014 ' + d.flagged_count + ' flagget af ' + d.total_scanned, 'ok');
log('Bestandsscan fuldført \u2014 ' + d.flagged_count + ' flagget af ' + d.total_scanned, 'ok');
markOverdueCards();
loadTrend();
window.invalidateHistoryCache?.();
});
// sse_replay_done marks end of buffer replay — log a note so the user knows
// earlier events above were replayed from an already-running scan.
// Also retry loadHistorySession if it bailed during replay: scan_phase events
// from a completed scan's replay temporarily set running flags to true, causing
// the watchdog's loadHistorySession call to bail before scan_done clears them.
// earlier events above were replayed from an already-running scan
source.addEventListener('sse_replay_done', function() {
log(t('m365_sse_replay_note', 'Live log resumed \u2014 earlier entries replayed from running scan.'));
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning && !S._historyRefScanId) {
window.loadHistorySession?.(null);
}
});
}
@ -552,8 +510,6 @@ function startScan(resume) {
document.getElementById('statsSection').style.display = 'none';
document.getElementById('statsPill').style.display = 'none';
}
// Exit history mode — live SSE takes over
window.exitHistoryMode?.();
document.getElementById('resumeBanner').style.display = 'none';
document.getElementById('logPanel').innerHTML = '<div class="log-line log-live" id="logLive" style="display:none"></div>';
try { sessionStorage.removeItem(_LOG_SESSION_KEY); } catch(e) {}
@ -584,22 +540,6 @@ function startScan(resume) {
S._userStartedScan = true;
_ensureSSE();
// Revert to idle if every scan type that was supposed to start got rejected.
// Called after each 409 so we don't leave the UI stuck in "running" state
// while the previous scan's thread finishes winding down.
function _onScanConflict(label) {
log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err');
if (label === 'm365') S._m365ScanRunning = false;
if (label === 'file') S._fileScanRunning = false;
if (label === 'google') S._googleScanRunning = false;
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
if (S.es) { S.es.close(); S.es = null; }
S._userStartedScan = false;
}
}
setTimeout(() => {
// Fire M365 scan if any M365 sources are selected
if (sources.length > 0) {
@ -608,7 +548,7 @@ function startScan(resume) {
body: JSON.stringify({sources, user_ids, options, resume: !!resume,
profile_id: S._activeProfileId || null})
}).then(r => {
if (r.status === 409) { _onScanConflict('m365'); }
if (r.status === 409) { log('Scan already running', 'err'); }
}).catch(e => { log('Scan start failed: ' + e, 'err'); });
}
@ -622,17 +562,7 @@ function startScan(resume) {
if (!source) return;
fetch('/api/file_scan/start', {
method: 'POST', headers: {'Content-Type':'application/json'},
body: JSON.stringify(Object.assign({}, source, {
scan_photos: options.scan_photos || false,
skip_gps_images: options.skip_gps_images || false,
min_cpr_count: options.min_cpr_count || 1,
scan_emails: options.scan_emails || false,
scan_phones: options.scan_phones || false,
cpr_only: options.cpr_only || false,
ocr_lang: options.ocr_lang || 'dan+eng',
}))
}).then(r => {
if (r.status === 409) { _onScanConflict('file'); }
body: JSON.stringify(Object.assign({}, source, {scan_photos: options.scan_photos || false}))
}).catch(e => { log('File scan error: ' + e, 'err'); });
});
@ -655,7 +585,7 @@ function startScan(resume) {
options: options
})
}).then(r => {
if (r.status === 409) { _onScanConflict('google'); }
if (r.status === 409) { log('Google scan already running', 'err'); }
}).catch(e => { log('Google scan error: ' + e, 'err'); });
}
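
The `startScan` hunks above handle an HTTP 409 from a scan-start endpoint by clearing that scan type's running flag and reverting the UI to idle once no scan type is left running. A minimal sketch of that state transition as a pure function, assuming a state object with the same flag names as `S`. The function `applyScanConflict` is hypothetical, and the DOM and SSE side effects are left out.

```javascript
// Hypothetical pure version of the conflict handling; not the project's code.
// Returns a new state plus an `idle` flag the caller can use to re-enable
// the Scan button, hide the Stop button and close the event stream.
function applyScanConflict(state, label) {
  const next = Object.assign({}, state);
  if (label === 'm365') next._m365ScanRunning = false;
  if (label === 'file') next._fileScanRunning = false;
  if (label === 'google') next._googleScanRunning = false;
  const idle = !next._m365ScanRunning &&
               !next._googleScanRunning &&
               !next._fileScanRunning;
  // Once nothing is running, the scan no longer counts as user-started.
  if (idle) next._userStartedScan = false;
  return { state: next, idle: idle };
}
```

Keeping the transition free of DOM access makes the "revert only when every scan type was rejected" rule easy to check in isolation.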

View File

@ -18,19 +18,19 @@ function schedLoad() {
var descEl = document.getElementById('schedDesc_' + js.id);
if (!descEl) return;
var j2 = _schedJobs.find(function(x){ return x.id === js.id; });
var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j2.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily'));
var freqLabel = !j2 ? '' : (j2.frequency === 'weekly' ? 'Weekly' : j2.frequency === 'monthly' ? 'Monthly' : 'Daily');
var timeStr = !j2 ? '' : String(j2.hour||0).padStart(2,'0') + ':' + String(j2.minute||0).padStart(2,'0');
var base = freqLabel + ' ' + timeStr;
var runBtn = document.getElementById('schedRunBtn_' + js.id);
if (js.is_running) {
descEl.textContent = base + ' \u00b7 ' + t('m365_sched_running','Running...');
descEl.textContent = base + ' \u00b7 Running...';
if (runBtn) { runBtn.style.borderColor='#22c55e'; runBtn.style.color='#22c55e'; }
} else if (js.next_run) {
var dt = new Date(js.next_run);
descEl.textContent = base + ' \u00b7 ' + t('m365_sched_next','Next') + ': ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'});
descEl.textContent = base + ' \u00b7 Next: ' + dt.toLocaleString(undefined,{month:'short',day:'numeric',hour:'2-digit',minute:'2-digit'});
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
} else {
descEl.textContent = base + (js.enabled ? '' : ' \u00b7 ' + t('m365_sched_disabled','Disabled'));
descEl.textContent = base + (js.enabled ? '' : ' \u00b7 Disabled');
if (runBtn) { runBtn.style.borderColor='var(--border)'; runBtn.style.color='var(--muted)'; }
}
});
@ -41,23 +41,20 @@ function schedRenderJobs() {
var list = document.getElementById('schedJobList');
if (!list) return;
if (!_schedJobs.length) {
list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">' + t('m365_sched_no_jobs','No scheduled scans yet.') + '</div>';
list.innerHTML = '<div style="font-size:11px;color:var(--muted);padding:4px 0">No scheduled scans yet.</div>';
return;
}
list.innerHTML = _schedJobs.map(function(j) {
var sid = _esc(j.id);
var sname = _esc(j.name || 'Unnamed');
var freqLabel = j.frequency === 'weekly' ? t('m365_sched_freq_weekly','Weekly') : j.frequency === 'monthly' ? t('m365_sched_freq_monthly','Monthly') : t('m365_sched_freq_daily','Daily');
var freqLabel = j.frequency === 'weekly' ? 'Weekly' : j.frequency === 'monthly' ? 'Monthly' : 'Daily';
var timeStr = String(j.hour||0).padStart(2,'0') + ':' + String(j.minute||0).padStart(2,'0');
var desc = freqLabel + ' ' + timeStr;
var chk = j.enabled ? ' checked' : '';
var roBadge = j.report_only
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:#E8F4FD;color:#2980B9;border:1px solid #AED6F1;margin-left:4px">' + t('m365_sched_report_only','Report only') + '</span>'
: '';
return '<div style="display:flex;align-items:center;gap:6px;padding:5px 6px;border:1px solid var(--border);border-radius:6px;background:var(--surface)">'
+ '<label class="toggle" style="flex:unset;margin:0"><input type="checkbox"'+chk+' onchange="schedToggleEnabled(\''+sid+'\',this.checked)"><span class="toggle-slider"></span></label>'
+ '<div style="flex:1;min-width:0">'
+ '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+roBadge+'</div>'
+ '<div style="font-size:12px;font-weight:600;white-space:nowrap;overflow:hidden;text-overflow:ellipsis">'+sname+'</div>'
+ '<div id="schedDesc_'+sid+'" style="font-size:10px;color:var(--muted)">'+desc+'</div>'
+ '</div>'
+ '<button onclick="schedRunJob(\''+sid+'\')" id="schedRunBtn_'+sid+'" style="background:none;border:1px solid var(--border);color:var(--muted);padding:2px 7px;border-radius:4px;font-size:10px;cursor:pointer" title="Run now">&#9654;</button>'
@ -92,8 +89,6 @@ function schedAddJob() {
document.getElementById('schedMinute').value = 0;
document.getElementById('schedAutoEmail').checked = false;
document.getElementById('schedAutoRetention').checked = false;
document.getElementById('schedReportOnly').checked = false;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_new', 'New scheduled scan');
schedPopulateProfiles('');
@ -116,8 +111,6 @@ function schedEditJob(id) {
document.getElementById('schedMinute').value = j.minute != null ? j.minute : 0;
document.getElementById('schedAutoEmail').checked = !!j.auto_email;
document.getElementById('schedAutoRetention').checked = !!j.auto_retention;
document.getElementById('schedReportOnly').checked = !!j.report_only;
schedToggleReportOnly();
var titleEl = document.getElementById('schedEditorTitle');
if (titleEl) titleEl.textContent = t('m365_sched_editor_edit', 'Edit scheduled scan');
schedPopulateProfiles(j.profile_id || '');
@ -130,19 +123,6 @@ function schedCancelEdit() {
document.getElementById('schedJobEditor').style.display = 'none';
}
function schedToggleReportOnly() {
var ro = !!(document.getElementById('schedReportOnly') || {}).checked;
var profileRow = document.getElementById('schedProfileRow');
var hint = document.getElementById('schedReportOnlyHint');
if (profileRow) profileRow.style.opacity = ro ? '0.4' : '';
if (hint) hint.style.display = ro ? 'block' : 'none';
// Enforce auto_email when switching to report-only
if (ro) {
var ae = document.getElementById('schedAutoEmail');
if (ae) ae.checked = true;
}
}
function schedSaveJob() {
var name = document.getElementById('schedName').value.trim();
if (!name) {
@ -164,7 +144,6 @@ function schedSaveJob() {
profile_id: document.getElementById('schedProfile').value,
auto_email: document.getElementById('schedAutoEmail').checked,
auto_retention: document.getElementById('schedAutoRetention').checked,
report_only: document.getElementById('schedReportOnly').checked,
};
var st = document.getElementById('schedSaveStatus');
st.style.color = 'var(--muted)'; st.textContent = 'Saving...';
@ -238,7 +217,7 @@ function schedLoadHistory() {
if (!el) return;
fetch('/api/scheduler/history?limit=10').then(function(r){ return r.json(); }).then(function(d) {
var runs = d.runs || [];
if (!runs.length) { el.innerHTML = '<em>' + t('m365_sched_no_runs','No scheduled runs yet') + '</em>'; return; }
if (!runs.length) { el.innerHTML = '<em>No scheduled runs yet</em>'; return; }
var html = '';
runs.forEach(function(r) {
var ts = r.started_at ? new Date(r.started_at * 1000).toLocaleString() : '-';
@ -314,17 +293,13 @@ function stLoadSmtp() {
const set = function(id, val) { const el=document.getElementById(id); if(el) el.value=val||''; };
set('st-smtpHost', d.host);
set('st-smtpPort', d.port || 587);
set('st-smtpUser', d.username);
set('st-smtpUser', d.user);
set('st-smtpFrom', d.from_addr);
set('st-smtpTo', Array.isArray(d.recipients) ? d.recipients.join(', ') : (d.recipients||''));
const tls = document.getElementById('st-smtpTls');
if (tls) tls.checked = d.use_tls !== false;
if (tls) tls.checked = d.starttls !== false;
const pw = document.getElementById('st-smtpPw');
if (pw) pw.value = d.has_password ? '\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : '';
const ae = document.getElementById('st-smtpAutoEmail');
if (ae) ae.checked = !!d.auto_email_manual;
const ps = document.getElementById('st-smtpPreferSmtp');
if (ps) ps.checked = !!d.prefer_smtp;
}).catch(function(){});
}
@ -335,15 +310,10 @@ async function stSmtpSave() {
const body = {
host: document.getElementById('st-smtpHost').value.trim(),
port: parseInt(document.getElementById('st-smtpPort').value) || 587,
// Backend (routes/email.py) reads these exact keys — `username`/`use_tls`,
// not `user`/`starttls`. Sending the wrong keys leaves username empty so
// server.login() is skipped and the SMTP server rejects the send.
username: document.getElementById('st-smtpUser').value.trim(),
user: document.getElementById('st-smtpUser').value.trim(),
from_addr: document.getElementById('st-smtpFrom').value.trim(),
recipients: document.getElementById('st-smtpTo').value.split(/[,;]/).map(function(s){return s.trim();}).filter(Boolean),
use_tls: document.getElementById('st-smtpTls').checked,
auto_email_manual: !!(document.getElementById('st-smtpAutoEmail') || {}).checked,
prefer_smtp: !!(document.getElementById('st-smtpPreferSmtp') || {}).checked,
starttls: document.getElementById('st-smtpTls').checked,
};
if (pw !== null) body.password = pw;
st.style.color = 'var(--muted)'; st.textContent = t('m365_smtp_saving','Saving...');
@ -364,16 +334,7 @@ async function stSmtpTest() {
body:JSON.stringify({})});
const d = await r.json();
if (d.ok) {
let msg;
if (d.method === 'graph') {
msg = t('m365_smtp_test_ok_graph','Test email sent via Microsoft Graph to') + ' ' + (d.recipients||[]).join(', ');
} else if (d.method === 'smtp') {
msg = t('m365_smtp_test_ok_smtp','Test email sent via SMTP to') + ' ' + (d.recipients||[]).join(', ');
if (d.graph_also_failed) msg += ' ' + t('m365_smtp_graph_also_failed','(⚠ Graph also failed — Mail.Send not granted)');
} else {
msg = d.message || t('m365_smtp_test_ok','Test email sent');
}
if (st) { st.style.color='var(--accent)'; st.textContent='\u2714 ' + msg; }
if (st) { st.style.color='var(--accent)'; st.textContent='\u2714 ' + (d.message || t('m365_smtp_test_ok','Connection successful')); }
} else {
if (st) { st.style.color='var(--danger)'; st.textContent='\u2717 ' + (d.error || t('m365_smtp_test_fail','Connection failed')); }
}
@ -464,7 +425,6 @@ window.schedSaveJob = schedSaveJob;
window.schedDeleteJob = schedDeleteJob;
window.schedRunJob = schedRunJob;
window.schedToggleFreqRows = schedToggleFreqRows;
window.schedToggleReportOnly = schedToggleReportOnly;
window.schedPopulateProfiles = schedPopulateProfiles;
window.schedLoadHistory = schedLoadHistory;
window.schedUpdateSidebarIndicator = schedUpdateSidebarIndicator;
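
The `stSmtpSave` hunk above builds the SMTP settings payload, and its comment states that the backend reads the keys `username` and `use_tls` rather than `user` and `starttls`. A minimal sketch of that payload construction from plain values instead of DOM lookups. The function `buildSmtpPayload` and the shape of its `form` argument are hypothetical.

```javascript
// Hypothetical helper; key names follow the variant the hunk's comment
// describes as matching the backend.
function buildSmtpPayload(form) {
  const body = {
    host: form.host.trim(),
    port: parseInt(form.port, 10) || 587,   // default port when empty/invalid
    username: form.user.trim(),
    from_addr: form.from.trim(),
    // Accept comma- or semicolon-separated recipients, dropping empty entries.
    recipients: form.to.split(/[,;]/).map(s => s.trim()).filter(Boolean),
    use_tls: Boolean(form.tls),
  };
  // As in the hunk, the password is only included when it is not null
  // (assumed here to mean "leave the stored password unchanged").
  if (form.password !== null) body.password = form.password;
  return body;
}
```

Passing `password: null` therefore yields a payload without a `password` key at all, rather than an empty string.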

View File

@ -62,15 +62,14 @@ function renderSourcesPanel() {
S._pendingGoogleSources = null;
}
// File sources (local / SMB / SFTP) — one entry per saved source
// File sources (local / SMB) — one entry per saved source
if (S._fileSources.length > 0) {
html += '<div style="margin:6px 0 2px;font-size:10px;color:var(--muted);text-transform:uppercase;letter-spacing:.04em">'
+ '<hr style="border:none;border-top:1px solid var(--border);margin:1px 0 2px">';
S._fileSources.forEach(function(s) {
const isSftp = s.source_type === 'sftp';
const isSmb = !isSftp && s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSftp ? '\uD83D\uDD12' : (isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1');
const label = s.label || s.path || s.id;
const isSmb = s.path && (s.path.startsWith('//') || s.path.startsWith('\\\\'));
const icon = isSmb ? '\uD83C\uDF10' : '\uD83D\uDCC1';
const label = s.label || s.path || s.id;
const isChecked = (s.id in checked) ? checked[s.id] : true;
html += '<label class="source-check">'
+ '<input type="checkbox" data-source-id="' + _esc(s.id) + '" data-source-type="file"' + (isChecked ? ' checked' : '') + '>'
@ -237,209 +236,17 @@ function closeSettings() {
}
function switchSettingsTab(tab) {
['general','security','scheduler','email','database','auditlog','ai'].forEach(function(t) {
['general','security','scheduler','email','database'].forEach(function(t) {
var cap = t.charAt(0).toUpperCase() + t.slice(1);
var pane = document.getElementById('stPane' + cap);
var btn = document.getElementById('stTab' + cap);
if (pane) pane.classList.toggle('active', t === tab);
if (btn) btn.classList.toggle('active', t === tab);
});
if (tab === 'general') stLoadUpdateSettings();
if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); if (typeof stLoadInterfacePinStatus === 'function') stLoadInterfacePinStatus(); }
if (tab === 'security') { stLoadPinStatus(); if (typeof stLoadViewerPinStatus === 'function') stLoadViewerPinStatus(); }
if (tab === 'email') stLoadSmtp();
if (tab === 'database') stLoadDbStats();
if (tab === 'scheduler') schedLoad();
if (tab === 'auditlog') stLoadAuditLog();
if (tab === 'ai') stLoadAiSettings();
}
async function stLoadAuditLog() {
const tbody = document.getElementById('stAuditTableBody');
if (!tbody) return;
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_loading')}</td></tr>`;
try {
const rows = await fetch('/api/audit_log?limit=200').then(r => r.json());
if (!Array.isArray(rows) || !rows.length) {
tbody.innerHTML = `<tr><td colspan="4" style="padding:8px;color:var(--muted)">${t('m365_audit_empty')}</td></tr>`;
return;
}
tbody.innerHTML = rows.map(function(r) {
const d = new Date(r.ts * 1000);
const ts = d.toLocaleDateString() + ' ' + d.toLocaleTimeString();
return '<tr style="border-bottom:1px solid var(--border)">'
+ '<td style="padding:4px 8px;white-space:nowrap;color:var(--muted);font-size:11px">' + window._escHtml(ts) + '</td>'
+ '<td style="padding:4px 8px"><span style="font-family:monospace;background:var(--bg);border:1px solid var(--border);border-radius:3px;padding:1px 4px;font-size:11px">' + window._escHtml(r.action) + '</span></td>'
+ '<td style="padding:4px 8px;color:var(--text);font-size:12px">' + window._escHtml(r.detail) + '</td>'
+ '<td style="padding:4px 8px;color:var(--muted);font-size:11px">' + window._escHtml(r.ip) + '</td>'
+ '</tr>';
}).join('');
} catch(e) {
tbody.innerHTML = '<tr><td colspan="4" style="padding:8px;color:var(--danger)">' + window._escHtml(String(e)) + '</td></tr>';
}
}
// ── AI / Claude NER settings ─────────────────────────────────────────────────
async function stLoadAiSettings() {
try {
const cfg = await fetch('/api/settings/claude').then(r => r.json());
const cb = document.getElementById('aiEnabled');
if (cb) cb.checked = !!cfg.enabled;
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = cfg.api_key_set
? t('m365_ai_key_set', 'API key saved')
: t('m365_ai_key_not_set', 'No API key saved');
} catch(e) { /* ignore */ }
}
async function stAiSave() {
const enabled = !!(document.getElementById('aiEnabled') || {}).checked;
const keyVal = (document.getElementById('aiApiKey') || {}).value || '';
const status = document.getElementById('aiStatus');
const payload = { enabled };
if (keyVal) payload.api_key = keyVal;
try {
await fetch('/api/settings/claude', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(payload),
});
if (status) { status.textContent = t('m365_ai_saved', 'Saved'); status.style.color = 'var(--success)'; }
if (keyVal) {
const inp = document.getElementById('aiApiKey');
if (inp) inp.value = '';
const ks = document.getElementById('aiKeyStatus');
if (ks) ks.textContent = t('m365_ai_key_set', 'API key saved');
}
setTimeout(function() { if (status) status.textContent = ''; }, 2000);
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stAiTest() {
const status = document.getElementById('aiStatus');
if (status) { status.textContent = t('m365_ai_testing', 'Testing…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/settings/claude/test', { method: 'POST' }).then(r => r.json());
if (status) {
status.textContent = res.ok
? t('m365_ai_test_ok', 'API key valid')
: (t('m365_ai_test_fail', 'Test failed') + ': ' + (res.error || ''));
status.style.color = res.ok ? 'var(--success)' : 'var(--danger)';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
// ── Software updates ─────────────────────────────────────────────────────────
async function stLoadUpdateSettings() {
try {
const cfg = await fetch('/api/update/settings').then(r => r.json());
const grp = document.getElementById('stUpdateGroup');
if (grp) grp.style.display = cfg.supported ? '' : 'none';
const cb = document.getElementById('stAutoUpdate');
if (cb) cb.checked = !!cfg.auto_update;
} catch(e) { /* ignore */ }
}
async function stSaveAutoUpdate() {
const cb = document.getElementById('stAutoUpdate');
try {
await fetch('/api/update/settings', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({ auto_update: !!(cb && cb.checked) }),
});
} catch(e) { /* ignore */ }
}
async function stCheckUpdate() {
const status = document.getElementById('stUpdateStatus');
const commits = document.getElementById('stUpdateCommits');
const applyBtn = document.getElementById('stApplyUpdateBtn');
if (status) { status.textContent = t('m365_update_checking', 'Checking…'); status.style.color = 'var(--muted)'; }
if (commits) commits.style.display = 'none';
if (applyBtn) applyBtn.style.display = 'none';
try {
const res = await fetch('/api/update/check').then(r => r.json());
if (!status) return;
if (res.error) {
status.textContent = t('m365_update_failed', 'Update check failed') + ': ' + res.error;
status.style.color = 'var(--danger)';
} else if (res.up_to_date) {
status.textContent = t('m365_update_uptodate', 'You are running the latest version.') + ' (' + res.current + ')';
status.style.color = 'var(--success)';
} else {
status.textContent = t('m365_update_available', 'Update available') + ': ' + res.current + ' → ' + res.latest;
status.style.color = 'var(--accent)';
if (commits && res.commits && res.commits.length) {
commits.innerHTML = res.commits.map(function(c) { return window._escHtml(c); }).join('<br>');
commits.style.display = '';
}
if (applyBtn) applyBtn.style.display = '';
}
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
}
}
async function stApplyUpdate() {
const status = document.getElementById('stUpdateStatus');
const applyBtn = document.getElementById('stApplyUpdateBtn');
const checkBtn = document.getElementById('stCheckUpdateBtn');
if (applyBtn) applyBtn.disabled = true;
if (checkBtn) checkBtn.disabled = true;
if (status) { status.textContent = t('m365_update_installing', 'Installing update — the app will restart…'); status.style.color = 'var(--muted)'; }
try {
const res = await fetch('/api/update/apply', { method: 'POST' }).then(r => r.json());
if (!res.ok) {
const msg = res.code === 'scan_running'
? t('m365_update_scan_running', 'Cannot update while a scan is running.')
: (res.error || 'Update failed');
if (status) { status.textContent = msg; status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
return;
}
if (!res.updated) { // already up to date
if (status) { status.textContent = t('m365_update_uptodate', 'You are running the latest version.'); status.style.color = 'var(--success)'; }
if (applyBtn) { applyBtn.disabled = false; applyBtn.style.display = 'none'; }
if (checkBtn) checkBtn.disabled = false;
return;
}
_stWaitForRestart();
} catch(e) {
if (status) { status.textContent = String(e); status.style.color = 'var(--danger)'; }
if (applyBtn) applyBtn.disabled = false;
if (checkBtn) checkBtn.disabled = false;
}
}
// Poll until the server has gone down and come back, then reload the page.
function _stWaitForRestart() {
let tries = 0, sawDown = false;
const iv = setInterval(async function() {
tries++;
try {
await fetch('/api/about', { cache: 'no-store' }).then(r => { if (!r.ok) throw new Error(); });
if (sawDown || tries >= 5) { clearInterval(iv); location.reload(); }
} catch(e) {
sawDown = true;
}
if (tries > 90) clearInterval(iv); // give up after ~3 minutes
}, 2000);
}
function stAiToggleKey() {
const inp = document.getElementById('aiApiKey');
const btn = document.getElementById('aiShowKeyBtn');
if (!inp) return;
const show = inp.type === 'password';
inp.type = show ? 'text' : 'password';
if (btn) btn.textContent = show ? t('m365_ai_hide_key', 'Hide') : t('m365_ai_show_key', 'Show');
}
// ── Window exports (HTML handlers + cross-module calls) ─────────────────────
@ -458,14 +265,5 @@ window.confirmPinPrompt = confirmPinPrompt;
window.openSettings = openSettings;
window.closeSettings = closeSettings;
window.switchSettingsTab = switchSettingsTab;
window.stLoadAuditLog = stLoadAuditLog;
window.stLoadAiSettings = stLoadAiSettings;
window.stAiSave = stAiSave;
window.stAiTest = stAiTest;
window.stAiToggleKey = stAiToggleKey;
window.stLoadUpdateSettings = stLoadUpdateSettings;
window.stSaveAutoUpdate = stSaveAutoUpdate;
window.stCheckUpdate = stCheckUpdate;
window.stApplyUpdate = stApplyUpdate;
window._M365_SOURCES = _M365_SOURCES;
window._pinCallback = _pinCallback;
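
Illustrative sketch, not part of the compared commits: the restart polling in `_stWaitForRestart` above can be read as a small state machine. The helper name `restartPollStep` and the action strings are hypothetical; only the thresholds (reload after the server was seen down or after 5 successful tries, give up after 90 tries) come from the diff.

```javascript
// Hypothetical pure version of one polling tick from _stWaitForRestart.
// `state` is { tries, sawDown }; `serverUp` says whether the health request succeeded.
function restartPollStep(state, serverUp) {
  const tries = state.tries + 1;
  let sawDown = state.sawDown;
  let action = 'wait';
  if (serverUp) {
    // Reload once the server has been seen down and is back,
    // or after 5 ticks if the restart was too quick to observe.
    if (sawDown || tries >= 5) action = 'reload';
  } else {
    sawDown = true;
  }
  // Mirror the original cut-off: stop polling after more than 90 ticks.
  if (action === 'wait' && tries > 90) action = 'give_up';
  return { tries, sawDown, action };
}

// Usage: server answers, goes down, then comes back.
let s = { tries: 0, sawDown: false };
s = restartPollStep(s, true);   // still waiting
s = restartPollStep(s, false);  // restart observed
s = restartPollStep(s, true);   // back up, so reload
console.log(s.action);
```

Keeping the decision separate from `setInterval` and `fetch` would make the thresholds testable without a running server.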

@ -28,9 +28,4 @@ export const S = {
_pendingGoogleSources: null,
// Sources
_fileSources: [],
// History browser
_historyRefScanId: null, // null = live/SSE, number = viewing a past session
// Bulk disposition
_selectMode: false,
_selectedIds: new Set(),
};

@ -28,11 +28,6 @@ async function loadUsers() {
u.selected = prevSelected.has(u.id) ? prevSelected.get(u.id) : false;
});
S._allUsers = [...fetched, ...toAdd];
// Apply deferred "select all" from a profile chosen before users loaded
if (window._pendingProfileAllUsers) {
S._allUsers.forEach(u => { u.selected = true; });
window._pendingProfileAllUsers = false;
}
renderAccountList(fetched.length <= 1);
// Merge Google users separately so they're not blocked by M365 auth timing
_mergeGoogleUsers();
@ -176,7 +171,7 @@ async function loadLastScanSummary() {
try {
const r = await fetch('/api/db/stats');
const d = await r.json();
if (!d.scan_id || S.flaggedData.length > 0 || S._m365ScanRunning || S._googleScanRunning || S._fileScanRunning) return;
if (!d.scan_id || S.flaggedData.length > 0) return;
const panel = document.getElementById('lastScanSummary');
const empty = document.getElementById('emptyState');
if (!panel || !empty) return;

@ -1,160 +1,25 @@
// ── Viewer token management (#33) ─────────────────────────────────────────────
// Share button → modal to create, copy, and revoke read-only viewer links.
import { S } from './state.js';
let _shareBaseUrl = null; // cached so Copy buttons can build the URL synchronously
async function _getShareBaseUrl() {
if (_shareBaseUrl) return _shareBaseUrl;
// The LAN-IP probe exists only to fix links when the operator browses the
// app at localhost — those would be unusable for remote users. Any other
// origin (LAN IP, or a reverse-proxied HTTPS hostname) is already routable,
// and rewriting it to http://<LAN-IP> would bypass the proxy's TLS.
const host = window.location.hostname;
if (window.location.protocol === 'https:' ||
(host !== 'localhost' && host !== '127.0.0.1' && host !== '[::1]')) {
_shareBaseUrl = window.location.origin;
return _shareBaseUrl;
}
// Use the machine's LAN IP so links work for remote users, not just localhost.
try {
const r = await fetch('/api/local_ip');
if (r.ok) {
const d = await r.json();
if (d.ip && d.ip !== '127.0.0.1') {
_shareBaseUrl = 'http://' + d.ip + ':' + window.location.port;
return _shareBaseUrl;
return 'http://' + d.ip + ':' + window.location.port;
}
}
} catch(e) {}
_shareBaseUrl = window.location.origin;
return _shareBaseUrl;
}
// ── User autocomplete for Share modal ────────────────────────────────────────
// Holds the resolved user when one is picked from the dropdown.
// Cleared on modal reset or when the input is edited manually.
let _selectedScopeUser = null; // { emails: string[], display_name: string }
let _userAcInit = false;
function _initUserAutocomplete() {
if (_userAcInit) return;
_userAcInit = true;
const input = document.getElementById('shareScopeUser');
const drop = document.getElementById('shareScopeUserDropdown');
if (!input || !drop) return;
input.addEventListener('input', () => {
_selectedScopeUser = null; // user edited manually — discard dropdown selection
_renderUserDropdown(input.value);
});
input.addEventListener('focus', () => _renderUserDropdown(input.value));
input.addEventListener('keydown', e => {
if (e.key === 'Escape') { drop.style.display = 'none'; }
if (e.key === 'ArrowDown') { e.preventDefault(); drop.querySelector('[data-uid]')?.focus(); }
});
drop.addEventListener('keydown', e => {
if (e.key === 'Escape') { drop.style.display = 'none'; input.focus(); }
if (e.key === 'ArrowDown') { e.preventDefault(); document.activeElement?.nextElementSibling?.focus(); }
if (e.key === 'ArrowUp') {
e.preventDefault();
const prev = document.activeElement?.previousElementSibling;
prev ? prev.focus() : input.focus();
}
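if (e.key === 'Enter') {
const el = document.activeElement;
// data-uid holds only an index, while _selectUser expects the user object,
// so reuse the row's own mousedown handler, which closes over that object.
if (el?.dataset?.uid) el.dispatchEvent(new MouseEvent('mousedown', { cancelable: true }));
}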
});
document.addEventListener('click', e => {
if (!document.getElementById('shareScopeUserWrap')?.contains(e.target))
drop.style.display = 'none';
}, true);
}
function _renderUserDropdown(query) {
const drop = document.getElementById('shareScopeUserDropdown');
if (!drop) return;
const users = S._allUsers;
if (!users.length) { drop.style.display = 'none'; return; }
const q = (query || '').trim().toLowerCase();
const matches = (q
? users.filter(u =>
(u.displayName || '').toLowerCase().includes(q) ||
(u.email || '').toLowerCase().includes(q) ||
(u.googleEmail || '').toLowerCase().includes(q))
: users
).slice(0, 8);
if (!matches.length) { drop.style.display = 'none'; return; }
drop.innerHTML = '';
matches.forEach((u, i) => {
const emails = [u.email, u.googleEmail].filter(Boolean);
const emailLbl = emails.join(', ');
const roleLbl = u.userRole === 'staff' ? t('share_scope_staff', 'Staff')
: u.userRole === 'student' ? t('share_scope_student', 'Students')
: '';
const row = document.createElement('div');
row.tabIndex = 0;
row.dataset.uid = i; // index into matches; resolved in _selectUser
row.style.cssText = 'display:flex;align-items:center;gap:8px;padding:6px 10px;cursor:pointer;font-size:12px'
+ (i < matches.length - 1 ? ';border-bottom:1px solid var(--border)' : '');
row.innerHTML =
'<div style="flex:1;min-width:0">' +
'<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' +
(u.displayName || emails[0] || '') +
(roleLbl ? ' <span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;font-weight:600">' + roleLbl + '</span>' : '') +
'</div>' +
'<div style="font-size:10px;color:var(--muted);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' + emailLbl + '</div>' +
'</div>';
row.addEventListener('mouseenter', () => row.style.background = 'var(--surface)');
row.addEventListener('mouseleave', () => row.style.background = '');
row.addEventListener('focus', () => row.style.background = 'var(--surface)');
row.addEventListener('blur', () => row.style.background = '');
row.addEventListener('mousedown', e => {
e.preventDefault();
_selectUser(u);
});
drop.appendChild(row);
});
drop.style.display = '';
}
function _selectUser(u) {
const input = document.getElementById('shareScopeUser');
const drop = document.getElementById('shareScopeUserDropdown');
const emails = [u.email, u.googleEmail].filter(Boolean);
_selectedScopeUser = {
emails: emails,
display_name: u.displayName || emails[0] || '',
};
if (input) input.value = u.displayName || emails[0] || '';
if (drop) drop.style.display = 'none';
}
function _shareScopeTypeChanged() {
const type = document.getElementById('shareScopeType')?.value || '';
document.getElementById('shareScopeRoleWrap').style.display = type === 'role' ? '' : 'none';
document.getElementById('shareScopeUserWrap').style.display = type === 'user' ? '' : 'none';
if (type === 'user') _initUserAutocomplete();
}
function _resetShareForm() {
document.getElementById('shareLabel').value = '';
document.getElementById('shareExpiry').value = '30';
const scopeType = document.getElementById('shareScopeType');
if (scopeType) { scopeType.value = ''; _shareScopeTypeChanged(); }
_selectedScopeUser = null;
const scopeUser = document.getElementById('shareScopeUser');
if (scopeUser) scopeUser.value = '';
const scopeDrop = document.getElementById('shareScopeUserDropdown');
if (scopeDrop) scopeDrop.style.display = 'none';
const vf = document.getElementById('shareValidFrom'); if (vf) vf.value = '';
const vt = document.getElementById('shareValidTo'); if (vt) vt.value = '';
return window.location.origin;
}
function openShareModal() {
document.getElementById('shareBackdrop').classList.add('open');
_resetShareForm();
document.getElementById('shareNewLinkRow').style.display = 'none';
document.getElementById('shareLabel').value = '';
document.getElementById('shareExpiry').value = '30';
_renderTokenList();
fetch('/api/viewer/pin').then(function(r){ return r.json(); }).then(function(d) {
const el = document.getElementById('sharePinStatus');
@ -166,7 +31,7 @@ function closeShareModal() {
document.getElementById('shareBackdrop').classList.remove('open');
}
async function _renderTokenList(highlightToken) {
async function _renderTokenList() {
const list = document.getElementById('shareTokenList');
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('lbl_loading', 'Loading…') + '</div>';
try {
@ -186,31 +51,10 @@ async function _renderTokenList(highlightToken) {
: '—';
const row = document.createElement('div');
row.style.cssText = 'display:flex;align-items:center;gap:8px;padding:6px 10px;background:var(--bg);border:1px solid var(--border);border-radius:6px;font-size:12px';
const roleVal = tok.scope?.role || '';
const roleLbl = roleVal === 'student' ? t('share_scope_student', 'Students')
: roleVal === 'staff' ? t('share_scope_staff', 'Staff')
: '';
const roleBadge = roleLbl
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--accent);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle">' + roleLbl + '</span>'
: '';
const userScope = tok.scope?.user;
const userLbl = tok.scope?.display_name
|| (Array.isArray(userScope) ? userScope.join(', ') : (userScope || ''));
const userBadge = userLbl
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--muted);color:#fff;margin-left:5px;font-weight:600;vertical-align:middle;max-width:140px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;display:inline-block">' + userLbl + '</span>'
: '';
const dateFrom = tok.scope?.valid_from || '';
const dateTo = tok.scope?.valid_to || '';
const dateBadge = (dateFrom || dateTo)
? '<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:rgba(80,160,80,.25);color:var(--text);margin-left:5px;font-weight:600;vertical-align:middle">' +
(dateFrom || '…') + ' ' + (dateTo || '…') +
'</span>'
: '';
row.innerHTML =
'<div style="flex:1;min-width:0">' +
'<div style="font-weight:500;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap">' +
(tok.label || '<span style="color:var(--muted);font-style:italic">' + t('share_unlabelled', 'Unlabelled') + '</span>') +
roleBadge + userBadge + dateBadge +
'</div>' +
'<div style="font-size:10px;color:var(--muted);margin-top:1px">' +
t('share_expires_prefix', 'Expires:') + ' ' + expires + ' &nbsp;·&nbsp; ' + t('share_last_used', 'Last used:') + ' ' + lastUsed +
@ -221,17 +65,6 @@ async function _renderTokenList(highlightToken) {
'<button title="' + t('share_revoke', 'Revoke') + '" onclick="revokeToken(\'' + tok.token + '\',this.closest(\'div[style]\'))" ' +
'style="height:24px;padding:0 8px;background:none;border:1px solid var(--danger);color:var(--danger);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0">' + t('share_revoke', 'Revoke') + '</button>';
list.appendChild(row);
// Briefly highlight a freshly created link so it is easy to find and copy.
if (highlightToken && tok.token === highlightToken) {
row.style.transition = 'border-color .3s, background .3s';
row.style.borderColor = 'var(--accent)';
row.style.background = 'rgba(80,160,80,.18)';
setTimeout(function() { row.scrollIntoView({block: 'nearest'}); }, 0);
setTimeout(function() {
row.style.borderColor = 'var(--border)';
row.style.background = 'var(--bg)';
}, 2500);
}
});
} catch(e) {
list.innerHTML = '<div style="font-size:12px;color:var(--danger);padding:4px 0">' + t('share_load_error', 'Failed to load links.') + '</div>';
@ -239,34 +72,10 @@ async function _renderTokenList(highlightToken) {
}
async function createShareLink() {
const label = document.getElementById('shareLabel').value.trim();
const expiry = document.getElementById('shareExpiry').value;
const scopeType = document.getElementById('shareScopeType')?.value || '';
const validFrom = document.getElementById('shareValidFrom')?.value || '';
const validTo = document.getElementById('shareValidTo')?.value || '';
const body = {label};
const label = document.getElementById('shareLabel').value.trim();
const expiry = document.getElementById('shareExpiry').value;
const body = {label};
if (expiry) body.expires_days = parseInt(expiry, 10);
if (scopeType === 'role') {
const role = document.getElementById('shareScope')?.value || '';
if (role) body.scope = {role};
} else if (scopeType === 'user') {
if (_selectedScopeUser) {
body.scope = { user: _selectedScopeUser.emails, display_name: _selectedScopeUser.display_name };
} else {
// Manual entry fallback — treat raw input as a single email
const email = (document.getElementById('shareScopeUser')?.value || '').trim().toLowerCase();
if (!email || !email.includes('@')) {
alert(t('share_scope_user_invalid', 'Please enter a valid email address for the user scope.'));
return;
}
body.scope = { user: [email], display_name: email };
}
}
if (validFrom || validTo) {
if (!body.scope) body.scope = {};
if (validFrom) body.scope.valid_from = validFrom;
if (validTo) body.scope.valid_to = validTo;
}
try {
const r = await fetch('/api/viewer/tokens', {
method: 'POST', headers: {'Content-Type':'application/json'},
@ -274,51 +83,48 @@ async function createShareLink() {
});
if (!r.ok) throw new Error('Server error ' + r.status);
const entry = await r.json();
// The new link appears in the active-links list below (each row has its
// own Copy button) — reset the form and highlight the just-created row
// rather than leaving a stale link preview in the create box.
_resetShareForm();
_renderTokenList(entry.token);
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(entry.token);
const urlInput = document.getElementById('shareNewLinkUrl');
urlInput.value = url;
document.getElementById('shareNewLinkRow').style.display = 'block';
document.getElementById('shareCopyBtn').textContent = t('log_copy', 'Copy');
document.getElementById('shareLabel').value = '';
_renderTokenList();
} catch(e) {
alert(t('share_create_error', 'Failed to create link:') + ' ' + e.message);
}
}
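
Illustrative sketch, not part of the compared commits: the scope handling in `createShareLink` (role scope, user scope with a manual-email fallback, optional validity dates) can be expressed as a pure function. The name `buildShareScope`, its option names and the `invalid_email` error code are assumptions made for this example; the resulting field names (`role`, `user`, `display_name`, `valid_from`, `valid_to`) are the ones used in the diff.

```javascript
// Hypothetical helper: builds the `scope` object for a viewer token request.
// Returns { scope } on success (scope may be null) or { error } on bad input.
function buildShareScope(opts) {
  const { scopeType, role, selectedUser, rawEmail, validFrom, validTo } = opts;
  let scope = null;
  if (scopeType === 'role') {
    if (role) scope = { role };
  } else if (scopeType === 'user') {
    if (selectedUser) {
      // A user picked from the dropdown may carry several addresses.
      scope = { user: selectedUser.emails, display_name: selectedUser.display_name };
    } else {
      // Manual entry: treat the raw input as a single email address.
      const email = (rawEmail || '').trim().toLowerCase();
      if (!email || !email.includes('@')) return { error: 'invalid_email' };
      scope = { user: [email], display_name: email };
    }
  }
  if (validFrom || validTo) {
    scope = scope || {};
    if (validFrom) scope.valid_from = validFrom;
    if (validTo) scope.valid_to = validTo;
  }
  return { scope };
}

// Usage: a manually typed address combined with a start date.
console.log(JSON.stringify(buildShareScope({
  scopeType: 'user', rawEmail: ' Teacher@Example.org ', validFrom: '2024-01-01',
})));
```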
function copyShareLink() {
const url = document.getElementById('shareNewLinkUrl').value;
_copyText(url, document.getElementById('shareCopyBtn'));
}
async function copyTokenLink(token, btn) {
const url = (await _getShareBaseUrl()) + '/view?token=' + encodeURIComponent(token);
_copyText(url, btn);
}
function _copyText(text, btn) {
const done = () => {
navigator.clipboard.writeText(text).then(() => {
const orig = btn.textContent;
btn.textContent = t('share_copied', 'Copied!');
setTimeout(() => { btn.textContent = orig; }, 1800);
};
// Fallback for HTTP contexts, where navigator.clipboard is undefined
// (the Clipboard API only exists in secure contexts — HTTPS or localhost).
const fallback = () => {
let ok = false;
}).catch(() => {
// Fallback for HTTP contexts
try {
const ta = document.createElement('textarea');
ta.value = text;
ta.style.position = 'fixed'; ta.style.opacity = '0';
ta.setAttribute('readonly', '');
document.body.appendChild(ta);
ta.focus();
ta.select();
ok = document.execCommand('copy');
document.execCommand('copy');
document.body.removeChild(ta);
} catch(_) { ok = false; }
if (ok) done();
// Last resort: show the link in a prompt so it can be copied manually.
else prompt(t('share_copy_link_prompt', 'Copy link:'), text);
};
if (navigator.clipboard && navigator.clipboard.writeText) {
navigator.clipboard.writeText(text).then(done).catch(fallback);
} else {
fallback();
}
const orig = btn.textContent;
btn.textContent = t('share_copied', 'Copied!');
setTimeout(() => { btn.textContent = orig; }, 1800);
} catch(_) {}
});
}
async function revokeToken(token, rowEl) {
@ -331,6 +137,12 @@ async function revokeToken(token, rowEl) {
if (!list.children.length) {
list.innerHTML = '<div style="font-size:12px;color:var(--muted);padding:4px 0">' + t('share_no_links', 'No active links.') + '</div>';
}
// Hide the copy row if the just-revoked token was the last created
const newRow = document.getElementById('shareNewLinkRow');
if (newRow) {
const shownUrl = document.getElementById('shareNewLinkUrl')?.value || '';
if (shownUrl.includes(token)) newRow.style.display = 'none';
}
} catch(e) {
alert(t('share_revoke_error', 'Failed to revoke:') + ' ' + e.message);
}
@ -415,96 +227,13 @@ async function stClearViewerPin() {
}
}
// ── Interface PIN — Settings UI ───────────────────────────────────────────────
async function stLoadInterfacePinStatus() {
try {
const r = await fetch('/api/interface/pin');
const d = await r.json();
const statusEl = document.getElementById('stInterfacePinStatus');
const currentRow = document.getElementById('stInterfaceCurrentPinRow');
const clearBtn = document.getElementById('stInterfacePinClearBtn');
if (d.pin_set) {
if (statusEl) statusEl.textContent = '\u2714 ' + t('interface_pin_is_set', 'Interface PIN is set');
if (currentRow) currentRow.style.display = '';
if (clearBtn) clearBtn.style.display = '';
} else {
if (statusEl) statusEl.textContent = t('interface_pin_not_set_msg', 'No PIN set \u2014 interface is open to anyone on the network');
if (currentRow) currentRow.style.display = 'none';
if (clearBtn) clearBtn.style.display = 'none';
}
} catch(e) {}
}
async function stSaveInterfacePin() {
const newPin = (document.getElementById('stInterfaceNewPin')?.value || '').trim();
const currentPin = (document.getElementById('stInterfaceCurrentPin')?.value || '').trim();
const st = document.getElementById('stInterfacePinSaveStatus');
if (!newPin) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('m365_settings_pin_required', 'PIN is required.'); }
return;
}
if (!/^\d{4,8}$/.test(newPin)) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('viewer_pin_format', 'PIN must be 4\u20138 digits.'); }
return;
}
if (st) { st.style.color = 'var(--muted)'; st.textContent = t('viewer_pin_saving', 'Saving\u2026'); }
try {
const r = await fetch('/api/interface/pin', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({pin: newPin, current_pin: currentPin})
});
const d = await r.json();
if (!r.ok) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = d.error || 'Error.'; }
return;
}
if (st) { st.style.color = 'var(--accent)'; st.textContent = '\u2714 ' + t('interface_pin_saved', 'PIN saved'); }
if (document.getElementById('stInterfaceNewPin')) document.getElementById('stInterfaceNewPin').value = '';
if (document.getElementById('stInterfaceCurrentPin')) document.getElementById('stInterfaceCurrentPin').value = '';
stLoadInterfacePinStatus();
} catch(e) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = e.message; }
}
}
async function stClearInterfacePin() {
const currentPin = (document.getElementById('stInterfaceCurrentPin')?.value || '').trim();
const st = document.getElementById('stInterfacePinSaveStatus');
if (!currentPin) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = t('m365_settings_pin_required', 'PIN is required.'); }
document.getElementById('stInterfaceCurrentPin')?.focus();
return;
}
if (!confirm(t('interface_pin_clear_confirm', 'Remove the interface PIN? The scanner will be accessible to anyone on the network.'))) return;
try {
const r = await fetch('/api/interface/pin', {
method: 'DELETE', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({current_pin: currentPin})
});
const d = await r.json();
if (!r.ok) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = d.error || 'Error.'; }
return;
}
if (st) { st.style.color = 'var(--muted)'; st.textContent = t('interface_pin_cleared', 'PIN cleared'); }
stLoadInterfacePinStatus();
} catch(e) {
if (st) { st.style.color = 'var(--danger)'; st.textContent = e.message; }
}
}
// ── Window exports ────────────────────────────────────────────────────────────
window._shareScopeTypeChanged = _shareScopeTypeChanged;
window.openShareModal = openShareModal;
window.closeShareModal = closeShareModal;
window.createShareLink = createShareLink;
window._copyText = _copyText;
window.copyShareLink = copyShareLink;
window.copyTokenLink = copyTokenLink;
window.revokeToken = revokeToken;
window.stLoadViewerPinStatus = stLoadViewerPinStatus;
window.stSaveViewerPin = stSaveViewerPin;
window.stClearViewerPin = stClearViewerPin;
window.stLoadInterfacePinStatus = stLoadInterfacePinStatus;
window.stSaveInterfacePin = stSaveInterfacePin;
window.stClearInterfacePin = stClearInterfacePin;
window.stLoadViewerPinStatus = stLoadViewerPinStatus;
window.stSaveViewerPin = stSaveViewerPin;
window.stClearViewerPin = stClearViewerPin;
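
Illustrative sketch, not part of the compared commits: the origin check described in the `_getShareBaseUrl` comments (keep HTTPS or non-loopback origins as they are, and substitute the LAN IP only when browsing at localhost) can be isolated as a pure function. `pickShareBaseUrl` is a hypothetical name; `loc` stands in for `window.location`, `lanIp` for the value the `/api/local_ip` endpoint returns, and the port and addresses below are made-up examples.

```javascript
// Hypothetical pure version of the base-URL decision in _getShareBaseUrl.
function pickShareBaseUrl(loc, lanIp) {
  const isLoopback = ['localhost', '127.0.0.1', '[::1]'].includes(loc.hostname);
  // A reverse-proxied HTTPS host or a LAN address is already routable.
  if (loc.protocol === 'https:' || !isLoopback) return loc.origin;
  // Browsing at localhost: prefer the LAN IP so remote users can open the link.
  if (lanIp && lanIp !== '127.0.0.1') return 'http://' + lanIp + ':' + loc.port;
  return loc.origin;
}

// Usage with example values.
const local = { protocol: 'http:', hostname: 'localhost', port: '8000', origin: 'http://localhost:8000' };
console.log(pickShareBaseUrl(local, '192.168.1.20'));
```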

@ -197,7 +197,7 @@
.filter-clear:hover { border-color: var(--danger); color: var(--danger); }
/* Grid */
.grid-area { flex: 1; overflow-y: auto; overflow-anchor: none; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
.grid-area { flex: 1; overflow-y: auto; padding: 24px; min-width: 0; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
.grid-area::-webkit-scrollbar { width: 4px; }
.grid-area::-webkit-scrollbar-track { background: transparent; }
.grid-area::-webkit-scrollbar-thumb { background: var(--border); border-radius: 2px; }
@ -234,7 +234,7 @@
.preview-meta { padding: 10px 14px; border-top: 1px solid var(--border); font-size: 11px; color: var(--muted); display: flex; gap: 10px; flex-wrap: wrap; flex-shrink: 0; }
.preview-open-btn { margin-left: auto; background: var(--accent); color: #fff; border: none; border-radius: 5px; padding: 4px 10px; font-size: 11px; cursor: pointer; white-space: nowrap; }
.card.selected { outline: 2px solid var(--accent); outline-offset: 2px; }
.card { position: relative; background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; }
.card { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; overflow: hidden; cursor: pointer; transition: border-color .15s, box-shadow .15s; }
.card:hover { border-color: var(--accent); box-shadow: 0 0 0 1px var(--accent); }
.card.list-view { display: flex; align-items: center; gap: 12px; padding: 10px 14px; border-radius: 8px; }
.thumb-wrap { aspect-ratio: 7/9; overflow: hidden; background: var(--bg); }
@ -253,31 +253,6 @@
.card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-delete-btn { opacity:1; }
.card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; }
.card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-redact-btn { opacity:1; }
.card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; }
/* Per-card checkbox (select mode) */
.card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2;
display:none; accent-color:var(--accent); }
body.select-mode .card-cb { display:block; }
.card.card-selected-bulk { outline:2px solid var(--accent); outline-offset:2px; background:color-mix(in srgb, var(--accent) 8%, var(--surface)); }
body.select-mode .card { cursor:default; }
body.select-mode .card:hover { border-color:var(--accent); }
/* Disposition stats bar */
.disp-stats-bar { display:flex; align-items:center; gap:8px; padding:4px 16px;
background:var(--bg); border-bottom:1px solid var(--border);
font-size:11px; color:var(--muted); flex-shrink:0; flex-wrap:wrap; }
.disp-stat-sep { width:1px; height:10px; background:var(--border); flex-shrink:0; }
.disp-stat-warn { color:var(--danger); font-weight:600; }
.disp-stat-ok { color:var(--success); }
/* Bulk tag bar */
.bulk-tag-bar { display:flex; align-items:center; gap:8px; padding:6px 16px;
background:var(--surface); border-top:1px solid var(--border);
font-size:12px; color:var(--text); flex-shrink:0; flex-wrap:wrap; }
.bulk-tag-bar button { height:26px; padding:0 10px; border-radius:5px; font-size:12px; cursor:pointer; box-sizing:border-box; }
.bulk-delete-modal { max-width:460px; }
.bulk-criteria-row { display:flex; align-items:center; gap:8px; margin-bottom:8px; font-size:12px; }
.bulk-criteria-row label { flex:0 0 130px; color:var(--muted); }
@ -361,17 +336,17 @@
.settings-backdrop.open { display:flex; }
.settings-modal {
background:var(--surface); border:1px solid var(--border);
border-radius:10px; width:min(720px,96vw);
border-radius:10px; width:min(540px,96vw);
display:flex; flex-direction:column; overflow:hidden;
font-size:12px; color:var(--text);
}
.settings-header { padding:16px 20px 0; display:flex; align-items:center; justify-content:space-between; }
.settings-header h2 { font-size:14px; font-weight:700; margin:0; }
.settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; flex-wrap:wrap; }
.settings-tabs { display:flex; border-bottom:1px solid var(--border); padding:0 20px; margin-top:12px; }
.settings-tab {
height:36px; padding:0 14px; font-size:12px; cursor:pointer; border:none;
background:none; color:var(--muted); border-bottom:2px solid transparent;
margin-bottom:-1px; font-weight:500; white-space:nowrap;
margin-bottom:-1px; font-weight:500;
}
.settings-tab.active { color:var(--accent); border-bottom-color:var(--accent); font-weight:600; }
.settings-body { padding:16px 20px; overflow-y:auto; max-height:65vh; display:flex; flex-direction:column; gap:14px; }
@ -494,18 +469,6 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; }
.card-resolved { opacity: 0.6; }
.resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px;
color: var(--muted); border-top: 1px dashed var(--border); text-align: center; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }
.phone-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a4030; color: #7eeac0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .phone-badge { background: #d0f5ea; color: #005a3a; }
.badge-email { background: rgba(139,68,173,.2); color: #b87fd8; }
.badge-onedrive { background: rgba(0,120,212,.2); color: #5ba4e8; }
.badge-sharepoint { background: rgba(0,160,100,.2); color: #2ecc71; }
@ -644,8 +607,6 @@
body.viewer-mode .config-group { display: none !important; }
body.viewer-mode #resumeBanner { display: none !important; }
body.viewer-mode #bulkDeleteBtn { display: none !important; }
body.viewer-mode #selectModeBtn { display: none !important; }
body.viewer-mode #bulkTagBar { display: none !important; }
body.viewer-mode .card-delete-btn { display: none !important; }
body.viewer-mode #dsubDeleteBtn { display: none !important; }
body.viewer-mode #shareBtn { display: none !important; }

@ -12,8 +12,7 @@
// ── i18n ─────────────────────────────────────────────────────────────────────
var LANG = {{ lang_json | safe }};
// ── Viewer mode ───────────────────────────────────────────────────────────────
window.VIEWER_MODE = {{ 'true' if viewer_mode else 'false' }};
window.VIEWER_SCOPE = {{ viewer_scope | safe if viewer_scope is defined else '{}' }};
window.VIEWER_MODE = {{ 'true' if viewer_mode else 'false' }};
function t(key, fallback) {
return LANG[key] !== undefined ? LANG[key] : (fallback !== undefined ? fallback : key);
}
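The `t(key, fallback)` helper in this hunk resolves a UI string in three steps: the translation if the key is defined, otherwise the supplied fallback, otherwise the key itself. A minimal sketch of that behaviour follows; the `LANG` contents here are hypothetical sample data (in the template the dictionary is injected server-side via `lang_json`).

```javascript
// Hypothetical sample dictionary; in the template LANG is injected by the server.
const LANG = { m365_btn_scan: 'Scan', m365_btn_stop: '' };

// Same lookup rule as the template's t(): translation, else fallback, else the key.
function t(key, fallback) {
  return LANG[key] !== undefined ? LANG[key] : (fallback !== undefined ? fallback : key);
}

const results = [
  t('m365_btn_scan', 'Start'),   // key present: the translation wins over the fallback
  t('m365_btn_stop', 'Stop'),    // empty string is still "defined", so it is returned as-is
  t('missing_key', 'Fallback'),  // key absent: the fallback is used
  t('missing_key'),              // key absent and no fallback: the key is echoed back
];
console.log(results);
```

Note that the check is against `undefined`, not falsiness, so an empty translation suppresses the fallback text rather than triggering it.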
@ -110,7 +109,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div id="deltaStatusRow" style="display:none;font-size:10px;padding:3px 0 2px;color:var(--muted)">
<span id="deltaStatusText"></span>
<button onclick="clearDeltaTokens()" style="background:none;border:none;color:var(--danger);font-size:10px;cursor:pointer;padding:0 0 0 6px" data-i18n="m365_delta_clear">Clear tokens</button>
<span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_delta_tokens_hint">Saved change-tokens let delta scans fetch only items modified since the last scan. Clear tokens forces the next scan to be a full scan.</span></span>
</div>
<!-- Photo / biometric scan (#9) -->
@ -121,62 +119,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<label class="toggle"><input type="checkbox" id="optScanPhotos"><span class="toggle-slider"></span></label>
</div>
<!-- Skip GPS in images -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_skip_gps">Ignorer GPS i billeder</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_skip_gps_hint">Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optSkipGps"><span class="toggle-slider"></span></label>
</div>
<!-- Minimum CPR count per file -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_min_cpr">Min. CPR-antal pr. fil</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_min_cpr_hint">Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.</span></span>
</span>
<input type="number" id="optMinCpr" value="1" min="1" max="50"
style="width:46px;padding:3px 6px;font-size:11px;text-align:right">
</div>
<!-- OCR language -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_ocr_lang">OCR language</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_ocr_lang_hint">Tesseract language pack(s) used when scanning scanned PDFs and images. Must match installed language packs.</span></span>
</span>
<select id="optOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px">
<option value="dan+eng">dan+eng</option>
<option value="dan">dan</option>
<option value="eng">eng</option>
<option value="dan+eng+deu">dan+eng+deu</option>
<option value="dan+eng+swe">dan+eng+swe</option>
<option value="dan+eng+fra">dan+eng+fra</option>
</select>
</div>
<!-- CPR-only mode -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_cpr_only">CPR-only mode</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_cpr_only_hint">Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, faces, or EXIF metadata are ignored.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optCprOnly"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for email addresses -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_emails">Scan for email addresses</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_emails_hint">Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanEmails"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for phone numbers -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_phones">Scan for phone numbers</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_phones_hint">Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanPhones"><span class="toggle-slider"></span></label>
</div>
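The `optMinCpr` hint above says that files with fewer distinct CPR numbers than the threshold are not reported. A rough sketch of that rule is shown here, not the scanner's actual implementation: the pattern and the function names `distinctCprCount` and `shouldReport` are assumptions for illustration, and no date or checksum validation is attempted.

```javascript
// Assumption: CPR numbers appear as DDMMYY-SSSS or as ten consecutive digits.
const CPR_PATTERN = /\b(\d{6})-?(\d{4})\b/g;

// Count distinct CPR-like numbers, treating "010190-1234" and "0101901234" as the same.
function distinctCprCount(text) {
  const seen = new Set();
  for (const m of text.matchAll(CPR_PATTERN)) {
    seen.add(m[1] + m[2]);
  }
  return seen.size;
}

// A file is reported only when it reaches the configured minimum (hypothetical helper).
function shouldReport(text, minCpr = 1) {
  return distinctCprCount(text) >= minCpr;
}

const sample = 'Elev: 010190-1234, kopi 0101901234, forælder 020290-4321';
console.log(distinctCprCount(sample), shouldReport(sample, 2), shouldReport(sample, 3));
```

Counting distinct values rather than raw matches is what makes a threshold of 2 useful in the student scenario the hint describes: a file that repeats one student's own number still counts as one.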
<!-- Retention policy (suggestion #1) -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
@ -326,7 +268,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<!-- Topbar -->
<div class="topbar">
<span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
<button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
<button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
<button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
<!-- Profile selector (15c) -->
@ -367,17 +309,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button onclick="clearCheckpointAndScan()" style="padding:3px 10px;border-radius:5px;background:none;border:1px solid var(--border);color:var(--muted);cursor:pointer;font-size:12px" data-i18n="m365_btn_start_fresh">Start fresh</button>
</div>
<!-- History mode banner -->
<div id="historyBanner" style="display:none;align-items:center;gap:10px;padding:6px 14px;background:var(--surface);border-bottom:1px solid var(--border);font-size:12px">
<span style="font-size:11px;font-weight:600;color:var(--muted);flex-shrink:0" data-i18n="history_lbl">History</span>
<span id="historyBannerText" style="flex:1;font-size:11px;color:var(--text);overflow:hidden;text-overflow:ellipsis;white-space:nowrap"></span>
<div data-history-wrap style="position:relative;flex-shrink:0">
<button id="historyPickerBtn" type="button" onclick="openHistoryPicker()" style="height:24px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:4px;font-size:11px;cursor:pointer" data-i18n="history_btn_sessions">Sessions</button>
<div id="historyDropdown" style="display:none;position:absolute;right:0;top:calc(100% + 4px);background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;width:300px;max-height:260px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.25)"></div>
</div>
<button id="historyLatestBtn" type="button" onclick="loadHistorySession(null)" style="display:none;height:24px;padding:0 10px;background:none;border:1px solid var(--accent);color:var(--accent);border-radius:4px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="history_btn_latest">Open items</button>
</div>
<!-- Filter bar — full width, above grid + preview -->
<div class="filter-bar" id="filterBar">
<input type="text" id="filterSearch" data-i18n-placeholder="m365_filter_search" placeholder="Search…" oninput="applyFilters()">
@ -413,24 +344,14 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<option value="1" data-i18n="m365_filter_special_only">⚠ Art. 9 only</option>
<option value="photo" data-i18n="m365_filter_photo_only">📷 Photos / biometric</option>
</select>
<span id="viewerIdentityBadge" style="display:none;font-size:11px;padding:2px 8px;border-radius:10px;background:var(--muted);color:#fff;font-weight:600;white-space:nowrap;max-width:180px;overflow:hidden;text-overflow:ellipsis"></span>
<select id="filterRole" onchange="applyFilters()" style="width:120px">
<option value="" data-i18n="m365_filter_all_roles">All roles</option>
<option value="staff" data-i18n="m365_filter_staff">Ansatte</option>
<option value="student" data-i18n="m365_filter_student">Elever</option>
</select>
<button class="filter-clear" onclick="clearFilters()" data-i18n="m365_filter_clear">Ryd</button>
<div class="spacer"></div>
<button id="exportBtn" onclick="exportExcel()" style="background:none;border:1px solid var(--border);color:var(--muted)" data-i18n="m365_btn_export_excel" title="Export results as Excel">Excel</button>
<button id="exportA30Btn" onclick="exportArticle30()" style="background:none;border:1px solid var(--accent);color:var(--accent)" data-i18n="m365_btn_export_article30" title="Export GDPR Article 30 report as Word document">Art.30</button>
<button id="bulkDeleteBtn" onclick="openBulkDelete()" style="background:none;border:1px solid var(--danger);color:var(--danger)" data-i18n="m365_btn_bulk_delete" title="Bulk delete">Slet</button>
<button id="selectModeBtn" style="background:none;border:1px solid var(--border);color:var(--muted)" onclick="toggleSelectMode()" data-i18n="bulk_select_mode">Vælg</button>
<button id="listViewBtn" style="background:none;border:1px solid var(--border);color:var(--muted)" onclick="toggleView()" data-i18n="m365_btn_list_view">Liste</button>
</div>
<!-- Disposition stats bar -->
<div id="dispStats" class="disp-stats-bar" style="display:none"></div>
<!-- Content area: grid + preview panel -->
<div class="content-area">
<div style="flex:1; display:flex; flex-direction:column; overflow:hidden; min-width:220px">
@ -443,24 +364,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
<div id="lastScanSummary" style="display:none" class="empty-state last-scan-summary"></div>
<div class="grid" id="grid" style="display:none"></div>
<!-- Bulk disposition tag bar (visible only in select mode with items selected) -->
<div id="bulkTagBar" class="bulk-tag-bar" style="display:none">
<span id="bulkTagCount" style="font-weight:600;white-space:nowrap"></span>
<button id="bulkSelectAll" type="button" onclick="selectAllVisible()" data-i18n="bulk_select_all">Vælg alle synlige</button>
<select id="bulkDispSelect" style="height:26px;font-size:12px;padding:0 8px;flex:0 0 auto">
<option value="retain-legal" data-i18n="m365_disp_retain_legal">Behold — juridisk</option>
<option value="retain-legitimate" data-i18n="m365_disp_retain_legit">Behold — legitimt</option>
<option value="retain-contract" data-i18n="m365_disp_retain_contract">Behold — kontrakt</option>
<option value="delete-scheduled" data-i18n="m365_disp_delete_sched">Slet — planlagt</option>
<option value="deleted" data-i18n="m365_disp_deleted">Slettet</option>
<option value="personal-use" data-i18n="m365_disp_personal_use">Personlig brug</option>
<option value="unreviewed" data-i18n="m365_disp_unreviewed">Ikke gennemgået</option>
</select>
<button id="bulkTagApplyBtn" type="button" onclick="applyBulkDisposition()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:5px;font-size:12px;cursor:pointer;font-weight:600" data-i18n="bulk_apply">Anvend</button>
<span id="bulkTagStatus" style="font-size:11px;color:var(--accent)"></span>
<button type="button" onclick="toggleSelectMode()" style="margin-left:auto;background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 10px;border-radius:5px;font-size:12px;cursor:pointer" data-i18n="bulk_done">Afslut</button>
</div>
</div>
<!-- Progress bar -->
@ -502,8 +405,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<iframe id="previewFrame" sandbox="allow-scripts allow-same-origin allow-forms allow-popups" style="display:none"></iframe>
</div>
<div class="preview-meta" id="previewMeta"></div>
<!-- Related documents -->
<div id="previewRelated" style="display:none;padding:8px 14px 4px;border-top:1px solid var(--border)"></div>
<!-- Disposition widget (#6) -->
<div class="disposition-row" id="dispositionRow" style="display:none">
<span class="disposition-label" data-i18n="m365_disposition_label">Disposition</span>
@ -616,8 +517,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button class="settings-tab" id="stTabScheduler" onclick="switchSettingsTab('scheduler')" data-i18n="m365_settings_tab_scheduler">Scheduler</button>
<button class="settings-tab" id="stTabEmail" onclick="switchSettingsTab('email')" data-i18n="m365_settings_tab_email">Email report</button>
<button class="settings-tab" id="stTabDatabase" onclick="switchSettingsTab('database')" data-i18n="m365_settings_tab_database">Database</button>
<button class="settings-tab" id="stTabAuditlog" onclick="switchSettingsTab('auditlog')" data-i18n="m365_settings_tab_auditlog">Audit Log</button>
<button class="settings-tab" id="stTabAi" onclick="switchSettingsTab('ai')" data-i18n="m365_settings_tab_ai">AI / NER</button>
</div>
<div class="settings-body">
@ -642,19 +541,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="settings-about-row"><span>Requests</span><span id="st-about-requests" style="color:var(--muted)"></span></div>
<div class="settings-about-row"><span>openpyxl</span><span id="st-about-openpyxl" style="color:var(--muted)"></span></div>
</div>
<div class="settings-group" id="stUpdateGroup" style="display:none">
<div class="settings-group-title" data-i18n="m365_settings_updates">Software update</div>
<div id="stUpdateStatus" style="font-size:11px;color:var(--muted);margin-bottom:8px" data-i18n="m365_update_idle">Check whether a newer version is available.</div>
<div id="stUpdateCommits" style="display:none;font-size:11px;color:var(--muted);font-family:monospace;line-height:1.6;background:var(--bg);border:1px solid var(--border);border-radius:6px;padding:6px 10px;margin-bottom:8px;max-height:120px;overflow-y:auto"></div>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:10px">
<label class="toggle" style="flex:unset"><input type="checkbox" id="stAutoUpdate" onchange="stSaveAutoUpdate()"><span class="toggle-slider"></span></label>
<span style="font-size:12px" data-i18n="m365_update_auto">Install updates automatically (checked daily — the app restarts itself)</span>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px">
<button type="button" onclick="stCheckUpdate()" id="stCheckUpdateBtn" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="m365_update_check">Check for updates</button>
<button type="button" onclick="stApplyUpdate()" id="stApplyUpdateBtn" style="display:none;height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_update_install">Install update</button>
</div>
</div>
</div>
<!-- ── Security pane ─────────────────────────────────────────────────── -->
@ -698,24 +584,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<button type="button" onclick="stSaveViewerPin()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_settings_save_pin">Save PIN</button>
</div>
</div>
<div class="settings-group">
<div class="settings-group-title" data-i18n="interface_pin_group_title">Interface PIN</div>
<div style="font-size:10px;color:var(--muted);line-height:1.5;margin-bottom:4px" data-i18n="interface_pin_desc">A numeric PIN (4–8 digits) that must be entered before accessing the main scanner interface. Viewers accessing <code style="font-size:10px">/view</code> are not affected.</div>
<div id="stInterfacePinStatus" style="font-size:10px;color:var(--muted);margin-bottom:6px"></div>
<div class="settings-row" id="stInterfaceCurrentPinRow" style="display:none">
<label data-i18n="m365_settings_current_pin">Current PIN</label>
<input id="stInterfaceCurrentPin" type="password" autocomplete="off" placeholder="••••">
</div>
<div class="settings-row">
<label data-i18n="m365_settings_new_pin">New PIN</label>
<input id="stInterfaceNewPin" type="password" inputmode="numeric" maxlength="8" autocomplete="off" placeholder="4–8 digits">
</div>
<div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px">
<div id="stInterfacePinSaveStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div>
<button type="button" onclick="stClearInterfacePin()" id="stInterfacePinClearBtn" style="display:none;background:none;border:1px solid var(--danger);color:var(--danger);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="interface_pin_clear">Clear PIN</button>
<button type="button" onclick="stSaveInterfacePin()" style="background:var(--accent);color:#fff;border:none;height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;font-weight:600;box-sizing:border-box" data-i18n="m365_settings_save_pin">Save PIN</button>
</div>
</div>
</div>
<!-- ── Scheduler pane (#19) ──────────────────────────────────────────── -->
@ -772,19 +640,12 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<input id="schedMinute" type="number" min="0" max="59" value="0" style="width:50px">
</div>
</div>
<div class="settings-row" id="schedProfileRow">
<div class="settings-row">
<label data-i18n="m365_sched_profile">Profile</label>
<select id="schedProfile" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:5px;background:var(--surface);color:var(--text);font-size:12px;box-sizing:border-box">
<option value="" data-i18n="m365_sched_profile_last">Last saved settings</option>
</select>
</div>
<div class="settings-row">
<label data-i18n="m365_sched_report_only">Report only</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedReportOnly" onchange="schedToggleReportOnly()"><span class="toggle-slider"></span></label>
</div>
<div class="settings-row" id="schedReportOnlyHint" style="display:none">
<span style="font-size:10px;color:var(--muted);line-height:1.4" data-i18n="m365_sched_report_only_hint">Email the latest scan results without running a new scan. Requires scan results in the database.</span>
</div>
<div class="settings-row">
<label data-i18n="m365_sched_auto_email">Email report automatically</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="schedAutoEmail"><span class="toggle-slider"></span></label>
@ -841,14 +702,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<label data-i18n="m365_smtp_recipients">Recipients</label>
<input id="st-smtpTo" type="text" placeholder="a@school.dk, b@school.dk">
</div>
<div class="settings-row">
<label data-i18n="m365_smtp_auto_email_manual">Email report after manual scan</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpAutoEmail"><span class="toggle-slider"></span></label>
</div>
<div class="settings-row">
<label data-i18n="m365_smtp_prefer_smtp">Always send via SMTP (skip Microsoft Graph)</label>
<label class="toggle" style="flex:unset"><input type="checkbox" id="st-smtpPreferSmtp"><span class="toggle-slider"></span></label>
</div>
<div style="display:flex;justify-content:flex-end;gap:8px;margin-top:4px">
<div id="st-smtpStatus" style="flex:1;font-size:11px;color:var(--muted);align-self:center"></div>
<button onclick="stSmtpSave()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 12px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_save">Save</button>
@ -876,56 +729,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
</div>
<!-- ── Audit Log pane ─────────────────────────────────────────────────── -->
<div class="settings-pane" id="stPaneAuditlog">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_audit_title">Compliance Audit Log</div>
<div style="overflow-x:auto">
<table id="stAuditTable" style="width:100%;border-collapse:collapse;font-size:12px">
<thead>
<tr style="text-align:left">
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_time">Time</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_action">Action</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_detail">Detail</th>
<th style="padding:4px 8px;border-bottom:1px solid var(--border);color:var(--muted);font-weight:500" data-i18n="m365_audit_col_ip">IP</th>
</tr>
</thead>
<tbody id="stAuditTableBody">
<tr><td colspan="4" style="padding:8px;color:var(--muted)" data-i18n="m365_audit_loading">Loading…</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="settings-pane" id="stPaneAi">
<div class="settings-group">
<div class="settings-group-title" data-i18n="m365_ai_title">AI-Enhanced NER</div>
<p style="margin:0 0 12px;font-size:12px;color:var(--muted)" data-i18n="m365_ai_desc">Use Claude AI instead of spaCy for name, address, and organisation detection. Significantly more accurate on Danish text — especially hyphenated surnames and foreign-origin names. Requires an Anthropic API key; charged per token.</p>
<div style="display:flex;align-items:center;gap:10px;margin-bottom:14px">
<label class="toggle" style="flex-shrink:0">
<input type="checkbox" id="aiEnabled">
<span class="toggle-track"></span>
</label>
<span style="font-size:13px" data-i18n="m365_ai_enable">Enable Claude NER</span>
</div>
<div style="margin-bottom:12px">
<label style="font-size:12px;color:var(--muted);display:block;margin-bottom:4px" data-i18n="m365_ai_api_key_label">Anthropic API key</label>
<div style="display:flex;gap:6px">
<input type="password" id="aiApiKey" placeholder="sk-ant-…" autocomplete="off" style="flex:1;height:26px;padding:0 8px;border:1px solid var(--border);border-radius:6px;background:var(--bg);color:var(--text);font-size:12px;box-sizing:border-box">
<button type="button" onclick="stAiToggleKey()" id="aiShowKeyBtn" style="height:26px;padding:0 10px;border:1px solid var(--border);background:none;color:var(--muted);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_show_key">Show</button>
</div>
<span id="aiKeyStatus" style="font-size:11px;color:var(--muted);margin-top:4px;display:block"></span>
</div>
<div style="display:flex;gap:8px;align-items:center;flex-wrap:wrap">
<button type="button" onclick="stAiSave()" style="height:26px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_save">Save</button>
<button type="button" onclick="stAiTest()" style="height:26px;padding:0 14px;background:none;border:1px solid var(--border);color:var(--text);border-radius:6px;font-size:12px;cursor:pointer" data-i18n="m365_ai_test">Test key</button>
<span id="aiStatus" style="font-size:12px"></span>
</div>
<p style="margin:14px 0 0;font-size:11px;color:var(--muted)" data-i18n="m365_ai_model_note">Model: claude-haiku-4-5 · billed at Anthropic token rates · results cached per document.</p>
</div>
</div>
</div><!-- /.settings-body -->
<div class="settings-footer">
<button onclick="closeSettings()" style="background:none;border:1px solid var(--border);color:var(--muted);height:26px;padding:0 14px;border-radius:6px;font-size:12px;cursor:pointer;box-sizing:border-box" data-i18n="btn_close">Close</button>
@ -1056,36 +859,6 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_label_lbl">Label (optional)</div>
<input id="shareLabel" type="text" data-i18n-placeholder="share_label_placeholder" placeholder="e.g. DPO review 2026" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
<div style="width:120px">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_lbl">Scope</div>
<select id="shareScopeType" onchange="_shareScopeTypeChanged()" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<option value="" data-i18n="share_scope_all">All</option>
<option value="role" data-i18n="share_scope_type_role">Role</option>
<option value="user" data-i18n="share_scope_type_user">User</option>
</select>
</div>
<div id="shareScopeRoleWrap" style="width:110px;display:none">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_role_lbl">Role</div>
<select id="shareScope" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<option value="staff" data-i18n="share_scope_staff">Staff</option>
<option value="student" data-i18n="share_scope_student">Students</option>
</select>
</div>
<div id="shareScopeUserWrap" style="flex:1.5;min-width:140px;display:none;position:relative">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_scope_user_lbl">User email</div>
<input id="shareScopeUser" type="text" autocomplete="off" data-i18n-placeholder="share_scope_user_placeholder" placeholder="alice@school.dk" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 8px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
<div id="shareScopeUserDropdown" style="display:none;position:absolute;top:100%;left:0;right:0;margin-top:2px;background:var(--surface);border:1px solid var(--border);border-radius:6px;z-index:9999;max-height:220px;overflow-y:auto;box-shadow:0 4px 12px rgba(0,0,0,.3)"></div>
</div>
<div style="display:flex;gap:6px;flex:1.5;min-width:200px">
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_from">Items from</div>
<input id="shareValidFrom" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
<div style="flex:1">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_date_to">Items until</div>
<input id="shareValidTo" type="date" style="width:100%;box-sizing:border-box;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
</div>
</div>
<div style="width:100px">
<div style="font-size:11px;color:var(--muted);margin-bottom:3px" data-i18n="share_expires_in">Expires in</div>
<select id="shareExpiry" style="width:100%;font-size:12px;padding:5px 6px;background:var(--surface);border:1px solid var(--border);border-radius:5px;color:var(--text)">
@ -1098,6 +871,13 @@ document.addEventListener('DOMContentLoaded', applyI18n);
</div>
<button onclick="createShareLink()" style="height:30px;padding:0 14px;background:var(--accent);color:#fff;border:none;border-radius:5px;font-size:12px;cursor:pointer;flex-shrink:0" data-i18n="share_create">Create</button>
</div>
<div id="shareNewLinkRow" style="display:none;margin-top:10px">
<div style="font-size:11px;color:var(--muted);margin-bottom:4px" data-i18n="share_copy_link_prompt">Copy link:</div>
<div style="display:flex;gap:6px;align-items:center">
<input id="shareNewLinkUrl" type="text" readonly style="flex:1;font-size:11px;padding:5px 8px;background:var(--bg2,var(--bg));border:1px solid var(--border);border-radius:5px;color:var(--text);min-width:0">
<button onclick="copyShareLink()" id="shareCopyBtn" style="height:26px;padding:0 10px;background:none;border:1px solid var(--border);color:var(--muted);border-radius:5px;font-size:11px;cursor:pointer;flex-shrink:0" data-i18n="log_copy">Copy</button>
</div>
</div>
</div>
<!-- Existing tokens -->
@ -1340,93 +1120,30 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="srcmgmt-group">
<div class="srcmgmt-group-title" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form" style="border-color:var(--border)">
<!-- Source type selector -->
<div class="fsrc-form-row">
<label>Type</label>
<div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcTypeLocal" onclick="srcFileTypeSelect('local')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_type_local">Local folder</button>
<button type="button" id="srcTypeSmb" onclick="srcFileTypeSelect('smb')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_smb">Network (SMB)</button>
<button type="button" id="srcTypeSftp" onclick="srcFileTypeSelect('sftp')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_type_sftp">SFTP</button>
</div>
<label>Name <span style="color:var(--accent)">*</span></label>
<input id="srcFileLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<input type="hidden" id="srcFileSourceType" value="local">
<div class="fsrc-form-row">
<label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="srcFileLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<!-- Local / SMB path field -->
<div id="srcFilePathRow" class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label>
<input id="srcFilePath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()">
<input id="srcFilePath" type="text" placeholder="~/Documents or //nas/shares" oninput="srcFileDetectSmb(); srcFileAutoName()">
</div>
<div id="srcFileSmbFields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent)" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="srcFileSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
<input id="srcFileSmbHost" type="text" placeholder="nas.school.dk">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="srcFileSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_placeholder" placeholder="DOMAIN\\username">
<input id="srcFileSmbUser" type="text" placeholder="DOMAIN\\username">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="srcFileSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
<input id="srcFileSmbPw" type="password" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Saved to OS keychain — never stored in a file.</div>
</div>
<!-- SFTP fields -->
<div id="srcFileSftpFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_host">SFTP host</label>
<input id="srcFileSftpHost" type="text" data-i18n-placeholder="m365_fsrc_sftp_host_placeholder" placeholder="sftp.school.dk" oninput="srcFileAutoNameSftp()">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_port">Port</label>
<input id="srcFileSftpPort" type="number" value="22" min="1" max="65535" style="width:70px">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_user">Username</label>
<input id="srcFileSftpUser" type="text" data-i18n-placeholder="m365_fsrc_sftp_user_placeholder" placeholder="backup_user">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_remote_path">Remote path</label>
<input id="srcFileSftpPath" type="text" data-i18n-placeholder="m365_fsrc_sftp_path_placeholder" placeholder="/var/data" value="/">
</div>
<!-- Auth type toggle -->
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_auth">Auth</label>
<div style="display:flex;background:var(--bg);border:1px solid var(--border);border-radius:6px;overflow:hidden">
<button type="button" id="srcSftpAuthPw" onclick="srcFileSftpAuthSelect('password')" style="flex:1;border:none;padding:3px 8px;font-size:11px;cursor:pointer;background:var(--accent);color:#fff" data-i18n="m365_fsrc_sftp_auth_password">Password</button>
<button type="button" id="srcSftpAuthKey" onclick="srcFileSftpAuthSelect('key')" style="flex:1;border:none;border-left:1px solid var(--border);padding:3px 8px;font-size:11px;cursor:pointer;background:none;color:var(--muted)" data-i18n="m365_fsrc_sftp_auth_key">SSH key</button>
</div>
</div>
<input type="hidden" id="srcFileSftpAuth" value="password">
<!-- Password auth -->
<div id="srcSftpPwFields">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_pw">Password</label>
<input id="srcFileSftpPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div>
<!-- Key auth -->
<div id="srcSftpKeyFields" style="display:none;flex-direction:column;gap:6px">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_key_upload">Private key</label>
<div style="display:flex;gap:6px;align-items:center">
<input id="srcFileSftpKeyFile" type="file" accept=".pem,.key,.pub,*" style="flex:1;font-size:11px">
<span id="srcFileSftpKeyStatus" style="font-size:10px;color:var(--muted)"></span>
</div>
</div>
<input type="hidden" id="srcFileSftpKeyPath" value="">
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_sftp_passphrase">Passphrase</label>
<input id="srcFileSftpPassphrase" type="password" data-i18n-placeholder="m365_fsrc_sftp_passphrase_placeholder" placeholder="Leave blank if key has no passphrase">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_sftp_passphrase_hint">Passphrase is saved to the OS keychain — never stored in a file.</div>
</div>
</div>
<div style="display:flex;align-items:center;gap:8px">
<input type="hidden" id="srcFileEditId" value="">
<div id="srcFileStatus" style="flex:1;font-size:11px;color:var(--muted)"></div>
@ -1457,26 +1174,26 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<div class="fsrc-form" id="fsrcForm">
<div style="font-size:11px;font-weight:600;color:var(--text)" data-i18n="m365_file_sources_add">Add source</div>
<div class="fsrc-form-row">
<label><span data-i18n="m365_fsrc_name">Name</span> <span style="color:var(--accent)">*</span></label>
<input id="fsrcLabel" type="text" data-i18n-placeholder="m365_fsrc_name_placeholder" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
<label data-i18n="m365_fsrc_label">Name <span style="color:var(--accent)">*</span></label>
<input id="fsrcLabel" type="text" placeholder="e.g. Teacher files, NAS archive" maxlength="80" autocomplete="off">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_path">Path</label>
<input id="fsrcPath" type="text" data-i18n-placeholder="m365_fsrc_path_placeholder" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()">
<input id="fsrcPath" type="text" placeholder="~/Documents or //nas/shares" oninput="fsrcDetectSmb(); fsrcAutoName()">
</div>
<div id="fsrcSmbFields" class="fsrc-smb-fields" style="display:none;flex-direction:column;gap:6px">
<div style="font-size:10px;color:var(--accent);margin:-2px 0 2px" data-i18n="m365_fsrc_smb_detected">SMB/CIFS network share detected</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_host">SMB host</label>
<input id="fsrcSmbHost" type="text" data-i18n-placeholder="m365_fsrc_smb_host_placeholder" placeholder="nas.school.dk">
<input id="fsrcSmbHost" type="text" placeholder="nas.school.dk">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_user">Username</label>
<input id="fsrcSmbUser" type="text" data-i18n-placeholder="m365_fsrc_smb_user_edit_placeholder" placeholder="DOMAIN\\username or username">
<input id="fsrcSmbUser" type="text" placeholder="DOMAIN\\username or username">
</div>
<div class="fsrc-form-row">
<label data-i18n="m365_fsrc_smb_pw">Password</label>
<input id="fsrcSmbPw" type="password" data-i18n-placeholder="m365_fsrc_pw_keychain_placeholder" placeholder="Stored in OS keychain">
<input id="fsrcSmbPw" type="password" placeholder="Stored in OS keychain">
</div>
<div style="font-size:10px;color:var(--muted)" data-i18n="m365_fsrc_smb_pw_hint">Password is saved to the OS keychain — never stored in a file.</div>
</div>
@ -1535,7 +1252,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<option value="replace" data-i18n="m365_db_import_replace">Replace (full restore)</option>
</select>
</div>
<div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdprscanner/scanner.db first.</div>
<div id="importDbReplaceWarn" style="display:none;background:#7c1a0060;border:1px solid var(--danger);border-radius:6px;padding:8px 10px;font-size:11px;color:#ff7070;line-height:1.5" data-i18n="m365_db_import_replace_warn">⚠ Replace mode will erase all existing scan data before restoring. Make sure you have a backup of ~/.gdpr_scanner.db first.</div>
<div id="importDbStatus" style="min-height:16px;font-size:11px;color:var(--muted)"></div>
<div style="display:flex;justify-content:flex-end;gap:8px;padding-top:4px;border-top:1px solid var(--border)">
<button onclick="closeImportDBModal()" style="background:none;border:1px solid var(--border);color:var(--muted);padding:5px 14px;border-radius:6px;font-size:12px;cursor:pointer" data-i18n="btn_close">Close</button>
@ -1555,6 +1272,5 @@ document.addEventListener('DOMContentLoaded', applyI18n);
<script type="module" src="/static/js/scheduler.js"></script>
<script type="module" src="/static/js/connector.js"></script>
<script type="module" src="/static/js/viewer.js"></script>
<script type="module" src="/static/js/history.js"></script>
</body>
</html>

@ -1,86 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>GDPRScanner — {{ LANG.get('interface_pin_login_btn', 'Sign in') }}</title>
<link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
<style>
body { display: flex; align-items: center; justify-content: center; min-height: 100vh; margin: 0; }
.pin-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 32px 40px;
width: min(340px, 92vw);
box-sizing: border-box;
}
.pin-card h1 { font-size: 15px; font-weight: 600; margin: 0 0 6px; color: var(--text); }
.pin-card p { font-size: 12px; color: var(--muted); margin: 0 0 18px; }
.pin-input {
width: 100%; box-sizing: border-box;
font-size: 22px; letter-spacing: .3em; text-align: center;
padding: 10px 12px; border-radius: 6px;
border: 1px solid var(--border); background: var(--bg);
color: var(--text); outline: none; margin-bottom: 12px;
}
.pin-input:focus { border-color: var(--accent); }
.pin-btn {
width: 100%; padding: 10px; border: none; border-radius: 6px;
background: var(--accent); color: #fff; font-size: 13px;
font-weight: 600; cursor: pointer; font-family: var(--sans);
}
.pin-btn:disabled { opacity: .5; cursor: default; }
.pin-error { font-size: 12px; color: var(--danger); margin-top: 8px; min-height: 16px; text-align: center; }
</style>
</head>
<body data-theme="dark">
<div class="pin-card">
<h1>GDPRScanner</h1>
<p>{{ LANG.get('interface_pin_login_desc', 'Enter the interface PIN to continue.') }}</p>
<input id="pinInput" class="pin-input" type="password" inputmode="numeric"
maxlength="8" placeholder="••••" autocomplete="off"
onkeydown="if(event.key==='Enter')submitPin()">
<button class="pin-btn" id="pinBtn" onclick="submitPin()">{{ LANG.get('interface_pin_login_btn', 'Continue') }}</button>
<div class="pin-error" id="pinError"></div>
</div>
<script>
const _L = {
incorrect: {{ LANG.get('interface_pin_err_incorrect', 'Incorrect PIN.') | tojson }},
tooMany: {{ LANG.get('interface_pin_err_too_many', 'Too many attempts. Try again later.') | tojson }},
network: {{ LANG.get('interface_pin_err_network', 'Network error. Please try again.') | tojson }}
};
async function submitPin() {
const pin = document.getElementById('pinInput').value.trim();
if (!pin) return;
const btn = document.getElementById('pinBtn');
const err = document.getElementById('pinError');
btn.disabled = true;
err.textContent = '';
try {
const r = await fetch('/api/interface/pin/verify', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({pin})
});
if (r.ok) {
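// Only follow same-origin relative paths; an absolute or protocol-relative "next" would be an open redirect.
const raw = new URLSearchParams(window.location.search).get('next') || '/';
const next = (raw.startsWith('/') && !raw.startsWith('//') && !raw.startsWith('/\\')) ? raw : '/';
window.location.href = next;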
} else {
const d = await r.json().catch(() => ({}));
err.textContent = r.status === 429 ? (d.error || _L.tooMany) : (d.error || _L.incorrect);
if (r.status !== 429) {
document.getElementById('pinInput').value = '';
document.getElementById('pinInput').focus();
}
btn.disabled = false;
}
} catch(e) {
err.textContent = _L.network;
btn.disabled = false;
}
}
document.getElementById('pinInput').focus();
</script>
</body>
</html>
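The login page above distinguishes a wrong PIN from a 429 "too many attempts" response, but the server side of `/api/interface/pin/verify` is not part of this diff. The following is only a minimal sketch of how such a check could work; the class name `PinGuard`, its parameters, and the lockout policy are all assumptions for illustration, not the project's implementation.

```python
import hmac
import time


class PinGuard:
    """Hypothetical helper: checks a PIN and locks out after repeated failures."""

    def __init__(self, pin, max_attempts=5, lockout_seconds=300, clock=time.monotonic):
        self._pin = pin.encode("utf-8")
        self._max_attempts = max_attempts
        self._lockout_seconds = lockout_seconds
        self._clock = clock          # injectable so the lockout can be tested
        self._failures = 0
        self._locked_until = 0.0

    def verify(self, candidate):
        """Return an HTTP-style status: 200 ok, 401 wrong PIN, 429 locked out."""
        now = self._clock()
        if now < self._locked_until:
            return 429
        # compare_digest avoids leaking how many leading characters matched.
        if hmac.compare_digest(candidate.encode("utf-8"), self._pin):
            self._failures = 0
            return 200
        self._failures += 1
        if self._failures >= self._max_attempts:
            # Start a lockout window and reset the counter for the next window.
            self._locked_until = now + self._lockout_seconds
            self._failures = 0
        return 401
```

A real endpoint would probably track failures per client address rather than globally; the single shared counter here just keeps the sketch short.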

@ -1,19 +0,0 @@
Personoplysninger — Elevakt
===========================
Elevens navn: Lars Bjerregaard Nielsen
Klasse: 8B
Skole: Gudenaaskolen
CPR-nummer: 010172-1019
Fødselsdato: 1. januar 1972
Adresse: Skolevej 14, 8680 Ry
Telefon: +45 86 89 12 34
E-mail: lars.nielsen@privat.dk
Notater:
Eleven har haft fravær i uge 12 og 14. Forældrene er kontaktet.
Der er afholdt møde den 3. marts 2024 med klasselærer og skoleleder.
Underskrift: _______________________
Dato: ___________________

@ -1,15 +0,0 @@
Besøgslog — Sundhedscenter Skanderborg
=======================================
Dato: 28. april 2024
Sagsbehandler: M. Andersen
Borger: Hanne Kirstine Pedersen
Registreringsnummer: 280490-0120
Henvendelse vedrørende: Sygedagpenge, paragraf 7 opfølgning
Samtalen fandt sted kl. 10:15 og varede 45 minutter.
Borger mødte op til tiden og var forberedt.
Aftale om næste møde: 26. maj 2024 kl. 10:00
Sted: Mødelokale 3, Adelgade 44, 8660 Skanderborg

@ -1,24 +0,0 @@
Tilmelding til SFO — Gudenaaskolen
===================================
Barnets navn: Emma Sofie Christensen
Personnummer: 150315-4321
Klasse: 1A (skolestart august 2022)
Forældrenes oplysninger
-----------------------
Forældrenes navn: Søren og Pia Christensen
Adresse: Birkevej 7, 8680 Ry
Telefon: +45 23 45 67 89
E-mail: soeren.christensen@familie.dk
Fremmødetider valgt:
Morgen-SFO: 07:00–08:00
Eftermiddag: 13:00–17:00
Særlige oplysninger til pædagoger:
Emma har en lettere nøddeallergi (jordnødder og cashewnødder).
Kontaktperson ved allergi: Pia Christensen, tlf. 23 45 67 89
Dato for tilmelding: 15. marts 2022
Underskrift: _______________________

@ -1,31 +0,0 @@
Personalemappe — Fortroligt
============================
Afdeling: Administrationen, Skanderborg Kommune
Medarbejder 1
-------------
Navn: Christian Bøgh Hansen
CPR: 150365-1102
Stilling: Skoleleder
Ansættelsesdato: 1. august 2005
Løngruppe: L4
Medarbejder 2
-------------
Navn: Lise Ravn Johansen
CPR: 020898-0203
Stilling: Pædagog, fuldtid
Ansættelsesdato: 15. september 2021
Løngruppe: L2
Medarbejder 3
-------------
Navn: Anders Munk Mortensen
CPR: 010172-1019
Stilling: Administrativ medarbejder
Ansættelsesdato: 1. marts 2010
Løngruppe: L3
Dokument oprettet: 20. april 2026
Sidst opdateret: 20. april 2026
Udarbejdet af: HR-afdelingen

@ -1,9 +0,0 @@
Klasse,Navn,CPR-nummer,Adresse,Forælder tlf,Bemærkninger
7A,Magnus Lund Eriksen,010172-1019,Egevej 3 8680 Ry,+45 40 12 34 56,
7A,Nora Bjerrum Nielsen,280490-0120,Møllevej 11 8680 Ry,+45 50 23 45 67,Brillebærer
7A,Oliver Skov Madsen,250372-0100,Kirkegade 2 8660 Skanderborg,+45 60 34 56 78,
7A,Ida Holst Andersen,020898-0203,Skovbrynet 19 8680 Ry,+45 70 45 67 89,Kontaktperson: Far
7B,Rasmus Dal Kristensen,150365-1102,Rosenvej 5 8680 Ry,+45 21 56 78 90,
7B,Sofie Holm Thomsen,111111-1010,Birkevej 22 8660 Skanderborg,+45 31 67 89 01,Allergi: nødder
7B,Emil Sand Jensen,010107-4102,Hybenvej 7 8680 Ry,+45 41 78 90 12,
7B,Laura Bak Møller,410172-1200,Pilevej 4 8660 Skanderborg,+45 51 89 01 23,Beskyttet adresse

@ -1,6 +0,0 @@
Medarbejder-ID,Navn,Personnummer,Afdeling,Stilling,E-mail,Telefon,Ansættelses-dato
EMP-001,Christian Bøgh Hansen,150365-1102,Ledelse,Skoleleder,c.hansen@gudenaaskolen.dk,+45 86 89 10 01,2005-08-01
EMP-002,Mette Dahl Andersen,280490-0120,Administration,Sekretær,m.andersen@gudenaaskolen.dk,+45 86 89 10 02,2012-01-15
EMP-003,Søren Lykke Jakobsen,010172-1019,Pædagogik,Lærer,s.jakobsen@gudenaaskolen.dk,+45 86 89 10 03,2009-08-01
EMP-004,Hanne Frost Pedersen,250372-0100,Pædagogik,Lærer,h.pedersen@gudenaaskolen.dk,+45 86 89 10 04,2015-08-01
EMP-005,Lise Ravn Johansen,020898-0203,SFO,Pædagog,l.johansen@gudenaaskolen.dk,+45 86 89 10 05,2021-09-15

@ -1,16 +0,0 @@
Fortrolig personoplysning — Navne- og adressebeskyttelse
==========================================================
VIGTIGT: Denne person har navne- og adressebeskyttelse i CPR-registeret.
Oplysningerne må ikke videregives uden samtykke.
Navn: Laura Bak Møller
CPR-nummer: 410172-1200
(Dag + 40 angiver beskyttet adresse)
Kontaktoplysninger administreres af kommunen.
Henvendelse via: Borgerservice, Skanderborg Kommune
Telefon: 86 52 10 00
Dokumentet er klassificeret FORTROLIGT.
Opbevares i aflåst arkiv — ikke i fællesnetværk.

@ -1,21 +0,0 @@
Lægeerklæring — Helbredsattest
================================
Udstedt af: Skanderborg Lægepraksis, Adelgade 10, 8660 Skanderborg
Praktiserende læge: Dr. P. Holm
Patient: Søren Lykke Jakobsen
Fødselsdato / CPR: 010172-1019
Adresse: Skolevej 22, 8680 Ry
Telefon: +45 22 33 44 55
E-mail: soeren.jakobsen@privat.dk
Diagnose (ICD-10): F41.1 — Generaliseret angst
Behandling: Psykoterapi + medicinsk behandling (SSRI)
Særlig kategori: Psykisk lidelse — GDPR Art. 9
Erklæringens formål: Sygedagpenge, §7-opfølgning
Periode: 1. april 2026 – 30. juni 2026
Lægens underskrift: _______________________
Dato: 20. april 2026
Stempel: [Skanderborg Lægepraksis]

Binary file not shown.

@ -1,25 +0,0 @@
Mødereferat — Pædagogisk råd
==============================
Dato: 20. april 2026
Sted: Personalerummet, Gudenaaskolen
Ordstyrer: Skolelederen
Referent: Administrationen
Dagsorden:
1. Godkendelse af referat fra seneste møde
2. Orientering om skoleårets planlægning 2026/2027
3. Status på inklusion og trivselsundersøgelse
4. Eventuelt
Ad 1: Referatet fra mødet den 15. marts 2026 blev godkendt uden bemærkninger.
Ad 2: Skolelederen orienterede om planlægningen for det kommende skoleår.
Skemaerne for 0.-9. klasse offentliggøres i Aula senest 1. juni 2026.
Der er planlagt en fælles pædagogisk dag den 10. august 2026.
Ad 3: Trivselsundersøgelsen viste generelt gode resultater.
Inklusionsvejlederen præsenterer en handlingsplan på næste møde.
Ad 4: Intet til eventuelt.
Næste møde: Tirsdag den 19. maj 2026 kl. 14:00 i personalerummet.

@ -1,31 +0,0 @@
FAKTURA
=======
Leverandør: Kontor & Papir A/S
Industriparken 22, 8600 Silkeborg
CVR: 12345678
Kunde: Gudenaaskolen
Skolevej 1, 8680 Ry
EAN: 5790001234567
Fakturanr: 250372-0100
Fakturadato: 20. april 2026
Forfaldsdato: 20. maj 2026
Ordrenr: 020898-0203
Varenr: 150365-1102
Linjer:
---------------------------------------------------------------------------
Beskrivelse Antal Enhedspris Moms Total
---------------------------------------------------------------------------
Kopipapir A4 80g, pk/500 20 89,00 kr 20% 2.136,00 kr
Blækpatroner HP 305, sort 5 149,00 kr 20% 894,00 kr
Whiteboardmarker, ass. farver 3 49,95 kr 20% 179,82 kr
---------------------------------------------------------------------------
Subtotal ekskl. moms: 2.561,95 kr
Moms 25%: 640,49 kr
I alt inkl. moms: 3.202,44 kr
Betalingsbetingelser: Netto 30 dage
Bank: Jyske Bank, Reg. 7600, Konto 1234567

@ -1,20 +0,0 @@
Inventarliste — Klasselokale 7A
================================
Opdateret: 20. april 2026
Af: Teknisk servicepersonale
Rum-ID: 7A-GS-2026
Lokale: Bygning C, 1. sal
Inventar:
---------
Elevborde 32 stk (serienr. påtegnet under bordet)
Elevstole 32 stk (standard, justerbar højde)
Lærerbord 1 stk (inkl. skuff, lås medfølger)
Whiteboard 2 stk (160×120 cm)
Projektor 1 stk (Epson EB-W51, serienr. 150315-4321)
Projektordug 1 stk (180 cm, motor-betjent)
Gardinmotor 2 stk (fjernstyret)
Næste serviceeftersyn: Oktober 2026
Ansvarlig: Teknisk afdeling, Skanderborg Kommune

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@ -1,347 +0,0 @@
"""
Generate binary fixture files for the local-file GDPR scan test suite.

Run from repo root:
    source venv/bin/activate
    python tests/fixtures/local_files/generate_fixtures.py

Fixtures produced

Document fixtures (require python-docx + openpyxl):
    09_cpr_in_docx.docx        Word document with 2 CPR numbers              Flag
    13_cpr_in_xlsx.xlsx        Excel workbook with CPR numbers               Flag

Audio fixtures (require mutagen):
    14_audio_artist_pii.mp3    MP3 with artist/title tags (personal name)    Flag
    15_audio_artist_pii.flac   FLAC with artist/title Vorbis comments        Flag
    16_audio_no_pii.mp3        MP3 with no metadata tags                     No flag
    17_audio_no_pii.flac       FLAC with no metadata                         No flag

Video fixtures (require mutagen):
    18_video_gps.mp4           MP4 with GPS coordinates + artist tag         Flag
    19_video_no_pii.mp4        MP4 with no metadata tags                     No flag
"""
import struct
import tempfile
import os
from pathlib import Path
import sys
HERE = Path(__file__).parent
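def _require(pkg, pip_name=None):
    try:
        return __import__(pkg)
    except ImportError:
        # The import name and the PyPI name can differ (docx vs python-docx).
        print(f"Missing: {pkg} → pip install {pip_name or pkg}", file=sys.stderr)
        sys.exit(1)


openpyxl = _require("openpyxl")
docx = _require("docx", "python-docx")
_require("mutagen")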
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
# ── 09_cpr_in_docx.docx ───────────────────────────────────────────────────────
def make_docx():
doc = Document()
doc.add_heading("Elevjournal — Gudenaaskolen", level=1)
p = doc.add_paragraph()
p.add_run("Dette dokument indeholder personoplysninger og er fortroligt.")
p.runs[0].italic = True
doc.add_heading("Elevoplysninger", level=2)
# Use labelled paragraphs so CPR values are always preceded by ": " —
# avoids the _CPR_PREFIX_NOISE guard that fires when table-cell runs are
# concatenated without a separator.
fields = [
("Navn", "Magnus Lund Eriksen"),
("CPR-nummer", "010172-1019"),
("Klasse", "8B"),
("Adresse", "Egevej 3, 8680 Ry"),
("Telefon", "+45 40 12 34 56"),
("E-mail", "magnus.eriksen@elev.gudenaaskolen.dk"),
]
for label, value in fields:
p = doc.add_paragraph()
run_label = p.add_run(f"{label}: ")
run_label.bold = True
p.add_run(value + " ")
doc.add_heading("Forældrekontakt", level=2)
doc.add_paragraph(
"Forældrene er orienteret om elevens situation den 15. marts 2026. "
"Begge forældre deltog i mødet. Næste opfølgning er planlagt til "
"maj 2026."
)
doc.add_heading("Anden elev — tabel", level=2)
doc.add_paragraph(
"Nedenstående tabel viser en anden elev, der deler klasse med Magnus."
)
for label, value in [
("Navn", "Nora Bjerrum Nielsen"),
("Personnummer", "280490-0120"),
("Klasse", "8B"),
]:
p = doc.add_paragraph()
p.add_run(f"{label}: ").bold = True
p.add_run(value + " ")
doc.add_heading("Sagsbehandlernote", level=2)
doc.add_paragraph(
"Sagsbehandler: M. Andersen\n"
"Dato: 20. april 2026\n"
"Der er ikke fundet grundlag for yderligere foranstaltninger."
)
out = HERE / "09_cpr_in_docx.docx"
doc.save(str(out))
print(f"Written: {out.name}")
# ── 13_cpr_in_xlsx.xlsx ───────────────────────────────────────────────────────
def make_xlsx():
wb = Workbook()
# Sheet 1: Elevliste
ws1 = wb.active
ws1.title = "Elevliste"
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill("solid", fgColor="2B5F9E")
headers = ["Klasse", "Navn", "CPR-nummer", "Adresse", "Forælder tlf", "Bemærkninger"]
for col, h in enumerate(headers, 1):
cell = ws1.cell(row=1, column=col, value=h)
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
students = [
("7A", "Magnus Lund Eriksen", "010172-1019", "Egevej 3, 8680 Ry", "+45 40 12 34 56", ""),
("7A", "Nora Bjerrum Nielsen", "280490-0120", "Møllevej 11, 8680 Ry", "+45 50 23 45 67", "Brillebærer"),
("7A", "Oliver Skov Madsen", "250372-0100", "Kirkegade 2, 8660 Skanderborg", "+45 60 34 56 78", ""),
("7B", "Rasmus Dal Kristensen", "150365-1102", "Rosenvej 5, 8680 Ry", "+45 21 56 78 90", ""),
("7B", "Sofie Holm Thomsen", "111111-1010", "Birkevej 22, 8660 Skanderborg", "+45 31 67 89 01", "Allergi: nødder"),
("7B", "Emil Sand Jensen", "010107-4102", "Hybenvej 7, 8680 Ry", "+45 41 78 90 12", ""),
]
for row_i, row_data in enumerate(students, 2):
for col_i, val in enumerate(row_data, 1):
ws1.cell(row=row_i, column=col_i, value=val)
for col in ws1.columns:
max_len = max(len(str(c.value or "")) for c in col)
ws1.column_dimensions[col[0].column_letter].width = max_len + 4
# Sheet 2: Medarbejdere
ws2 = wb.create_sheet("Medarbejdere")
emp_headers = ["ID", "Navn", "Personnummer", "Afdeling", "E-mail"]
for col, h in enumerate(emp_headers, 1):
cell = ws2.cell(row=1, column=col, value=h)
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
employees = [
("EMP-001", "Christian Bøgh Hansen", "150365-1102", "Ledelse", "c.hansen@gudenaaskolen.dk"),
("EMP-002", "Mette Dahl Andersen", "280490-0120", "Administration", "m.andersen@gudenaaskolen.dk"),
("EMP-003", "Søren Lykke Jakobsen", "010172-1019", "Pædagogik", "s.jakobsen@gudenaaskolen.dk"),
]
for row_i, row_data in enumerate(employees, 2):
for col_i, val in enumerate(row_data, 1):
ws2.cell(row=row_i, column=col_i, value=val)
for col in ws2.columns:
max_len = max(len(str(c.value or "")) for c in col)
ws2.column_dimensions[col[0].column_letter].width = max_len + 4
out = HERE / "13_cpr_in_xlsx.xlsx"
wb.save(str(out))
print(f"Written: {out.name}")
# ── Audio / video helpers ─────────────────────────────────────────────────────
# Two silent MPEG1 Layer3 frames (128 kbps / 44100 Hz / stereo: channel-mode bits are 00).
# mutagen needs at least 2 consecutive frame headers to confirm sync.
# 4-byte header + 413 bytes frame body = 417 bytes × 2 = 834 bytes total.
_MPEG_FRAMES = (b'\xff\xfb\x90\x00' + b'\x00' * 413) * 2
def _flac_block_header(block_type: int, data_len: int, last: bool = False) -> bytes:
first = (0x80 if last else 0x00) | block_type
return bytes([first, (data_len >> 16) & 0xFF, (data_len >> 8) & 0xFF, data_len & 0xFF])
def _vorbis_comment_block(comments: dict) -> bytes:
vendor = b'GDPRScanner fixture'
data = struct.pack('<I', len(vendor)) + vendor
data += struct.pack('<I', len(comments))
for key, value in comments.items():
entry = f'{key}={value}'.encode('utf-8')
data += struct.pack('<I', len(entry)) + entry
return data
def _minimal_flac(comments: dict) -> bytes:
"""Return bytes for a valid minimal FLAC file with Vorbis comments."""
# STREAMINFO (34 bytes): 44100 Hz, mono, 16-bit, 0 samples, zero MD5.
si = bytearray(34)
si[0:2] = struct.pack('>H', 4096) # min block size
si[2:4] = struct.pack('>H', 4096) # max block size
# bytes 4-9: min/max frame sizes = 0 (unknown)
# Bits 80-99: sample_rate=44100 (0xAC44 in 20-bit field)
# Bits 100-102: channels-1 = 0 (mono)
# Bits 103-107: bits_per_sample-1 = 15 (16-bit)
# Bits 108-143: total_samples = 0; bytes 14-17 remain zero
si[10] = 0x0A # 0000_1010 — top 8 of 44100 in 20-bit field
si[11] = 0xC4 # 1100_0100
si[12] = 0x40 # bottom 4 of sample_rate | channels(000) | bps_msb(0)
si[13] = 0xF0 # bps remaining 4 bits (1111) | top 4 of total_samples (0)
vc = _vorbis_comment_block(comments)
return (
b'fLaC'
+ _flac_block_header(0, 34, last=not comments) # STREAMINFO
+ bytes(si)
+ (_flac_block_header(4, len(vc), last=True) + vc if comments else b'')
)
def _mp4_atom(name: bytes, data: bytes) -> bytes:
return struct.pack('>I', 8 + len(data)) + name + data
def _minimal_mp4_base() -> bytes:
    """Return bytes for the smallest valid MPEG-4 container mutagen can tag."""
    # ftyp: identifies the file as M4A
    ftyp = _mp4_atom(
        b'ftyp',
        b'M4A ' + struct.pack('>I', 0) + b'M4A ' + b'mp42' + b'isom',
    )
    # mvhd version 0: 100 bytes of content (ISO 14496-12 §8.2.2)
    mvhd = bytearray(100)
    mvhd[0:4] = b'\x00\x00\x00\x00'                      # version + flags
    # bytes 4-19: creation, modification, timescale, duration
    struct.pack_into('>IIII', mvhd, 4, 0, 0, 1000, 0)
    # rate starts at byte 20; writing it at 16 would overwrite duration
    struct.pack_into('>I', mvhd, 20, 0x00010000)         # rate = 1.0
    struct.pack_into('>H', mvhd, 24, 0x0100)             # volume = 1.0
    # bytes 26-35: reserved (10 bytes, already zero)
    struct.pack_into('>9i', mvhd, 36,                    # unity matrix, bytes 36-71
                     0x00010000, 0, 0, 0, 0x00010000, 0, 0, 0, 0x40000000)
    # bytes 72-95: pre-defined (24 bytes, already zero)
    struct.pack_into('>I', mvhd, 96, 0xFFFFFFFF)         # next_track_ID
    return ftyp + _mp4_atom(b'moov', _mp4_atom(b'mvhd', bytes(mvhd)))
def _mp4_with_tags(tags: dict) -> bytes:
"""Return bytes for a minimal MP4 with the given mutagen tag dict."""
import mutagen.mp4
tmp = tempfile.mktemp(suffix='.mp4')
try:
with open(tmp, 'wb') as fh:
fh.write(_minimal_mp4_base())
f = mutagen.mp4.MP4(tmp)
f.add_tags()
for key, value in tags.items():
f.tags[key] = [value]
f.save()
with open(tmp, 'rb') as fh:
return fh.read()
finally:
if os.path.exists(tmp):
os.unlink(tmp)
# ── 14_audio_artist_pii.mp3 ───────────────────────────────────────────────────
def make_mp3_pii():
from mutagen.easyid3 import EasyID3
tmp = tempfile.mktemp(suffix='.mp3')
try:
t = EasyID3()
t['artist'] = ['Emma Slot Henriksen']
t['title'] = ['Fortrolig optagelse — personalemøde']
t['date'] = ['2026-04-21']
t.save(tmp)
with open(tmp, 'rb') as fh:
id3_bytes = fh.read()
finally:
if os.path.exists(tmp):
os.unlink(tmp)
out = HERE / '14_audio_artist_pii.mp3'
out.write_bytes(id3_bytes + _MPEG_FRAMES)
print(f"Written: {out.name}")
# ── 15_audio_artist_pii.flac ──────────────────────────────────────────────────
def make_flac_pii():
out = HERE / '15_audio_artist_pii.flac'
out.write_bytes(_minimal_flac({
'ARTIST': 'Emma Slot Henriksen',
'TITLE': 'Fortrolig optagelse — personalemøde',
'DATE': '2026-04-21',
}))
print(f"Written: {out.name}")
# ── 16_audio_no_pii.mp3 ───────────────────────────────────────────────────────
def make_mp3_no_pii():
from mutagen.easyid3 import EasyID3
tmp = tempfile.mktemp(suffix='.mp3')
try:
EasyID3().save(tmp) # empty ID3 header, no tags
with open(tmp, 'rb') as fh:
id3_bytes = fh.read()
finally:
if os.path.exists(tmp):
os.unlink(tmp)
out = HERE / '16_audio_no_pii.mp3'
out.write_bytes(id3_bytes + _MPEG_FRAMES)
print(f"Written: {out.name}")
# ── 17_audio_no_pii.flac ──────────────────────────────────────────────────────
def make_flac_no_pii():
out = HERE / '17_audio_no_pii.flac'
out.write_bytes(_minimal_flac({})) # no Vorbis comment block
print(f"Written: {out.name}")
# ── 18_video_gps.mp4 ─────────────────────────────────────────────────────────
def make_mp4_gps():
out = HERE / '18_video_gps.mp4'
out.write_bytes(_mp4_with_tags({
'©xyz': '+55.6761+012.5683+000.000/', # Copenhagen
'©ART': 'Emma Slot Henriksen',
'©nam': 'Optagelse fra skolegården',
}))
print(f"Written: {out.name}")
# ── 19_video_no_pii.mp4 ──────────────────────────────────────────────────────
def make_mp4_no_pii():
out = HERE / '19_video_no_pii.mp4'
out.write_bytes(_minimal_mp4_base()) # no moov/udta/meta/ilst — no tags
print(f"Written: {out.name}")
if __name__ == "__main__":
make_docx()
make_xlsx()
make_mp3_pii()
make_flac_pii()
make_mp3_no_pii()
make_flac_no_pii()
make_mp4_gps()
make_mp4_no_pii()
print("Done.")
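The hand-written STREAMINFO bytes in `_minimal_flac` (`0x0A 0xC4 0x40 0xF0`) can be cross-checked by packing the same fields arithmetically. The sketch below assumes the standard FLAC layout of a 20-bit sample rate, 3-bit channel count minus one, 5-bit bit depth minus one, and a 36-bit sample count, which together fill bytes 10-17 of the block. The function name is hypothetical and not part of the generator.

```python
import struct


def pack_streaminfo_rate_fields(sample_rate, channels, bits_per_sample, total_samples=0):
    """Pack the 64-bit STREAMINFO field group that occupies bytes 10-17."""
    packed = (
        (sample_rate << 44)                # 20 bits: sample rate in Hz
        | ((channels - 1) << 41)           # 3 bits: channel count minus one
        | ((bits_per_sample - 1) << 36)    # 5 bits: bit depth minus one
        | total_samples                    # 36 bits: total samples (0 = unknown)
    )
    return struct.pack(">Q", packed)


if __name__ == "__main__":
    # 44100 Hz, mono, 16-bit, unknown length: the values used by the fixture.
    print(pack_streaminfo_rate_fields(44100, 1, 16).hex())
```

Computing the bytes this way makes it easier to produce variants (for example a stereo or 24-bit fixture) without redoing the bit arithmetic by hand.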

@ -252,36 +252,3 @@ class TestFernet:
def test_decrypt_empty_returns_empty(self):
result = app_config._decrypt_password("")
assert result == ""
class TestSmtpConfigLegacyKeys:
"""SMTP config saved by the older settings tab used `user`/`starttls`;
readers expect `username`/`use_tls`. _load_smtp_config must normalise them."""
def test_legacy_keys_normalised_on_load(self, tmp_path, monkeypatch):
import json
p = tmp_path / "smtp.json"
p.write_text(json.dumps({
"host": "smtp.gmail.com", "port": 587,
"user": "netadmin@adm.example.dk", # legacy key
"starttls": True, # legacy key
"from_addr": "netadmin@adm.example.dk",
"recipients": ["a@example.dk"],
}), encoding="utf-8")
monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
cfg = app_config._load_smtp_config()
assert cfg["username"] == "netadmin@adm.example.dk"
assert cfg["use_tls"] is True
def test_canonical_keys_take_precedence(self, tmp_path, monkeypatch):
import json
p = tmp_path / "smtp.json"
p.write_text(json.dumps({
"username": "canonical@example.dk",
"user": "legacy@example.dk",
}), encoding="utf-8")
monkeypatch.setattr(app_config, "_SMTP_CONFIG_PATH", p)
cfg = app_config._load_smtp_config()
assert cfg["username"] == "canonical@example.dk"
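The two tests above pin down the behaviour expected of `_load_smtp_config`: legacy `user`/`starttls` keys are mapped to `username`/`use_tls`, and a canonical key wins when both are present. The loader itself is not shown in this diff, so the following is only a sketch of one way to get that behaviour. The helper name `normalise_smtp_keys` and the choice to drop the legacy keys after mapping are assumptions.

```python
# Hypothetical mapping from legacy key names to the canonical ones.
_LEGACY_SMTP_KEYS = {"user": "username", "starttls": "use_tls"}


def normalise_smtp_keys(cfg):
    """Return a copy of cfg with legacy keys renamed to their canonical names."""
    out = dict(cfg)
    for legacy, canonical in _LEGACY_SMTP_KEYS.items():
        if legacy in out:
            value = out.pop(legacy)
            # setdefault keeps an existing canonical value, so it takes precedence.
            out.setdefault(canonical, value)
    return out
```

Working on a copy keeps the function free of side effects, so a caller can compare the raw and normalised dicts when deciding whether to rewrite the file.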

@ -22,8 +22,8 @@ import checkpoint
@pytest.fixture(autouse=True)
def _isolate(tmp_path, monkeypatch):
"""Redirect all disk writes to a temp dir for each test."""
monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")
monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")
_OPTS = {

@ -265,71 +265,3 @@ class TestExportImport:
tgt.import_db(str(export_path), mode="replace")
results = tgt.lookup_data_subject("290472-1234")
assert len(results) >= 1
# ─────────────────────────────────────────────────────────────────────────────
# Orphan-scan recovery (crash / kill / mid-scan restart)
# ─────────────────────────────────────────────────────────────────────────────
class TestOrphanScanRecovery:
def _start_unfinished_scan(self, db, item_id):
"""Begin a scan and save an item but never call finish_scan."""
sid = db.begin_scan({"sources": ["email"], "user_ids": []})
db.save_item(sid, _make_card(item_id=item_id))
return sid
def test_unfinished_scan_items_hidden_until_recovery(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
# Not finalised → invisible to the open-items view
assert tmp_db.get_open_items() == []
def test_recovery_finalises_and_reveals_items(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
self._start_unfinished_scan(tmp_db, "orphan-2")
recovered = tmp_db.finalize_orphan_scans()
assert recovered == 2
ids = {row["id"] for row in tmp_db.get_open_items()}
assert ids == {"orphan-1", "orphan-2"}
def test_recovery_leaves_finished_scans_untouched(self, tmp_db):
sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
tmp_db.save_item(sid, _make_card(item_id="done-1"))
tmp_db.finish_scan(sid, total_scanned=1)
before = tmp_db._connect().execute(
"SELECT finished_at FROM scans WHERE id=?", (sid,)
).fetchone()[0]
assert tmp_db.finalize_orphan_scans() == 0 # nothing to recover
after = tmp_db._connect().execute(
"SELECT finished_at FROM scans WHERE id=?", (sid,)
).fetchone()[0]
assert after == before # finished_at not rewritten
def test_recovery_is_idempotent(self, tmp_db):
self._start_unfinished_scan(tmp_db, "orphan-1")
assert tmp_db.finalize_orphan_scans() == 1
assert tmp_db.finalize_orphan_scans() == 0
# ─────────────────────────────────────────────────────────────────────────────
# account_name persistence (user/group badge data)
# ─────────────────────────────────────────────────────────────────────────────
class TestAccountNamePersistence:
def test_account_name_round_trips(self, tmp_db):
sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
tmp_db.save_item(sid, _make_card(item_id="an-1")) # account_name="Test User"
tmp_db.finish_scan(sid, total_scanned=1)
row = [r for r in tmp_db.get_open_items() if r["id"] == "an-1"][0]
assert row.get("account_name") == "Test User"
def test_account_name_column_exists(self, tmp_db):
cols = [r[1] for r in tmp_db._connect().execute(
"PRAGMA table_info(flagged_items)").fetchall()]
assert "account_name" in cols
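
Review note: `TestOrphanScanRecovery` expects `finalize_orphan_scans()` to stamp only unfinished scans, return how many it fixed, and do nothing on a second call. A minimal sketch of that behaviour with plain `sqlite3` follows. The three-column `scans` schema and the free-function form are assumptions for illustration; the real `ScanDB` method is not shown in this diff.

```python
import sqlite3
import time


def finalize_orphan_scans(conn: sqlite3.Connection) -> int:
    """Stamp finished_at on scans that never finished; return how many were fixed."""
    cur = conn.execute(
        "UPDATE scans SET finished_at = ? WHERE finished_at IS NULL",
        (time.time(),),
    )
    conn.commit()
    return cur.rowcount  # number of rows the UPDATE touched


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with two orphan scans and one finished scan."""
    # Hypothetical minimal schema; the real scans table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL)"
    )
    conn.executemany(
        "INSERT INTO scans (started_at, finished_at) VALUES (?, ?)",
        [(1.0, None), (2.0, None), (3.0, 4.0)],
    )
    conn.commit()
    return conn
```

Because the `WHERE finished_at IS NULL` filter excludes rows that already have a timestamp, the call is idempotent and never rewrites a finished scan.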

@ -1,311 +0,0 @@
"""
Route and engine tests for the Google Workspace scan module.
Covers:
- GET /api/google/scan/users: auth guard, user list, error propagation
- POST /api/google/scan/start: auth guard, concurrency lock, successful start, lock release
- POST /api/google/scan/cancel: abort signal
- _run_google_scan: no-connector broadcast, CPR hit flagging, source_type tagging
"""
from __future__ import annotations
import threading
import time
from unittest.mock import MagicMock
import pytest
# ── Fixtures ──────────────────────────────────────────────────────────────────
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
@pytest.fixture()
def mock_google_connector(monkeypatch):
from routes import state
conn = MagicMock()
conn.list_users.return_value = []
monkeypatch.setattr(state, "google_connector", conn)
return conn
@pytest.fixture(autouse=True)
def clean_google_state():
yield
from routes import state
# Probe the Google scan lock: if it is free, acquire and release it at once; a lock still held is left alone
acquired = state._google_scan_lock.acquire(blocking=False)
if acquired:
state._google_scan_lock.release()
state._google_scan_abort.clear()
# ── GET /api/google/scan/users ────────────────────────────────────────────────
class TestGoogleScanUsers:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.get("/api/google/scan/users")
assert r.status_code == 401
assert r.json["error"] == "not connected"
def test_returns_user_list(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = [
{"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"},
]
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert len(r.json["users"]) == 1
assert r.json["users"][0]["email"] == "alice@test.dk"
def test_returns_empty_list_when_no_users(self, client, mock_google_connector):
mock_google_connector.list_users.return_value = []
r = client.get("/api/google/scan/users")
assert r.status_code == 200
assert r.json["users"] == []
def test_connector_error_returns_500(self, client, mock_google_connector):
mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable")
r = client.get("/api/google/scan/users")
assert r.status_code == 500
assert "error" in r.json
# ── POST /api/google/scan/start ───────────────────────────────────────────────
class TestGoogleScanStart:
def test_not_connected_returns_401(self, client, monkeypatch):
from routes import state
monkeypatch.setattr(state, "google_connector", None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 401
assert "not connected" in r.json["error"]
def test_already_running_returns_409(self, client, mock_google_connector):
from routes import state
state._google_scan_lock.acquire()
try:
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 409
assert "already running" in r.json["error"]
finally:
state._google_scan_lock.release()
def test_starts_successfully(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert r.json["status"] == "started"
def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
state._google_scan_abort.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
client.post("/api/google/scan/start", json={})
assert not state._google_scan_abort.is_set()
def test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _fake_scan(opts):
time.sleep(0.02)
done.set()
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05) # allow finally block to run
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan completed"
state._google_scan_lock.release()
@pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning")
def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch):
import routes.google_scan
from routes import state
done = threading.Event()
def _failing_scan(opts):
done.set()
raise RuntimeError("simulated crash")
monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan)
r = client.post("/api/google/scan/start", json={})
assert r.status_code == 200
assert done.wait(timeout=3), "Scan thread did not complete in time"
time.sleep(0.05)
acquired = state._google_scan_lock.acquire(blocking=False)
assert acquired, "Lock was not released after scan raised an exception"
state._google_scan_lock.release()
# ── POST /api/google/scan/cancel ─────────────────────────────────────────────
class TestGoogleScanCancel:
def test_sets_abort_event(self, client):
from routes import state
state._google_scan_abort.clear()
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
assert state._google_scan_abort.is_set()
def test_idempotent_when_not_running(self, client):
r = client.post("/api/google/scan/cancel")
assert r.status_code == 200
assert r.json["status"] == "cancelling"
# ── _run_google_scan engine ───────────────────────────────────────────────────
class TestRunGoogleScan:
"""
Unit-tests for _run_google_scan() called synchronously with all heavy
dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O.
"""
def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None):
import gdpr_scanner
import checkpoint
import scan_engine
import gdpr_db
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", conn)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
monkeypatch.setattr(gdpr_scanner, "_scan_bytes",
lambda data, name, **kw: scan_bytes_result or {
"cprs": [], "pii_counts": None, "emails": [], "phones": []
})
monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None)
monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {})
monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None)
monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card)
monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None)
gdpr_scanner.flagged_items.clear()
return events
def _run(self, monkeypatch, conn, options, scan_bytes_result=None):
import gdpr_scanner
import routes.google_scan as gs
events = self._setup_mocks(monkeypatch, conn, scan_bytes_result)
gs._run_google_scan(options)
gdpr_scanner.flagged_items.clear()
return events
def test_no_connector_broadcasts_error_and_done(self, monkeypatch):
import gdpr_scanner
import routes.google_scan as gs
from routes import state
events = []
monkeypatch.setattr(state, "google_connector", None)
monkeypatch.setattr(gdpr_scanner, "broadcast",
lambda evt, data=None: events.append((evt, data or {})))
gs._run_google_scan({"sources": ["gmail"], "user_emails": ["a@b.dk"], "options": {}})
assert any(evt == "scan_error" for evt, _ in events)
assert any(evt == "google_scan_done" for evt, _ in events)
def test_gmail_item_with_cpr_is_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"),
]
cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert len(flagged) == 1
def test_gmail_item_source_type_is_gmail(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
flagged = [d for evt, d in events if evt == "scan_file_flagged"]
assert flagged[0]["source_type"] == "gmail"
def test_gmail_item_without_pii_not_flagged(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"),
]
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
assert not any(evt == "scan_file_flagged" for evt, _ in events)
def test_gdrive_item_source_type_is_gdrive(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
conn.iter_drive_files.return_value = [
({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"),
]
cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"]
assert len(gdrive) == 1
def test_scan_done_always_broadcast(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = []
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
done = [d for evt, d in events if evt == "google_scan_done"]
assert len(done) == 1
assert "flagged_count" in done[0]
assert "total_scanned" in done[0]
def test_scan_done_counts_are_correct(self, monkeypatch):
conn = MagicMock()
conn.list_users.return_value = []
conn.iter_gmail_messages.return_value = [
({"id": "m1", "name": "a.txt", "size": 100}, b"x"),
({"id": "m2", "name": "b.txt", "size": 100}, b"y"),
]
cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []}
events = self._run(monkeypatch, conn,
{"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
scan_bytes_result=cpr_result)
done = next(d for evt, d in events if evt == "google_scan_done")
assert done["total_scanned"] == 2
assert done["flagged_count"] == 2

@ -1,663 +0,0 @@
"""
Route integration tests for security-sensitive paths and data-correctness contracts.
Covers:
- Viewer token CRUD and scope validation
- GET /api/db/flagged: role and user scope enforcement
- POST /api/db/disposition/bulk: only updates selected items
- Viewer PIN: set / verify / rate-limit / clear
- Interface PIN: set / gate / clear
- Scan lock always released (even when run_scan raises)
- GET /api/db/sessions: basic shape
- Profile routes: CRUD and rename
"""
from __future__ import annotations
import time
from unittest.mock import MagicMock
import pytest
# ---------------------------------------------------------------------------
# Module-level app fixture (shared with test_routes.py via flask_app)
# ---------------------------------------------------------------------------
@pytest.fixture(scope="module")
def flask_app():
import gdpr_scanner
gdpr_scanner.app.config["TESTING"] = True
gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
return gdpr_scanner.app
@pytest.fixture()
def client(flask_app):
with flask_app.test_client() as c:
yield c
@pytest.fixture()
def db_patch(tmp_path, monkeypatch):
from gdpr_db import ScanDB
import routes.database, routes.export
db = ScanDB(str(tmp_path / "test.db"))
monkeypatch.setattr(routes.database, "_get_db", lambda: db)
monkeypatch.setattr(routes.database, "DB_OK", True)
monkeypatch.setattr(routes.export, "_get_db", lambda: db)
monkeypatch.setattr(routes.export, "DB_OK", True)
return db
@pytest.fixture()
def mock_connector(monkeypatch):
from routes import state
conn = MagicMock()
monkeypatch.setattr(state, "connector", conn)
return conn
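@pytest.fixture(autouse=True)
def clean_state():
    from routes import state
    yield
    state.flagged_items.clear()
    # Probe the scan lock: if it is free, acquire and release it at once;
    # a lock still held by a background scan thread is left alone.
    if state._scan_lock.acquire(blocking=False):
        state._scan_lock.release()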
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _seed_scan(db, items: list[dict]) -> int:
"""Create a completed scan and persist items. Returns the scan_id."""
scan_id = db.begin_scan({"sources": ["email"], "user_ids": [], "options": {}})
for item in items:
db.save_item(scan_id, item)
db.finish_scan(scan_id, total_scanned=len(items))
return scan_id
def _item(item_id: str, role: str = "staff", account_id: str = "") -> dict:
return {
"id": item_id,
"name": f"{item_id}.docx",
"source": "email",
"source_type": "email",
"account_id": account_id or f"{item_id}@school.dk",
"user_role": role,
"cpr_count": 1,
"face_count": 0,
"size_kb": 10,
"modified": "2025-01-01T00:00:00",
}
def _clear_viewer_pins():
"""Remove both viewer and interface PINs between tests."""
from app_config import clear_viewer_pin, clear_interface_pin
clear_viewer_pin()
clear_interface_pin()
# ---------------------------------------------------------------------------
# Viewer token CRUD
# ---------------------------------------------------------------------------
class TestViewerTokenCRUD:
def test_create_and_list(self, client):
r = client.post("/api/viewer/tokens",
json={"label": "Test token", "expires_days": 7})
assert r.status_code == 201
data = r.get_json()
assert "token" in data
tok = data["token"]
r2 = client.get("/api/viewer/tokens")
assert r2.status_code == 200
tokens = r2.get_json()
assert any(t["token"] == tok for t in tokens)
def test_delete_existing_token(self, client):
r = client.post("/api/viewer/tokens", json={"label": "to-delete"})
tok = r.get_json()["token"]
r2 = client.delete(f"/api/viewer/tokens/{tok}")
assert r2.status_code == 200
assert r2.get_json()["ok"] is True
r3 = client.get("/api/viewer/tokens")
tokens = r3.get_json()
assert not any(t["token"] == tok for t in tokens)
def test_delete_nonexistent_token_returns_404(self, client):
r = client.delete("/api/viewer/tokens/doesnotexist123")
assert r.status_code == 404
def test_validate_valid_token(self, client):
tok = client.post("/api/viewer/tokens", json={}).get_json()["token"]
r = client.post("/api/viewer/tokens/validate", json={"token": tok})
assert r.status_code == 200
assert r.get_json()["valid"] is True
def test_validate_invalid_token(self, client):
r = client.post("/api/viewer/tokens/validate",
json={"token": "notarealtoken00000000"})
assert r.status_code == 401
assert r.get_json()["valid"] is False
class TestViewerTokenScopeValidation:
def test_role_and_user_mutually_exclusive(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"role": "student", "user": "alice@school.dk"}
})
assert r.status_code == 400
assert "mutually exclusive" in r.get_json()["error"]
def test_invalid_role_value(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"role": "teacher"}
})
assert r.status_code == 400
assert "role" in r.get_json()["error"]
def test_user_email_must_contain_at(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {"user": "notanemail"}
})
assert r.status_code == 400
assert "email" in r.get_json()["error"].lower()
def test_valid_role_scope_stored(self, client):
r = client.post("/api/viewer/tokens",
json={"scope": {"role": "student"}})
assert r.status_code == 201
assert r.get_json()["scope"] == {"role": "student"}
def test_valid_user_scope_stored(self, client):
r = client.post("/api/viewer/tokens", json={
"scope": {
"user": ["alice@m365.dk", "alice@gws.dk"],
"display_name": "Alice Smith",
}
})
assert r.status_code == 201
scope = r.get_json()["scope"]
assert scope["user"] == ["alice@m365.dk", "alice@gws.dk"]
assert scope["display_name"] == "Alice Smith"
# ---------------------------------------------------------------------------
# GET /api/db/flagged — scope enforcement
# ---------------------------------------------------------------------------
class TestFlaggedScopeEnforcement:
def test_no_scope_returns_all_items(self, client, db_patch):
_seed_scan(db_patch, [
_item("s1", role="student"),
_item("s2", role="staff"),
])
r = client.get("/api/db/flagged")
assert r.status_code == 200
ids = {row["id"] for row in r.get_json()}
assert "s1" in ids
assert "s2" in ids
def test_role_scope_student_excludes_staff(self, client, db_patch):
_seed_scan(db_patch, [
_item("r1", role="student"),
_item("r2", role="staff"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"role": "student"}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "r1" in ids
assert "r2" not in ids
def test_role_scope_staff_excludes_students(self, client, db_patch):
_seed_scan(db_patch, [
_item("t1", role="student"),
_item("t2", role="staff"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"role": "staff"}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "t2" in ids
assert "t1" not in ids
def test_user_scope_returns_only_matching_account_id(self, client, db_patch):
_seed_scan(db_patch, [
_item("u1", account_id="alice@m365.dk"),
_item("u2", account_id="bob@m365.dk"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "u1" in ids
assert "u2" not in ids
def test_user_scope_matches_both_platform_emails(self, client, db_patch):
# Same person — M365 UPN and GWS email both in scope
_seed_scan(db_patch, [
_item("p1", account_id="alice@m365.dk"),
_item("p2", account_id="alice@gws.dk"),
_item("p3", account_id="bob@m365.dk"),
])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk", "alice@gws.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "p1" in ids
assert "p2" in ids
assert "p3" not in ids
def test_user_scope_case_insensitive(self, client, db_patch):
_seed_scan(db_patch, [_item("ci1", account_id="Alice@M365.dk")])
with client.session_transaction() as sess:
sess["viewer_ok"] = True
sess["viewer_scope"] = {"user": ["alice@m365.dk"]}
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "ci1" in ids
def test_no_ref_returns_open_items_across_all_sessions(self, client, db_patch):
# Two scans in separate session windows. The default (no-ref) view must
# surface unactioned items from BOTH, not just the latest session.
old_id = _seed_scan(db_patch, [_item("o1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("o2")])
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert ids == {"o1", "o2"}
def test_no_ref_excludes_items_with_a_disposition(self, client, db_patch):
_seed_scan(db_patch, [_item("d1"), _item("d2")])
db_patch.set_disposition("d1", "kept")
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "d2" in ids # untouched → still open
assert "d1" not in ids # action taken → hidden
def test_no_ref_unreviewed_disposition_stays_open(self, client, db_patch):
_seed_scan(db_patch, [_item("u1")])
db_patch.set_disposition("u1", "unreviewed")
r = client.get("/api/db/flagged")
ids = {row["id"] for row in r.get_json()}
assert "u1" in ids # 'unreviewed' status is not an action
def test_no_ref_dedupes_rescanned_item_to_latest(self, client, db_patch):
# Same item flagged by two scans → appears once.
old_id = _seed_scan(db_patch, [_item("k1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("k1")])
rows = [row for row in client.get("/api/db/flagged").get_json() if row["id"] == "k1"]
assert len(rows) == 1
def test_ref_param_loads_historical_session(self, client, db_patch):
# Push first scan >300 s into the past so it occupies its own session window.
old_id = _seed_scan(db_patch, [_item("h1")])
db_patch._connect().execute(
"UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
)
db_patch._connect().commit()
_seed_scan(db_patch, [_item("h2")])
r = client.get(f"/api/db/flagged?ref={old_id}")
ids = {row["id"] for row in r.get_json()}
assert "h1" in ids
# h2 belongs to a different (newer) session window — must not appear
assert "h2" not in ids
# ---------------------------------------------------------------------------
# POST /api/db/disposition/bulk
# ---------------------------------------------------------------------------
class TestBulkDisposition:
def test_updates_selected_items(self, client, db_patch):
_seed_scan(db_patch, [_item("b1"), _item("b2"), _item("b3")])
r = client.post("/api/db/disposition/bulk", json={
"item_ids": ["b1", "b2"],
"status": "retain-legal",
})
assert r.status_code == 200
assert r.get_json()["saved"] == 2
assert db_patch.get_disposition("b1")["status"] == "retain-legal"
assert db_patch.get_disposition("b2")["status"] == "retain-legal"
def test_unselected_item_unchanged(self, client, db_patch):
_seed_scan(db_patch, [_item("c1"), _item("c2")])
client.post("/api/db/disposition/bulk", json={
"item_ids": ["c1"],
"status": "delete-scheduled",
})
d = db_patch.get_disposition("c2")
# c2 was not in the bulk request — must remain unreviewed
assert d is None or d.get("status", "unreviewed") == "unreviewed"
def test_missing_item_ids_returns_400(self, client, db_patch):
r = client.post("/api/db/disposition/bulk",
json={"status": "retain-legal"})
assert r.status_code == 400
def test_missing_status_returns_400(self, client, db_patch):
r = client.post("/api/db/disposition/bulk",
json={"item_ids": ["x"]})
assert r.status_code == 400
def test_without_db_returns_503(self, client, monkeypatch):
import routes.database
monkeypatch.setattr(routes.database, "DB_OK", False)
r = client.post("/api/db/disposition/bulk",
json={"item_ids": ["x"], "status": "retain-legal"})
assert r.status_code == 503
# ---------------------------------------------------------------------------
# Viewer PIN
# ---------------------------------------------------------------------------
class TestViewerPin:
def setup_method(self):
_clear_viewer_pins()
def teardown_method(self):
_clear_viewer_pins()
def test_status_no_pin(self, client):
r = client.get("/api/viewer/pin")
assert r.status_code == 200
assert r.get_json()["pin_set"] is False
def test_set_and_status_reflects_set(self, client):
client.post("/api/viewer/pin", json={"pin": "1234"})
r = client.get("/api/viewer/pin")
assert r.get_json()["pin_set"] is True
def test_set_too_short_rejected(self, client):
r = client.post("/api/viewer/pin", json={"pin": "123"})
assert r.status_code == 400
def test_set_too_long_rejected(self, client):
r = client.post("/api/viewer/pin", json={"pin": "123456789"})
assert r.status_code == 400
def test_set_non_digits_rejected(self, client):
r = client.post("/api/viewer/pin", json={"pin": "abcd"})
assert r.status_code == 400
def test_verify_correct_pin_sets_session(self, client):
client.post("/api/viewer/pin", json={"pin": "4321"})
r = client.post("/api/viewer/pin/verify", json={"pin": "4321"})
assert r.status_code == 200
assert r.get_json()["ok"] is True
def test_verify_wrong_pin_returns_401(self, client):
client.post("/api/viewer/pin", json={"pin": "4321"})
r = client.post("/api/viewer/pin/verify", json={"pin": "9999"})
assert r.status_code == 401
def test_verify_rate_limit_after_5_failures(self, client):
client.post("/api/viewer/pin", json={"pin": "5678"})
from routes.viewer import _pin_attempts
_pin_attempts.clear()
for _ in range(5):
client.post("/api/viewer/pin/verify", json={"pin": "0000"})
r = client.post("/api/viewer/pin/verify", json={"pin": "0000"})
assert r.status_code == 429
_pin_attempts.clear()
def test_change_pin_requires_current(self, client):
client.post("/api/viewer/pin", json={"pin": "1111"})
r = client.post("/api/viewer/pin",
json={"pin": "2222", "current_pin": "9999"})
assert r.status_code == 403
def test_change_pin_with_correct_current(self, client):
client.post("/api/viewer/pin", json={"pin": "1111"})
r = client.post("/api/viewer/pin",
json={"pin": "2222", "current_pin": "1111"})
assert r.status_code == 200
# Old PIN no longer valid
r2 = client.post("/api/viewer/pin/verify", json={"pin": "1111"})
assert r2.status_code == 401
def test_clear_pin_requires_current(self, client):
client.post("/api/viewer/pin", json={"pin": "3333"})
r = client.delete("/api/viewer/pin", json={"current_pin": "0000"})
assert r.status_code == 403
def test_clear_pin_with_correct_current(self, client):
client.post("/api/viewer/pin", json={"pin": "3333"})
r = client.delete("/api/viewer/pin", json={"current_pin": "3333"})
assert r.status_code == 200
assert client.get("/api/viewer/pin").get_json()["pin_set"] is False
# ---------------------------------------------------------------------------
# Interface PIN
# ---------------------------------------------------------------------------
class TestInterfacePin:
def setup_method(self):
_clear_viewer_pins()
def teardown_method(self):
_clear_viewer_pins()
def test_status_no_pin(self, client):
r = client.get("/api/interface/pin")
assert r.get_json()["pin_set"] is False
def test_set_and_verify(self, client):
r = client.post("/api/interface/pin", json={"pin": "7777"})
assert r.status_code == 200
# Gate is now active — authenticate before the status check
with client.session_transaction() as sess:
sess["interface_ok"] = True
assert client.get("/api/interface/pin").get_json()["pin_set"] is True
def test_non_digit_rejected(self, client):
r = client.post("/api/interface/pin", json={"pin": "abcd"})
assert r.status_code == 400
def test_set_requires_current_when_set(self, client):
client.post("/api/interface/pin", json={"pin": "7777"})
with client.session_transaction() as sess:
sess["interface_ok"] = True
r = client.post("/api/interface/pin",
json={"pin": "8888", "current_pin": "0000"})
assert r.status_code == 403
def test_clear_requires_current(self, client):
client.post("/api/interface/pin", json={"pin": "7777"})
with client.session_transaction() as sess:
sess["interface_ok"] = True
r = client.delete("/api/interface/pin", json={"current_pin": "0000"})
assert r.status_code == 403
def test_clear_with_correct_current(self, client):
client.post("/api/interface/pin", json={"pin": "7777"})
with client.session_transaction() as sess:
sess["interface_ok"] = True
r = client.delete("/api/interface/pin", json={"current_pin": "7777"})
assert r.status_code == 200
assert client.get("/api/interface/pin").get_json()["pin_set"] is False
# ---------------------------------------------------------------------------
# Scan lock released on run_scan() exception
# ---------------------------------------------------------------------------
class TestScanLockReleasedOnError:
def test_lock_released_when_run_scan_raises(self, client, mock_connector,
monkeypatch):
import scan_engine
from routes import state
def _boom(opts):
raise RuntimeError("simulated scan failure")
monkeypatch.setattr(scan_engine, "run_scan", _boom)
r = client.post("/api/scan/start", json={"sources": ["email"]})
assert r.status_code == 200
# Wait for the background thread to finish and release the lock
deadline = time.time() + 2.0
while True:
acquired = state._scan_lock.acquire(blocking=False)
if acquired:
state._scan_lock.release()
break
assert time.time() < deadline, "scan lock was never released after exception"
time.sleep(0.05)
# ---------------------------------------------------------------------------
# GET /api/db/sessions
# ---------------------------------------------------------------------------
class TestDbSessions:
    def test_returns_list(self, client, db_patch):
        r = client.get("/api/db/sessions")
        assert r.status_code == 200
        assert isinstance(r.get_json(), list)

    def test_completed_scan_appears_in_sessions(self, client, db_patch):
        _seed_scan(db_patch, [_item("sess1")])
        r = client.get("/api/db/sessions")
        sessions = r.get_json()
        assert len(sessions) >= 1
        s = sessions[0]
        assert "ref_scan_id" in s
        assert "flagged_count" in s
        assert s["flagged_count"] == 1

    def test_sessions_ordered_newest_first(self, client, db_patch):
        # Create two scans >300 s apart so each forms its own session window.
        old_id = _seed_scan(db_patch, [_item("old1")])
        db_patch._connect().execute(
            "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
        )
        db_patch._connect().commit()
        _seed_scan(db_patch, [_item("new1")])
        sessions = client.get("/api/db/sessions").get_json()
        assert len(sessions) == 2
        # Newest session (highest ref_scan_id) must be first
        assert sessions[0]["ref_scan_id"] > sessions[1]["ref_scan_id"]
# ---------------------------------------------------------------------------
# Profile routes
# ---------------------------------------------------------------------------
class TestProfileRoutes:
    """
    Tests for GET /api/profiles, POST /api/profiles/save,
    GET /api/profiles/get, and POST /api/profiles/delete.

    Each test monkeypatches the profile storage path to a tmp directory so
    tests are fully isolated from the real ~/.gdprscanner/settings.json.
    """

    @pytest.fixture(autouse=True)
    def _isolate(self, tmp_path, monkeypatch):
        import app_config
        monkeypatch.setattr(app_config, "_SETTINGS_PATH", tmp_path / "settings.json")

    def test_list_returns_empty_list_initially(self, client):
        r = client.get("/api/profiles")
        assert r.status_code == 200
        assert r.get_json()["profiles"] == []

    def test_save_missing_name_returns_400(self, client):
        r = client.post("/api/profiles/save", json={"sources": ["email"]})
        assert r.status_code == 400
        assert "error" in r.get_json()

    def test_save_creates_profile_and_returns_it(self, client):
        r = client.post("/api/profiles/save", json={
            "id": "", "name": "Alpha", "sources": ["email"], "options": {}
        })
        assert r.status_code == 200
        data = r.get_json()
        assert data["status"] == "saved"
        assert data["profile"]["name"] == "Alpha"
        assert data["profile"]["id"]  # server assigned a non-empty id

    def test_saved_profile_appears_in_list(self, client):
        client.post("/api/profiles/save", json={"name": "Beta", "sources": [], "options": {}})
        profiles = client.get("/api/profiles").get_json()["profiles"]
        assert any(p["name"] == "Beta" for p in profiles)

    def test_rename_updates_name_in_list(self, client):
        """Regression: _pmgmtSaveFullEdit renames the copy — the API must
        persist the new name so loadProfiles() returns fresh data for the
        left-column re-render."""
        r = client.post("/api/profiles/save", json={
            "id": "", "name": "LOCAL-TEST (copy)", "sources": [], "options": {}
        })
        profile_id = r.get_json()["profile"]["id"]
        # Simulate the user renaming the copy in the editor and clicking Save
        r2 = client.post("/api/profiles/save", json={
            "id": profile_id, "name": "LOCAL-TEST-2", "sources": [], "options": {}
        })
        assert r2.status_code == 200
        assert r2.get_json()["profile"]["name"] == "LOCAL-TEST-2"
        profiles = client.get("/api/profiles").get_json()["profiles"]
        names = [p["name"] for p in profiles]
        assert "LOCAL-TEST-2" in names
        assert "LOCAL-TEST (copy)" not in names

    def test_get_by_id(self, client):
        r = client.post("/api/profiles/save", json={
            "id": "fixed-id-1", "name": "Gamma", "sources": [], "options": {}
        })
        profile_id = r.get_json()["profile"]["id"]
        r2 = client.get(f"/api/profiles/get?id={profile_id}")
        assert r2.status_code == 200
        assert r2.get_json()["profile"]["name"] == "Gamma"

    def test_get_nonexistent_returns_404(self, client):
        r = client.get("/api/profiles/get?id=does-not-exist")
        assert r.status_code == 404

    def test_delete_removes_profile(self, client):
        client.post("/api/profiles/save", json={"name": "ToDelete", "sources": [], "options": {}})
        r = client.post("/api/profiles/delete", json={"name": "ToDelete"})
        assert r.status_code == 200
        assert r.get_json()["status"] == "deleted"
        profiles = client.get("/api/profiles").get_json()["profiles"]
        assert not any(p["name"] == "ToDelete" for p in profiles)

    def test_delete_nonexistent_returns_not_found(self, client):
        r = client.post("/api/profiles/delete", json={"name": "Ghost"})
        assert r.status_code == 200
        assert r.get_json()["status"] == "not_found"

    def test_delete_missing_key_returns_400(self, client):
        r = client.post("/api/profiles/delete", json={})
        assert r.status_code == 400
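
The lock-release test in this file polls `state._scan_lock` until the background scan thread gives it back. A minimal sketch of the pattern that test guards is shown here, assuming the route starts a worker thread and releases the lock in a `finally` block. The names `scan_lock`, `start_scan` and `wait_for_release` are hypothetical stand-ins, not the project's actual code.

```python
import threading
import time

# Stand-in for state._scan_lock (assumption: a plain threading.Lock).
scan_lock = threading.Lock()


def start_scan(run_scan, opts):
    """Run run_scan(opts) in a background thread.

    Returns False when another scan already holds the lock.
    """
    if not scan_lock.acquire(blocking=False):
        return False

    def _worker():
        try:
            run_scan(opts)
        except Exception:
            # A real route would log or broadcast the error here.
            pass
        finally:
            # Always release, even when run_scan() raises.
            scan_lock.release()

    threading.Thread(target=_worker, daemon=True).start()
    return True


def wait_for_release(lock, timeout=2.0, interval=0.05):
    """Poll until the lock can be acquired, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):
            lock.release()
            return True
        time.sleep(interval)
    return False
```

A `threading.Lock` may be released by a thread other than the one that acquired it, which is what lets the request handler acquire and the worker release.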

View File

@ -97,22 +97,6 @@ class TestScanStatus:
assert "scan_id" in data
assert data["scan_id"] is None
    def test_idle_reports_google_not_running(self, client):
        # The refresh/restore path relies on google_running being reported
        # separately — running alone misses live Google scans.
        data = client.get("/api/scan/status").get_json()
        assert data["google_running"] is False

    def test_google_lock_held_reports_google_running(self, client):
        from routes import state
        assert state._google_scan_lock.acquire(blocking=False)
        try:
            data = client.get("/api/scan/status").get_json()
            assert data["google_running"] is True
            assert data["running"] is False  # M365/file lock still free
        finally:
            state._google_scan_lock.release()
# ---------------------------------------------------------------------------
# /api/scan/start

View File

@ -1,222 +0,0 @@
"""
Tests for the software-update routes (routes/updates.py).

All git interaction is mocked; no test touches the real repository,
the network, or restarts the process.
"""
from __future__ import annotations

import subprocess

import pytest


@pytest.fixture(scope="module")
def flask_app():
    import gdpr_scanner
    gdpr_scanner.app.config["TESTING"] = True
    return gdpr_scanner.app


@pytest.fixture()
def client(flask_app):
    with flask_app.test_client() as c:
        yield c


def _cp(returncode=0, stdout="", stderr=""):
    return subprocess.CompletedProcess(args=[], returncode=returncode,
                                       stdout=stdout, stderr=stderr)


def _fake_git(*, local="aaaaaaa1", remote="aaaaaaa1", branch="main",
              fetch_rc=0, dirty=False, reqs_changed=False, merge_rc=0,
              commits=""):
    """Build a _git() replacement dispatching on the git subcommand."""
    calls = []

    def fake(*args, timeout=None):
        calls.append(args)
        if args[:2] == ("rev-parse", "--abbrev-ref"):
            return _cp(stdout=branch + "\n")
        if args == ("rev-parse", "HEAD"):
            return _cp(stdout=local + "\n")
        if args[0] == "rev-parse":
            return _cp(stdout=remote + "\n")
        if args[0] == "fetch":
            return _cp(returncode=fetch_rc, stderr="fetch failed" if fetch_rc else "")
        if args[0] == "log":
            return _cp(stdout=commits)
        if args[0] == "diff-index":
            return _cp(returncode=1 if dirty else 0)
        if args[0] == "diff":
            return _cp(returncode=1 if reqs_changed else 0)
        if args[0] == "merge":
            return _cp(returncode=merge_rc, stderr="not a fast-forward" if merge_rc else "")
        if args[0] == "stash":
            return _cp()
        raise AssertionError(f"unexpected git call: {args}")

    fake.calls = calls
    return fake


@pytest.fixture(autouse=True)
def supported(monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_supported", lambda: True)


@pytest.fixture(autouse=True)
def no_audit(monkeypatch):
    import gdpr_db
    monkeypatch.setattr(gdpr_db, "log_audit_event", lambda *a, **k: None)
# ── /api/update/check ─────────────────────────────────────────────────────────
def test_check_unsupported(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_supported", lambda: False)
    r = client.get("/api/update/check")
    assert r.status_code == 200
    assert r.get_json() == {"supported": False}


def test_check_up_to_date(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git())
    d = client.get("/api/update/check").get_json()
    assert d["supported"] and d["up_to_date"]
    assert d["commits"] == []


def test_check_update_available(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(
        local="aaaaaaa1", remote="bbbbbbb2",
        commits="bbbbbbb2 Fix thing\nccccccc3 Add thing\n"))
    d = client.get("/api/update/check").get_json()
    assert d["up_to_date"] is False
    assert d["current"] == "aaaaaaa"
    assert d["latest"] == "bbbbbbb"
    assert len(d["commits"]) == 2


def test_check_fetch_failure(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(fetch_rc=1))
    d = client.get("/api/update/check").get_json()
    assert d["supported"] is True
    assert "fetch failed" in d["error"]
# ── /api/update/apply ─────────────────────────────────────────────────────────
def test_apply_up_to_date_is_noop(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git())
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    d = r.get_json()
    assert d["ok"] is True and d["updated"] is False


def test_apply_refused_while_scan_running(client, monkeypatch):
    import routes.updates as upd
    from routes import state
    monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2"))
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    assert state._scan_lock.acquire(blocking=False)
    try:
        r = client.post("/api/update/apply")
    finally:
        state._scan_lock.release()
    assert r.status_code == 409
    assert r.get_json()["code"] == "scan_running"


def test_apply_happy_path(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", commits="bbbbbbb2 Fix\n")
    monkeypatch.setattr(upd, "_git", fake)
    restarts = []
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: restarts.append(1))
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    d = r.get_json()
    assert d["ok"] and d["updated"] and d["restarting"]
    assert d["from"] == "aaaaaaa" and d["to"] == "bbbbbbb"
    assert restarts == [1]
    assert ("merge", "--ff-only", "origin/main") in fake.calls
    # tree was clean — no stash
    assert not any(c[0] == "stash" for c in fake.calls)


def test_apply_stashes_dirty_tree(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", dirty=True)
    monkeypatch.setattr(upd, "_git", fake)
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    assert any(c[0] == "stash" for c in fake.calls)


def test_apply_merge_failure(client, monkeypatch):
    import routes.updates as upd
    monkeypatch.setattr(upd, "_git", _fake_git(remote="bbbbbbb2", merge_rc=1))
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: pytest.fail("must not restart"))
    r = client.post("/api/update/apply")
    assert r.status_code == 409
    d = r.get_json()
    assert d["code"] == "merge_failed"
    assert "fast-forward" in d["error"]


def test_apply_installs_requirements_when_changed(client, monkeypatch):
    import routes.updates as upd
    fake = _fake_git(remote="bbbbbbb2", reqs_changed=True)
    monkeypatch.setattr(upd, "_git", fake)
    monkeypatch.setattr(upd, "_schedule_restart", lambda *a, **k: None)
    pip_calls = []
    monkeypatch.setattr(upd.subprocess, "run",
                        lambda cmd, **kw: pip_calls.append(cmd) or _cp())
    r = client.post("/api/update/apply")
    assert r.status_code == 200
    assert len(pip_calls) == 1
    assert "pip" in pip_calls[0] and "-r" in pip_calls[0]
# ── Restart fd hygiene ────────────────────────────────────────────────────────
def test_mark_fds_cloexec_unmarks_inheritable_socket():
    """Werkzeug sets the listening socket inheritable; the restart must undo
    that or the socket leaks through execv and squats on the port."""
    import socket
    import routes.updates as upd
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.set_inheritable(True)
        assert s.get_inheritable() is True
        upd._mark_fds_cloexec()
        assert s.get_inheritable() is False
    finally:
        s.close()


# ── /api/update/settings ──────────────────────────────────────────────────────
def test_settings_roundtrip(client, monkeypatch):
    import routes.updates as upd
    store = {"auto_update": False}
    monkeypatch.setattr(upd, "get_update_config", lambda: dict(store))
    monkeypatch.setattr(upd, "save_update_config",
                        lambda v: store.__setitem__("auto_update", bool(v)))
    d = client.get("/api/update/settings").get_json()
    assert d == {"supported": True, "auto_update": False}
    r = client.post("/api/update/settings", json={"auto_update": True})
    assert r.get_json() == {"ok": True}
    assert store["auto_update"] is True
    d = client.get("/api/update/settings").get_json()
    assert d["auto_update"] is True
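
The fd-hygiene test above expects `_mark_fds_cloexec()` to clear the inheritable flag on open descriptors before the process re-executes itself. One way such a helper could look is sketched here, assuming a POSIX system where sockets are ordinary file descriptors. The function name mirrors the test, but the body and the `max_fd` parameter are assumptions, not the code in routes/updates.py.

```python
import os
import socket


def mark_fds_cloexec(max_fd: int = 1024) -> None:
    """Mark every open descriptor above stderr as non-inheritable.

    Descriptors 0-2 are left alone so the re-executed process keeps its
    standard streams. Closed descriptor numbers raise OSError and are skipped.
    """
    for fd in range(3, max_fd):
        try:
            os.set_inheritable(fd, False)
        except OSError:
            continue  # fd is not open


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.set_inheritable(True)
        mark_fds_cloexec()
        print("inheritable after cleanup:", s.get_inheritable())
    finally:
        s.close()
```

Scanning a fixed range keeps the sketch portable across POSIX systems; a Linux-only variant could list `/proc/self/fd` instead.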

View File

@ -1,83 +0,0 @@
#!/usr/bin/env bash
# GDPRScanner — self-update script.
#
# Pulls the latest release from origin, reinstalls dependencies if they
# changed, and restarts the systemd service if one is installed.
# Safe to run from cron: exits quietly when already up to date, and
# auto-stashes local hotfixes instead of aborting the merge.
#
# Usage:
#   ./update_gdpr.sh           # update if origin has new commits
#   ./update_gdpr.sh --check   # report status only, change nothing
#
# Environment:
#   GDPR_BRANCH    branch to track (default: main)
#   GDPR_SERVICE   systemd unit to restart (default: gdprscanner, if it exists)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BRANCH="${GDPR_BRANCH:-main}"
SERVICE="${GDPR_SERVICE:-gdprscanner}"

log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"; }

cd "$SCRIPT_DIR"

if [ ! -d .git ]; then
    log "ERROR: $SCRIPT_DIR is not a git checkout — cannot self-update."
    exit 1
fi

git fetch origin "$BRANCH" --quiet
LOCAL="$(git rev-parse HEAD)"
REMOTE="$(git rev-parse "origin/$BRANCH")"

if [ "$LOCAL" = "$REMOTE" ]; then
    log "Already up to date ($(git describe --always HEAD))."
    exit 0
fi

log "Update available: $(git rev-parse --short HEAD) -> $(git rev-parse --short "$REMOTE")"
git log --oneline "HEAD..origin/$BRANCH" | sed 's/^/ /'

if [ "${1:-}" = "--check" ]; then
    exit 0
fi

# Local edits (e.g. a hotfix applied directly on the server) would make the
# merge abort. Stash them so the update proceeds; the stash is kept so
# nothing is lost.
if ! git diff-index --quiet HEAD --; then
    log "Local changes detected — stashing:"
    git diff --stat HEAD | sed 's/^/ /'
    git stash push --quiet -m "update_gdpr.sh auto-stash $(date '+%Y-%m-%d %H:%M:%S')"
    log "Recover later with: git stash show -p / git stash pop"
fi

REQS_CHANGED=false
if ! git diff --quiet "HEAD..origin/$BRANCH" -- requirements.txt; then
    REQS_CHANGED=true
fi

# Fast-forward only: the server checkout must never diverge from origin.
git merge --ff-only --quiet "origin/$BRANCH"
log "Updated to $(git rev-parse --short HEAD)."

if [ "$REQS_CHANGED" = true ]; then
    log "requirements.txt changed — updating dependencies..."
    "$SCRIPT_DIR/venv/bin/pip" install --quiet -r requirements.txt
    log "Dependencies updated."
fi

if command -v systemctl >/dev/null 2>&1 \
    && systemctl list-unit-files --type=service 2>/dev/null | grep -q "^$SERVICE\.service"; then
    log "Restarting $SERVICE.service..."
    systemctl restart "$SERVICE"
    log "Service restarted."
else
    log "No systemd unit '$SERVICE' found — restart GDPRScanner manually."
fi

log "Done."
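
The order of operations in the script above can be summarised as a small pure function, which may help when reasoning about which steps run in which situation. This is only an illustrative sketch: `plan_update` and its step names are hypothetical and do not exist in the repository.

```python
from __future__ import annotations


def plan_update(local: str, remote: str, *, dirty: bool = False,
                reqs_changed: bool = False, check_only: bool = False) -> list[str]:
    """Return the ordered steps the update script would take (hypothetical helper)."""
    if local == remote:
        return []                    # already up to date: exit quietly
    steps = ["report"]               # log the pending commits
    if check_only:
        return steps                 # --check: report status only, change nothing
    if dirty:
        steps.append("stash")        # keep local hotfixes from aborting the merge
    steps.append("merge-ff-only")    # the checkout must never diverge from origin
    if reqs_changed:
        steps.append("pip-install")  # only when requirements.txt differs
    steps.append("restart")          # restart the service if a unit is installed
    return steps
```

For example, `plan_update("aaaaaaa1", "bbbbbbb2", dirty=True)` yields the report, stash, merge and restart steps in that order, matching the sequence of the shell script.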