Add CPR cross-referencing (related documents)

Clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel, listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and scan history mode.

Implementation
- GDPRDb.get_related_items(): SQL self-join on the existing cpr_index table using the same symmetric 300 s session window as get_session_items. No new data collection needed.
- GET /api/db/related/<item_id>?ref=N: new endpoint in routes/database.py, consistent with the ?ref convention used by /api/db/flagged.
- #previewRelated div injected between the metadata block and disposition row in the preview panel.
- _loadRelated(f) in results.js fetches and renders the list; window._openRelated() resolves items from the live grid or falls back to the API response for history-mode items.

Also
- Added keyword/FTS5 search as a deferred idea in SUGGESTIONS.md
- Updated CHANGELOG.md, README.md, and CLAUDE.md
2026-04-25 21:15:50 +02:00
commit d84e57239a
parent 8b55e9d933
8 changed files with 118 additions and 1 deletions
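
The self-join described in the commit message can be tried out in isolation. The following is a minimal sketch, assuming a cut-down schema (only the columns the query touches), made-up sample rows, and a hypothetical `related()` helper; it is not the project's code, only an illustration of how joining `cpr_index` to itself counts shared CPR hashes within a symmetric session window.

```python
import sqlite3

# Toy in-memory database. Table and column names follow the commit text;
# the real tables carry more columns than shown here (assumption).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at REAL, finished_at REAL);
    CREATE TABLE flagged_items (id TEXT PRIMARY KEY, scan_id INTEGER,
                                name TEXT, cpr_count INTEGER);
    CREATE TABLE cpr_index (cpr_hash TEXT, item_id TEXT, scan_id INTEGER);
""")
conn.execute("INSERT INTO scans VALUES (1, 1000.0, 1100.0)")
conn.executemany(
    "INSERT INTO flagged_items VALUES (?, 1, ?, ?)",
    [("a", "a.docx", 2), ("b", "b.pdf", 2), ("c", "c.eml", 1), ("d", "d.txt", 1)],
)
conn.executemany(
    "INSERT INTO cpr_index VALUES (?, ?, 1)",
    [("h1", "a"), ("h2", "a"), ("h1", "b"), ("h2", "b"), ("h1", "c"), ("h3", "d")],
)

def related(conn, item_id, session_start, window_seconds=300):
    """Hypothetical helper: items sharing at least one CPR hash with item_id."""
    rows = conn.execute(
        """SELECT fi.*, COUNT(DISTINCT ci2.cpr_hash) AS shared_cprs
           FROM cpr_index ci1
           JOIN cpr_index ci2 ON ci2.cpr_hash = ci1.cpr_hash
           JOIN flagged_items fi ON fi.id = ci2.item_id
           JOIN scans s ON fi.scan_id = s.id
           WHERE ci1.item_id = ?
             AND fi.id != ?
             AND s.started_at BETWEEN ? AND ?
             AND s.finished_at IS NOT NULL
           GROUP BY fi.id
           ORDER BY shared_cprs DESC, fi.cpr_count DESC""",
        (item_id, item_id, session_start - window_seconds, session_start + window_seconds),
    ).fetchall()
    return [dict(r) for r in rows]

result = related(conn, "a", 1000.0)
print([(r["id"], r["shared_cprs"]) for r in result])
```

In this sample, item `b` shares both hashes with `a`, item `c` shares one, and item `d` shares none, so only `b` and `c` come back, in that order.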
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 - **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
 - **CPR cross-referencing (related documents)** — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects `?ref=N`). Powered by a self-join on the existing `cpr_index` table — no new data collection needed. New `GDPRDb.get_related_items(item_id, ref_scan_id)` method and `GET /api/db/related/<item_id>?ref=N` endpoint in `routes/database.py`. Frontend: `#previewRelated` div in the preview panel, `_loadRelated(f)` in `results.js`, `window._openRelated(id, itemData)` helper (looks up live `S.flaggedData` first, falls back to API response for history items).
 - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
 - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -140,6 +140,15 @@ Allows reviewing results from any past scan session without running a new scan.
 - **Auto-load on page load** — `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
 - **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
 ## CPR cross-referencing — gdpr_db.py + routes/database.py + static/js/results.js
 - **`GDPRDb.get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find other items in the same session window that share ≥1 CPR hash with `item_id`. Returns rows ordered by `shared_cprs DESC, cpr_count DESC`. Uses the same 300 s symmetric window as `get_session_items` — do not change the window size independently.
 - **`GET /api/db/related/<item_id>?ref=N`** (`routes/database.py`) — passes `item_id` and optional `ref_scan_id` to `get_related_items`; normalises JSON columns (same logic as `db_flagged_items`). Returns `[]` when `DB_OK` is false.
 - **`#previewRelated`** — `<div>` inserted between `#previewMeta` and the disposition row in `index.html`. Hidden (`display:none`) when not in use; shown by `_loadRelated`.
 - **`_loadRelated(f)`** (`results.js`) — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview` after `loadDisposition`.
 - **`window._openRelated(id, itemData)`** (`results.js`) — resolves the target item: looks up `id` in `S.flaggedData` first (live/history grid already loaded), falls back to `itemData` from the API response (history items not yet in the grid). Calls `openPreview`.
 - **No new data collection** — `cpr_index` already stores `(cpr_hash, item_id, scan_id)` for every CPR hit at write time. Cross-referencing is entirely a query-time operation.
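
The JSON-column normalisation mentioned for the endpoint above can be sketched as a small standalone function. This is an illustration under stated assumptions: `normalise_row` is a hypothetical name, and treating a `NULL` `special_category` as an empty list is a simplification chosen for the sketch rather than confirmed behaviour of the route.

```python
import json

def normalise_row(row: dict) -> dict:
    """Hypothetical helper: decode JSON text columns into Python objects."""
    out = dict(row)  # work on a copy so the caller's row is left untouched

    # special_category is stored as JSON text; empty or missing becomes [].
    sc = out.get("special_category")
    out["special_category"] = json.loads(sc or "[]") if isinstance(sc, str) else (sc or [])

    # exif_json is exposed to the frontend as "exif"; the raw column is dropped.
    exif_raw = out.pop("exif_json", None)
    out["exif"] = json.loads(exif_raw or "{}") if isinstance(exif_raw, str) else out.get("exif", {})
    return out

decoded = normalise_row(
    {"id": "x", "special_category": '["health"]', "exif_json": '{"Make": "X"}'}
)
empty = normalise_row({"id": "y", "special_category": None, "exif_json": None})
```

Keeping the decoding in one place would let both `/api/db/flagged` and `/api/db/related/<item_id>` return rows of the same shape, which is what the frontend's `openPreview` expects in either mode.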
 ## SSE teardown — static/js/scan.js
 - **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
--- a/README.md
+++ b/README.md
@@ -46,6 +46,7 @@ an IDE with intelligent completion. The result is the author's work.
 - **Account name on cards** — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable
 - **Retention policy enforcement** — flag items older than a configurable retention period with an Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via `--retention-years`
 - **Data subject lookup** — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext
 - **CPR cross-referencing** — clicking any flagged card with CPR hits shows a "Related documents" section listing other items from the same scan session that share at least one CPR number, ordered by number of shared CPRs. Clicking any entry opens it in the preview panel. Works in live mode and history mode. Powered by a SQL self-join on the `cpr_index` table — no new data collection required
 - **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel; **bulk disposition tagging** lets you select multiple cards with checkboxes and apply a disposition to all of them at once. A stats bar above the grid shows total · unreviewed · retain · delete counts and the percentage reviewed
 - **Interface PIN** — optional session-level PIN that gates the main scanner interface (`/`). Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; unauthenticated visitors are redirected to `/login`. The `/view` viewer route and all viewer API endpoints are exempt — reviewers are unaffected. Salted SHA-256 hash; brute-force protection (5 attempts / 5 min per IP)
 - **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings. Tokens can be **role-scoped** (Ansatte / Elever) so a recipient only sees items for their group, or **user-scoped** so an individual employee only sees their own flagged files (supports dual M365 + Google Workspace identity)
--- a/SUGGESTIONS.md
+++ b/SUGGESTIONS.md
@@ -351,6 +351,23 @@ Write redacted copies of flagged files with CPR numbers replaced by `XXX XXXX-XX
 Auto-email now fires on manual scans when **Email report after manual scan** is enabled in Settings → Email report. Toggle stored as `auto_email_manual` in `smtp.json`. Implemented in `routes/scan.py` — `_maybe_send_auto_email()` is called from the `_run()` thread after `run_scan()` returns. Same Graph-first → SMTP-fallback pattern as scheduled scans. Only fires when there are flagged items and at least one recipient is configured.
 ### Keyword / name search across flagged document content
 Allow a DPO to type a name (or any keyword) into a search box and find every flagged document whose extracted text contains that string. Complements CPR cross-referencing (#see above) for cases where the person's CPR is not present but their name is.
 **Implementation outline:**
 1. **Store text snippets at scan time** — `_scan_bytes` already extracts plain text for CPR matching; store a 2–4 KB prefix of that text per item in a new `text_snippet TEXT` column on `flagged_items`, or in a separate `content_index` table. Truncation avoids bloating the DB; the snippet covers most short documents in full.
 2. **SQLite FTS5 virtual table** — `CREATE VIRTUAL TABLE content_fts USING fts5(item_id UNINDEXED, snippet)`. Populated at scan time alongside `cpr_index`. FTS5 is bundled with SQLite ≥ 3.9 (macOS ships ≥ 3.37) — no external dependency.
 3. **`GET /api/db/search?q=<term>&ref=N`** — queries `content_fts` with `MATCH ?`, joins back to `flagged_items` within the session window, returns matching items. SQLite FTS5 supports phrase queries, prefix wildcards (`name*`), and Boolean operators automatically.
 4. **Search bar in the filter strip** — a plain `<input type="search">` next to the existing role/source filters. Debounced 300 ms. Results replace the grid (with a "Clear search" pill to return to full view). No new UI paradigm needed.
 **Why deferred:** requires a DB migration + storing text at scan time (increases DB size). The CPR cross-reference (already implemented) covers the most common "find all data about this person" use case without storing any raw text. Implement if a school requests free-text search.
 **Size:** Medium · **Priority:** Low
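
 A minimal sketch of steps 2 and 3 of the outline above, assuming the Python `sqlite3` module is linked against an SQLite build with FTS5 enabled. The sample snippets and the `search()` helper are invented for illustration; the session-window join back to `flagged_items` is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# item_id is stored but not tokenised; only the snippet column is searchable.
conn.execute("CREATE VIRTUAL TABLE content_fts USING fts5(item_id UNINDEXED, snippet)")
conn.executemany(
    "INSERT INTO content_fts (item_id, snippet) VALUES (?, ?)",
    [
        ("a", "Meeting notes about Anders Jensen and the class trip"),
        ("b", "Invoice for canteen supplies"),
        ("c", "Letter to the parents of Anders Jensen"),
    ],
)

def search(conn, term):
    """Hypothetical helper: return item ids whose snippet matches an FTS5 query."""
    rows = conn.execute(
        "SELECT item_id FROM content_fts WHERE content_fts MATCH ? ORDER BY item_id",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]

phrase_hits = search(conn, '"anders jensen"')   # phrase query
prefix_hits = search(conn, "cant*")             # prefix wildcard
boolean_hits = search(conn, "anders NOT trip")  # Boolean operator
```

 Raw user input would need care before being passed to `MATCH`, since characters such as quotes and hyphens are part of the FTS5 query syntax.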
 ---
 ### Phase 2 PII: name-based roster lookup
 Flag documents containing the full names of students or staff — even when no CPR is present. Implementation outline:
--- a/gdpr_db.py
+++ b/gdpr_db.py
@@ -523,6 +523,37 @@ class ScanDB:
            result.append(d)
        return result
    def get_related_items(self, item_id: str, ref_scan_id: int | None = None,
                          window_seconds: int = 300) -> list[dict]:
        """Return flagged items from the same session that share at least one CPR
        hash with *item_id*, ordered by number of shared CPRs descending."""
        if ref_scan_id:
            row = self._connect().execute(
                "SELECT started_at FROM scans WHERE id=?", (ref_scan_id,)
            ).fetchone()
        else:
            row = self._connect().execute(
                "SELECT started_at FROM scans WHERE finished_at IS NOT NULL ORDER BY id DESC LIMIT 1"
            ).fetchone()
        if not row:
            return []
        latest_start = row[0]
        rows = self._connect().execute(
            """SELECT fi.*, COUNT(DISTINCT ci2.cpr_hash) AS shared_cprs
               FROM cpr_index ci1
               JOIN cpr_index ci2 ON ci2.cpr_hash = ci1.cpr_hash
               JOIN flagged_items fi ON fi.id = ci2.item_id
               JOIN scans s ON fi.scan_id = s.id
               WHERE ci1.item_id = ?
                 AND fi.id != ?
                 AND s.started_at BETWEEN ? AND ?
                 AND s.finished_at IS NOT NULL
               GROUP BY fi.id
               ORDER BY shared_cprs DESC, fi.cpr_count DESC""",
            (item_id, item_id, latest_start - window_seconds, latest_start + window_seconds),
        ).fetchall()
        return [dict(r) for r in rows]
    def get_session_sources(self, window_seconds: int = 300) -> set:
        """Return the union of all source keys scanned in the current session.
--- a/routes/database.py
+++ b/routes/database.py
@@ -204,6 +204,22 @@ def db_flagged_items():
    return jsonify(out)
@bp.route("/api/db/related/<item_id>")
 def db_related_items(item_id):
    """Return flagged items from the same session sharing at least one CPR hash."""
    if not DB_OK:
        return jsonify([])
    ref = request.args.get("ref", type=int)
    import json as _json
    out = []
    for row in _get_db().get_related_items(item_id, ref_scan_id=ref):
        row["special_category"] = _json.loads(row.get("special_category") or "[]") if isinstance(row.get("special_category"), str) else row.get("special_category", [])
        row["exif"] = _json.loads(row.get("exif_json") or "{}") if isinstance(row.get("exif_json"), str) else row.get("exif", {})
        row.pop("exif_json", None)
        out.append(row)
    return jsonify(out)
@bp.route("/api/db/deletion_log")
 def db_deletion_log():
    """Return the deletion audit log.
--- a/static/js/results.js
+++ b/static/js/results.js
@@ -110,7 +110,8 @@ async function openPreview(f) {
  ].filter(Boolean).join('');
  _previewItemId = f.id;
-  loadDisposition(f.id);  // load disposition for this item (#6)
+  loadDisposition(f.id);
  _loadRelated(f);
  try {
    const r = await fetch('/api/preview/' + encodeURIComponent(f.id)
@@ -176,6 +177,44 @@ async function openPreview(f) {
  }
 }
 // ── Related documents (CPR cross-reference) ───────────────────────────────────
 async function _loadRelated(f) {
  const el = document.getElementById('previewRelated');
  if (!el) return;
  if (!f.cpr_count) { el.style.display = 'none'; return; }
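  // Build the query string only when a history session is selected (avoids a dangling "?" or "?&").
  const ref = S._historyRefScanId ? `?ref=${encodeURIComponent(S._historyRefScanId)}` : '';
  try {
    const r = await fetch(`/api/db/related/${encodeURIComponent(f.id)}${ref}`);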
    const items = await r.json();
    if (f.id !== _previewItemId) return; // stale
    if (!items.length) { el.style.display = 'none'; return; }
    // Escape text for safe use in HTML text nodes and quoted attributes.
    const esc = s => String(s ?? '').replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;').replace(/'/g,'&#39;');
    const rows = items.map(item => {
      const shared = item.shared_cprs ?? '';
      const badge  = shared ? `<span style="font-size:9px;padding:1px 5px;border-radius:10px;background:var(--danger);color:#fff;font-weight:500;flex-shrink:0">${shared} CPR</span>` : '';
      const src    = item.source ? `<span style="color:var(--muted);font-size:10px;flex-shrink:0">${esc(item.source)}</span>` : '';
      // JSON payloads are entity-escaped so embedded quotes cannot end the onclick attribute.
      return `<div onclick="window._openRelated(${esc(JSON.stringify(item.id))},${esc(JSON.stringify(item))})"
                   style="display:flex;align-items:center;gap:6px;padding:4px 0;cursor:pointer;border-radius:4px"
                   onmouseover="this.style.background='var(--surface)'" onmouseout="this.style.background=''">
        <span style="flex:1;font-size:11px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(item.name)}">${esc(item.name)}</span>
        ${src}${badge}
      </div>`;
    }).join('');
    el.innerHTML = `<div style="font-size:10px;font-weight:600;color:var(--muted);margin-bottom:4px;text-transform:uppercase;letter-spacing:.04em">${t('m365_related_docs','Related documents')} <span style="font-weight:400">(${items.length})</span></div>${rows}`;
    el.style.display = 'block';
  } catch(e) {
    el.style.display = 'none';
  }
 }
 window._openRelated = function(id, itemData) {
  const cached = (S.flaggedData || []).find(x => x.id === id);
  openPreview(cached || itemData);
 };
 // ── Retention policy (#1) ────────────────────────────────────────────────────
 function toggleRetentionPanel() {
--- a/templates/index.html
+++ b/templates/index.html
@@ -478,6 +478,8 @@ document.addEventListener('DOMContentLoaded', applyI18n);
              <iframe id="previewFrame" sandbox="allow-scripts allow-same-origin allow-forms allow-popups" style="display:none"></iframe>
            </div>
            <div class="preview-meta" id="previewMeta"></div>
            <!-- Related documents -->
            <div id="previewRelated" style="display:none;padding:8px 14px 4px;border-top:1px solid var(--border)"></div>
            <!-- Disposition widget (#6) -->
            <div class="disposition-row" id="dispositionRow" style="display:none">
              <span class="disposition-label" data-i18n="m365_disposition_label">Disposition</span>