Recover unfinished scans so their items aren't stranded

get_session_items / get_open_items / latest_scan_id all require finished_at IS NOT NULL, but the M365 and Google engines return early on abort (skipping finish_scan), and a process kill mid-scan (deploy, OOM, crash) never reaches it either. Result on prod: 41/42 scans had finished_at NULL, so 291 already-saved flagged items were invisible and the grid showed nothing.

- finalize_orphan_scans(): finalises every finished_at-NULL scan; runs once at startup before the scheduler (nothing is scanning at boot, so any unfinished scan is dead). Recovers existing stranded items and guards against future mid-scan restarts.
- run_scan: finalise the DB scan on the abort early-return too, so a stopped scan's items stay visible without waiting for a restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
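The failure mode described above can be reproduced in a few lines. This is a minimal sketch, assuming a simplified two-table schema: the real `gdpr_db.py` schema is not shown in this commit, and `visible_items` and `finalize_orphans` are hypothetical helper names. It also uses a single `UPDATE` in place of the per-scan `finish_scan` loop that the commit actually adds.

```python
import sqlite3
import time

# Hypothetical, simplified schema (an assumption; not the real gdpr_db.py schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scans (
        id INTEGER PRIMARY KEY,
        started_at REAL NOT NULL,
        finished_at REAL,
        total_scanned INTEGER
    );
    CREATE TABLE flagged_items (
        id TEXT NOT NULL,
        scan_id INTEGER NOT NULL,
        PRIMARY KEY (id, scan_id)
    );
""")


def visible_items(c):
    """Return item ids whose owning scan is finalised (the finished_at filter)."""
    rows = c.execute(
        "SELECT f.id FROM flagged_items f "
        "JOIN scans s ON s.id = f.scan_id "
        "WHERE s.finished_at IS NOT NULL ORDER BY f.id"
    ).fetchall()
    return [r[0] for r in rows]


def finalize_orphans(c):
    """Stamp finished_at on every unfinished scan; return how many were stamped."""
    cur = c.execute(
        "UPDATE scans SET finished_at = ?, "
        "total_scanned = COALESCE(total_scanned, 0) "
        "WHERE finished_at IS NULL",
        (time.time(),),
    )
    c.commit()
    return cur.rowcount


# Scan 1 finished normally; scan 2 was killed mid-scan and never finalised.
now = time.time()
conn.execute("INSERT INTO scans VALUES (1, ?, ?, 1)", (now - 60, now - 30))
conn.execute("INSERT INTO scans VALUES (2, ?, NULL, NULL)", (now,))
conn.executemany(
    "INSERT INTO flagged_items VALUES (?, ?)",
    [("done-1", 1), ("orphan-1", 2)],
)
conn.commit()

before = visible_items(conn)          # only the finished scan's item
recovered = finalize_orphans(conn)    # finalises the stranded scan
after = visible_items(conn)           # the stranded item is visible again
second_pass = finalize_orphans(conn)  # nothing left to recover
```

Running the recovery a second time finds nothing to finalise, which is the same idempotence property the new tests check.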
2026-06-22 09:51:22 +02:00 · 2026-06-22 09:51:22 +02:00 · 29d9168643
commit 29d9168643
parent 7bf589bf7a
5 changed files with 100 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -93,6 +93,7 @@ All options live in the profile `options` dict and apply to **all three scan eng
 - **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
 - **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
 - **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
+- **Scans must be finalised or their items are invisible** — `get_session_items`, `get_open_items`, and `latest_scan_id` all filter on `finished_at IS NOT NULL`. The file scan finalises in a `finally`; M365 (`run_scan`) and Google (`_run_google_scan`) `return` early on abort, so each now calls `finish_scan` before that abort-return. A process kill (deploy/OOM/crash) mid-scan still strands a scan → **`finalize_orphan_scans()`** runs once at server startup (`gdpr_scanner.py` `__main__`, before the scheduler) and finalises every `finished_at IS NULL` scan (safe because nothing is scanning at boot). Do not add a scan-results query that ignores `finished_at` instead of fixing finalisation.
 - **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
 - **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
 - See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
--- a/gdpr_db.py
+++ b/gdpr_db.py
@@ -29,11 +29,14 @@ Usage (from gdpr_scanner.py)

 import hashlib
 import json
+import logging
 import sqlite3
 import time
 from pathlib import Path
 from typing import Iterator

+logger = logging.getLogger(__name__)
+
 from pathlib import Path as _P
 _DATA_DIR = _P.home() / ".gdprscanner"
 _DATA_DIR.mkdir(exist_ok=True)
@@ -432,6 +435,33 @@ class ScanDB:

        c.commit()

+    def finalize_orphan_scans(self) -> int:
+        """Finalise scans left unfinished by a crash, kill, or mid-scan restart.
+
+        After a fresh process start nothing is scanning, so any scan still
+        carrying finished_at IS NULL is dead — the process that owned it is gone.
+        Its already-saved flagged_items were stranded: both get_session_items
+        and get_open_items require finished_at, so those items are invisible and
+        effectively lost.  Finalising the orphans on startup makes them show up
+        and prevents permanent data loss from interrupted scans (the M365 and
+        Google engines return early on abort and never reach finish_scan; only
+        the file scan finalises in a finally block).
+
+        Safe to call only when no scan is running (i.e. at startup).  Returns the
+        number of scans finalised.
+        """
+        rows = self._connect().execute(
+            "SELECT id, total_scanned FROM scans WHERE finished_at IS NULL"
+        ).fetchall()
+        count = 0
+        for sid, total in rows:
+            try:
+                self.finish_scan(sid, total or 0)
+                count += 1
+            except Exception as e:
+                logger.warning("[db] finalize_orphan_scans: scan %s failed: %s", sid, e)
+        return count
+
    # ── Query helpers ─────────────────────────────────────────────────────────

    def latest_scan_id(self) -> int | None:
--- a/gdpr_scanner.py
+++ b/gdpr_scanner.py
@@ -2305,6 +2305,19 @@ Example --settings file with SMTP:
        print(f"\n  GDPRScanner\n  ──────────────────────────────")
        print(f"  Open: http://{args.host}:{args.port}")

+        # Recover scans left unfinished by a crash / kill / mid-scan restart.
+        # Nothing is scanning at startup, so any scan with finished_at IS NULL is
+        # dead; finalising it makes its already-saved items visible again instead
+        # of stranding them (both get_session_items and get_open_items require a
+        # finished scan). Must run before the scheduler can start a new scan.
+        try:
+            if DB_OK:
+                _recovered = _get_db().finalize_orphan_scans()
+                if _recovered:
+                    print(f"  Recovered {_recovered} unfinished scan(s) from a prior restart")
+        except Exception as _orphan_err:
+            print(f"  Orphan-scan recovery: failed ({_orphan_err})")
+
        # Start in-process scheduler (#19)
        try:
            import scan_scheduler as _sched_mod
--- a/scan_engine.py
+++ b/scan_engine.py
@@ -1078,6 +1078,14 @@ def run_scan(options: dict):
        if _check_abort():
            # Save checkpoint so scan can be resumed later
            _save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
+            # Finalise the DB scan record so items found before the stop stay
+            # visible — this early return otherwise skips finish_scan below,
+            # stranding them (invisible to get_session_items / get_open_items).
+            if _db and _db_scan_id:
+                try:
+                    _db.finish_scan(_db_scan_id, resumed_count + idx + 1)
+                except Exception as _e:
+                    logger.error("[db] finish_scan (aborted) failed: %s", _e)
            return
        idx += 1
        kind, meta, _ = _work_q.popleft()  # releases this item from the deque immediately
--- a/tests/test_db.py
+++ b/tests/test_db.py
@@ -265,3 +265,51 @@ class TestExportImport:
        tgt.import_db(str(export_path), mode="replace")
        results = tgt.lookup_data_subject("290472-1234")
        assert len(results) >= 1
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Orphan-scan recovery (crash / kill / mid-scan restart)
+# ─────────────────────────────────────────────────────────────────────────────
+
+class TestOrphanScanRecovery:
+
+    def _start_unfinished_scan(self, db, item_id):
+        """Begin a scan and save an item but never call finish_scan."""
+        sid = db.begin_scan({"sources": ["email"], "user_ids": []})
+        db.save_item(sid, _make_card(item_id=item_id))
+        return sid
+
+    def test_unfinished_scan_items_hidden_until_recovery(self, tmp_db):
+        self._start_unfinished_scan(tmp_db, "orphan-1")
+        # Not finalised → invisible to the open-items view
+        assert tmp_db.get_open_items() == []
+
+    def test_recovery_finalises_and_reveals_items(self, tmp_db):
+        self._start_unfinished_scan(tmp_db, "orphan-1")
+        self._start_unfinished_scan(tmp_db, "orphan-2")
+
+        recovered = tmp_db.finalize_orphan_scans()
+        assert recovered == 2
+
+        ids = {row["id"] for row in tmp_db.get_open_items()}
+        assert ids == {"orphan-1", "orphan-2"}
+
+    def test_recovery_leaves_finished_scans_untouched(self, tmp_db):
+        sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
+        tmp_db.save_item(sid, _make_card(item_id="done-1"))
+        tmp_db.finish_scan(sid, total_scanned=1)
+        before = tmp_db._connect().execute(
+            "SELECT finished_at FROM scans WHERE id=?", (sid,)
+        ).fetchone()[0]
+
+        assert tmp_db.finalize_orphan_scans() == 0  # nothing to recover
+
+        after = tmp_db._connect().execute(
+            "SELECT finished_at FROM scans WHERE id=?", (sid,)
+        ).fetchone()[0]
+        assert after == before  # finished_at not rewritten
+
+    def test_recovery_is_idempotent(self, tmp_db):
+        self._start_unfinished_scan(tmp_db, "orphan-1")
+        assert tmp_db.finalize_orphan_scans() == 1
+        assert tmp_db.finalize_orphan_scans() == 0
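
The CLAUDE.md note in this commit contrasts the file scan, which finalises in a `finally`, with the engines that `return` early on abort. The sketch below illustrates why a `finally` block covers the abort path. `FakeDB`, `run_scan_sketch`, and `abort_at` are hypothetical stand-ins and not the real engine code. A `finally` does not run when the process is killed outright, which is the case the startup recovery handles.

```python
class FakeDB:
    """Hypothetical stand-in for the scan database; records finalised scans."""

    def __init__(self):
        self.finished = {}

    def finish_scan(self, scan_id, total_scanned):
        # Remember how many items had been scanned when the scan was finalised.
        self.finished[scan_id] = total_scanned


def run_scan_sketch(db, scan_id, items, abort_at=None):
    """Process items in order; stop early when index abort_at is reached."""
    scanned = 0
    try:
        for idx, _item in enumerate(items):
            if abort_at is not None and idx == abort_at:
                return "aborted"  # early return still passes through finally
            scanned += 1
        return "completed"
    finally:
        # Runs on normal completion, early return, and raised exceptions alike.
        db.finish_scan(scan_id, scanned)


db = FakeDB()
status_full = run_scan_sketch(db, 1, ["a", "b", "c"])
status_abort = run_scan_sketch(db, 2, ["a", "b", "c"], abort_at=2)
```

Both scans end up finalised here, the aborted one with the count of items processed before the stop. The commit itself takes the more local route of calling `finish_scan` right before the abort `return`.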