Recover unfinished scans so their items aren't stranded
get_session_items / get_open_items / latest_scan_id all require finished_at IS NOT NULL, but the M365 and Google engines return early on abort (skipping finish_scan) and a process kill mid-scan (deploy, OOM, crash) never reaches it either. Result on prod: 41/42 scans had finished_at NULL, so 291 already-saved flagged items were invisible — the grid showed nothing. - finalize_orphan_scans(): finalises every finished_at-NULL scan; runs once at startup before the scheduler (nothing is scanning at boot, so any unfinished scan is dead). Recovers existing stranded items and guards against future mid-scan restarts. - run_scan: finalise the DB scan on the abort early-return too, so a stopped scan's items stay visible without waiting for a restart. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
7bf589bf7a
commit
29d9168643
@ -93,6 +93,7 @@ All options live in the profile `options` dict and apply to **all three scan eng
|
|||||||
- **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
|
- **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
|
||||||
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
|
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
|
||||||
- **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
|
- **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
|
||||||
|
- **Scans must be finalised or their items are invisible** — `get_session_items`, `get_open_items`, and `latest_scan_id` all filter on `finished_at IS NOT NULL`. The file scan finalises in a `finally`; M365 (`run_scan`) and Google (`_run_google_scan`) `return` early on abort, so each now calls `finish_scan` before that abort-return. A process kill (deploy/OOM/crash) mid-scan still strands a scan → **`finalize_orphan_scans()`** runs once at server startup (`gdpr_scanner.py` `__main__`, before the scheduler) and finalises every `finished_at IS NULL` scan (safe because nothing is scanning at boot). Do not add a scan-results query that ignores `finished_at` instead of fixing finalisation.
|
||||||
- **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
|
- **`get_open_items()`** — returns every flagged item with **no action taken**, across **all** scans (not just the latest session window). "Open" = no `dispositions` row, or one whose `status='unreviewed'`. Because `flagged_items` PK is `(id, scan_id)`, the same item recurs per scan; the query dedupes by `id`, keeping the row from the highest finished `scan_id`. This powers the **default landing view** so items don't drop out of sight once a newer scan opens a fresh session.
|
||||||
- **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
|
- **`GET /api/db/flagged`** — **with `?ref=N`** → `get_session_items(ref_scan_id=N)` (history mode); **without ref** → `get_open_items()` (default + viewer). Viewer scope enforcement applies to both. Do not change the no-ref `get_session_items()` default elsewhere (`export.py`, `scan_scheduler.py` still rely on latest-session for the current scan's report/email).
|
||||||
- See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
|
- See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
|
||||||
|
|||||||
30
gdpr_db.py
30
gdpr_db.py
@ -29,11 +29,14 @@ Usage (from gdpr_scanner.py)
|
|||||||
|
|
||||||
import hashlib
|
import hashlib
|
||||||
import json
|
import json
|
||||||
|
import logging
|
||||||
import sqlite3
|
import sqlite3
|
||||||
import time
|
import time
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Iterator
|
from typing import Iterator
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
from pathlib import Path as _P
|
from pathlib import Path as _P
|
||||||
_DATA_DIR = _P.home() / ".gdprscanner"
|
_DATA_DIR = _P.home() / ".gdprscanner"
|
||||||
_DATA_DIR.mkdir(exist_ok=True)
|
_DATA_DIR.mkdir(exist_ok=True)
|
||||||
@ -432,6 +435,33 @@ class ScanDB:
|
|||||||
|
|
||||||
c.commit()
|
c.commit()
|
||||||
|
|
||||||
|
def finalize_orphan_scans(self) -> int:
|
||||||
|
"""Finalise scans left unfinished by a crash, kill, or mid-scan restart.
|
||||||
|
|
||||||
|
After a fresh process start nothing is scanning, so any scan still
|
||||||
|
carrying finished_at IS NULL is dead — the process that owned it is gone.
|
||||||
|
Its already-saved flagged_items were stranded: both get_session_items
|
||||||
|
and get_open_items require finished_at, so those items are invisible and
|
||||||
|
effectively lost. Finalising the orphans on startup makes them show up
|
||||||
|
and prevents permanent data loss from interrupted scans (the M365 and
|
||||||
|
Google engines return early on abort and never reach finish_scan; only
|
||||||
|
the file scan finalises in a finally block).
|
||||||
|
|
||||||
|
Safe to call only when no scan is running (i.e. at startup). Returns the
|
||||||
|
number of scans finalised.
|
||||||
|
"""
|
||||||
|
rows = self._connect().execute(
|
||||||
|
"SELECT id, total_scanned FROM scans WHERE finished_at IS NULL"
|
||||||
|
).fetchall()
|
||||||
|
count = 0
|
||||||
|
for sid, total in rows:
|
||||||
|
try:
|
||||||
|
self.finish_scan(sid, total or 0)
|
||||||
|
count += 1
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("[db] finalize_orphan_scans: scan %s failed: %s", sid, e)
|
||||||
|
return count
|
||||||
|
|
||||||
# ── Query helpers ─────────────────────────────────────────────────────────
|
# ── Query helpers ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
def latest_scan_id(self) -> int | None:
|
def latest_scan_id(self) -> int | None:
|
||||||
|
|||||||
@ -2305,6 +2305,19 @@ Example --settings file with SMTP:
|
|||||||
print(f"\n GDPRScanner\n ──────────────────────────────")
|
print(f"\n GDPRScanner\n ──────────────────────────────")
|
||||||
print(f" Open: http://{args.host}:{args.port}")
|
print(f" Open: http://{args.host}:{args.port}")
|
||||||
|
|
||||||
|
# Recover scans left unfinished by a crash / kill / mid-scan restart.
|
||||||
|
# Nothing is scanning at startup, so any scan with finished_at IS NULL is
|
||||||
|
# dead; finalising it makes its already-saved items visible again instead
|
||||||
|
# of stranding them (both get_session_items and get_open_items require a
|
||||||
|
# finished scan). Must run before the scheduler can start a new scan.
|
||||||
|
try:
|
||||||
|
if DB_OK:
|
||||||
|
_recovered = _get_db().finalize_orphan_scans()
|
||||||
|
if _recovered:
|
||||||
|
print(f" Recovered {_recovered} unfinished scan(s) from a prior restart")
|
||||||
|
except Exception as _orphan_err:
|
||||||
|
print(f" Orphan-scan recovery: failed ({_orphan_err})")
|
||||||
|
|
||||||
# Start in-process scheduler (#19)
|
# Start in-process scheduler (#19)
|
||||||
try:
|
try:
|
||||||
import scan_scheduler as _sched_mod
|
import scan_scheduler as _sched_mod
|
||||||
|
|||||||
@ -1078,6 +1078,14 @@ def run_scan(options: dict):
|
|||||||
if _check_abort():
|
if _check_abort():
|
||||||
# Save checkpoint so scan can be resumed later
|
# Save checkpoint so scan can be resumed later
|
||||||
_save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
|
_save_checkpoint(ck_key, scanned_ids, _state.flagged_items, _state.scan_meta)
|
||||||
|
# Finalise the DB scan record so items found before the stop stay
|
||||||
|
# visible — this early return otherwise skips finish_scan below,
|
||||||
|
# stranding them (invisible to get_session_items / get_open_items).
|
||||||
|
if _db and _db_scan_id:
|
||||||
|
try:
|
||||||
|
_db.finish_scan(_db_scan_id, resumed_count + idx + 1)
|
||||||
|
except Exception as _e:
|
||||||
|
logger.error("[db] finish_scan (aborted) failed: %s", _e)
|
||||||
return
|
return
|
||||||
idx += 1
|
idx += 1
|
||||||
kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately
|
kind, meta, _ = _work_q.popleft() # releases this item from the deque immediately
|
||||||
|
|||||||
@ -265,3 +265,51 @@ class TestExportImport:
|
|||||||
tgt.import_db(str(export_path), mode="replace")
|
tgt.import_db(str(export_path), mode="replace")
|
||||||
results = tgt.lookup_data_subject("290472-1234")
|
results = tgt.lookup_data_subject("290472-1234")
|
||||||
assert len(results) >= 1
|
assert len(results) >= 1
|
||||||
|
|
||||||
|
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Orphan-scan recovery (crash / kill / mid-scan restart)
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
class TestOrphanScanRecovery:
|
||||||
|
|
||||||
|
def _start_unfinished_scan(self, db, item_id):
|
||||||
|
"""Begin a scan and save an item but never call finish_scan."""
|
||||||
|
sid = db.begin_scan({"sources": ["email"], "user_ids": []})
|
||||||
|
db.save_item(sid, _make_card(item_id=item_id))
|
||||||
|
return sid
|
||||||
|
|
||||||
|
def test_unfinished_scan_items_hidden_until_recovery(self, tmp_db):
|
||||||
|
self._start_unfinished_scan(tmp_db, "orphan-1")
|
||||||
|
# Not finalised → invisible to the open-items view
|
||||||
|
assert tmp_db.get_open_items() == []
|
||||||
|
|
||||||
|
def test_recovery_finalises_and_reveals_items(self, tmp_db):
|
||||||
|
self._start_unfinished_scan(tmp_db, "orphan-1")
|
||||||
|
self._start_unfinished_scan(tmp_db, "orphan-2")
|
||||||
|
|
||||||
|
recovered = tmp_db.finalize_orphan_scans()
|
||||||
|
assert recovered == 2
|
||||||
|
|
||||||
|
ids = {row["id"] for row in tmp_db.get_open_items()}
|
||||||
|
assert ids == {"orphan-1", "orphan-2"}
|
||||||
|
|
||||||
|
def test_recovery_leaves_finished_scans_untouched(self, tmp_db):
|
||||||
|
sid = tmp_db.begin_scan({"sources": ["email"], "user_ids": []})
|
||||||
|
tmp_db.save_item(sid, _make_card(item_id="done-1"))
|
||||||
|
tmp_db.finish_scan(sid, total_scanned=1)
|
||||||
|
before = tmp_db._connect().execute(
|
||||||
|
"SELECT finished_at FROM scans WHERE id=?", (sid,)
|
||||||
|
).fetchone()[0]
|
||||||
|
|
||||||
|
assert tmp_db.finalize_orphan_scans() == 0 # nothing to recover
|
||||||
|
|
||||||
|
after = tmp_db._connect().execute(
|
||||||
|
"SELECT finished_at FROM scans WHERE id=?", (sid,)
|
||||||
|
).fetchone()[0]
|
||||||
|
assert after == before # finished_at not rewritten
|
||||||
|
|
||||||
|
def test_recovery_is_idempotent(self, tmp_db):
|
||||||
|
self._start_unfinished_scan(tmp_db, "orphan-1")
|
||||||
|
assert tmp_db.finalize_orphan_scans() == 1
|
||||||
|
assert tmp_db.finalize_orphan_scans() == 0
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user