diff --git a/CHANGELOG.md b/CHANGELOG.md index 12b0a89..bfc9eca 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html ### Added +- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged. + - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`). - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form. diff --git a/CLAUDE.md b/CLAUDE.md index 8736f8d..b0864f8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -30,7 +30,9 @@ python -m pytest tests/ -q **Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded. -**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json` +**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files. + +**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json` ## Non-obvious files diff --git a/OSS_LANDSCAPE.md b/OSS_LANDSCAPE.md new file mode 100644 index 0000000..496d947 --- /dev/null +++ b/OSS_LANDSCAPE.md @@ -0,0 +1,67 @@ +# Open Source Landscape — GDPR / PII Document Scanners + +An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are. + +--- + +## Summary + +No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS). + +--- + +## Existing open source tools + +### [Microsoft Presidio](https://github.com/microsoft/presidio) +A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars. + +### [Octopii](https://github.com/redhuntlabs/Octopii) +Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI. + +### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher) +CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors. + +### "GDPR scanners" on GitHub +Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem. + +### CPR libraries +Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners. + +--- + +## Commercial products that do cover it + +| Product | M365 | GWS | CPR | Article 30 | Open source | +|---|---|---|---|---|---| +| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ | +| BigID | ✅ | ✅ | ❌ | ❌ | ❌ | +| Varonis | ✅ | partial | ❌ | ❌ | ❌ | +| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ | + +PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise. + +--- + +## Capability comparison + +| Capability | GDPRScanner | Presidio | Octopii | Commercial | +|---|---|---|---|---| +| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ | +| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ | +| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ | +| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ | +| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ | +| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial | +| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial | +| Scheduled scans | ✅ | ❌ | ❌ | ✅ | +| Checkpoint / resume | ✅ | ❌ | ❌ | unknown | +| Read-only viewer / share links | ✅ | ❌ | ❌ | partial | +| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ | +| Danish-language UI | ✅ | ❌ | ❌ | ❌ | +| Open source | ✅ | ✅ | ✅ | ❌ | + +--- + +## What makes GDPRScanner unique + +The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses. diff --git a/TODO.md b/TODO.md index 11c9c42..2a7b2e4 100644 --- a/TODO.md +++ b/TODO.md @@ -119,6 +119,12 @@ Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud s --- +### Checkpoint / resume for Google and File scans ✅ + +Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged. + +--- + ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed. diff --git a/checkpoint.py b/checkpoint.py index 8b9c36d..dc95474 100644 --- a/checkpoint.py +++ b/checkpoint.py @@ -15,7 +15,9 @@ logger = logging.getLogger(__name__) _DATA_DIR = Path.home() / ".gdprscanner" _DATA_DIR.mkdir(exist_ok=True) -_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json" + +def _cp_path(prefix: str) -> Path: + return _DATA_DIR / f"checkpoint_{prefix}.json" def _checkpoint_key(options: dict) -> str: """Stable hash of the scan options — used to detect when a checkpoint @@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str: }, sort_keys=True) return hashlib.sha256(sig.encode()).hexdigest()[:16] -def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None: +def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None: """Write checkpoint to disk. Called periodically during scanning.""" try: payload = { @@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N "flagged": flagged, "meta": {k: v for k, v in meta.items() if k != "options"}, } - tmp = _CHECKPOINT_PATH.with_suffix(".tmp") + path = _cp_path(prefix) + tmp = path.with_suffix(".tmp") tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8") - tmp.replace(_CHECKPOINT_PATH) + tmp.replace(path) except Exception as e: logger.error("[checkpoint] save failed: %s", e) -def _load_checkpoint(key: str) -> dict | None: +def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None: """Load checkpoint if it matches the current scan key. Returns None on mismatch or error.""" try: - if not _CHECKPOINT_PATH.exists(): + path = _cp_path(prefix) + if not path.exists(): return None - payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8")) + payload = json.loads(path.read_text(encoding="utf-8")) if payload.get("key") != key: return None return payload except Exception: return None -def _clear_checkpoint() -> None: +def _clear_checkpoint(*, prefix: str = "m365") -> None: try: - if _CHECKPOINT_PATH.exists(): - _CHECKPOINT_PATH.unlink() + path = _cp_path(prefix) + if path.exists(): + path.unlink() except Exception: pass diff --git a/gdpr_scanner.py b/gdpr_scanner.py index d647a79..df78cc4 100644 --- a/gdpr_scanner.py +++ b/gdpr_scanner.py @@ -251,7 +251,7 @@ from app_config import ( from checkpoint import ( _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint, _load_delta_tokens, _save_delta_tokens, - _CHECKPOINT_PATH, _DELTA_PATH, + _cp_path, _DELTA_PATH, ) from sse import broadcast, _sse_queues, _sse_buffer @@ -1842,7 +1842,7 @@ Example --settings file with SMTP: (_SETTINGS_PATH, "Headless scan settings"), (_ROLE_OVERRIDES_PATH, "Manual role overrides"), (_FILE_SOURCES_PATH, "File source definitions"), - (_CHECKPOINT_PATH, "Scan checkpoint (resume state)"), + (_cp_path("m365"), "Scan checkpoint (resume state)"), (_DELTA_PATH, "Delta scan tokens"), (_LANG_OVERRIDE_FILE, "Language preference"), (Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"), @@ -1929,10 +1929,12 @@ Example --settings file with SMTP: print(" ✖ m365_db not available — cannot reset") _sys.exit(1) - # Also clear the JSON checkpoint so the UI starts with no cached results - _clear_checkpoint() - if not _CHECKPOINT_PATH.exists(): - print(f" ✔ Checkpoint cleared") + # Also clear all checkpoints so the UI starts with no cached results + from pathlib import Path as _Path + for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"): + try: _cpf.unlink() + except Exception: pass + print(f" ✔ Checkpoints cleared") # Clear delta tokens too — stale after a full DB reset if _DELTA_PATH.exists(): diff --git a/routes/google_scan.py b/routes/google_scan.py index 80da589..c48baa6 100644 --- a/routes/google_scan.py +++ b/routes/google_scan.py @@ -144,7 +144,8 @@ def _run_google_scan(options: dict): scan_emails = bool(scan_opts.get("scan_emails", False)) scan_phones = bool(scan_opts.get("scan_phones", False)) - from checkpoint import _load_delta_tokens, _save_delta_tokens + from checkpoint import (_load_delta_tokens, _save_delta_tokens, + _save_checkpoint, _load_checkpoint, _clear_checkpoint) _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {} _new_drive_tokens: dict = {} @@ -195,6 +196,28 @@ def _run_google_scan(options: dict): except Exception as e: logger.error("[google_scan] begin_scan failed: %s", e) + # ── Checkpoint: resume from a previous interrupted Google scan ──────────── + import hashlib as _hl, json as _js + _gck_prefix = "google" + _gck_key = _hl.sha256(_js.dumps({ + "emails": sorted(user_emails), + "sources": sorted(sources), + "older_than_days": scan_opts.get("older_than_days", 0), + }, sort_keys=True).encode()).hexdigest()[:16] + _gck = _load_checkpoint(_gck_key, prefix=_gck_prefix) + _g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set() + _google_flagged: list = [] # items found by this Google scan (for checkpoint) + _gck_resumed = len(_g_scanned_ids) + if _gck: + from scan_engine import _with_disposition as _wd_ck + _google_flagged = list(_gck.get("flagged", [])) + flagged_items.extend(_google_flagged) + broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"}) + for _card in _google_flagged: + broadcast("scan_file_flagged", _wd_ck(_card, _db)) + _GCHECKPOINT_SAVE_EVERY = 25 + _g_items_since_save = 0 + total_flagged = 0 total_scanned = 0 t_start = _time.monotonic() @@ -234,6 +257,7 @@ def _run_google_scan(options: dict): "exif": {}, } flagged_items.append(card) + _google_flagged.append(card) broadcast("scan_file_flagged", _with_disposition(card, _db)) total_flagged += 1 if _db and _db_scan_id: @@ -265,6 +289,10 @@ def _run_google_scan(options: dict): ): if _check_abort(): return + _item_id = meta.get("id", "") + if _item_id in _g_scanned_ids: + total_scanned += 1 + continue total_scanned += 1 broadcast("scan_file", {"file": meta.get("name", "")}) broadcast("scan_progress", { @@ -279,6 +307,7 @@ def _run_google_scan(options: dict): result = _scan_bytes(data, meta.get("name", "msg.txt")) except Exception as e: broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)}) + _g_scanned_ids.add(_item_id) continue cprs = result.get("cprs", []) pii_counts = result.get("pii_counts") @@ -288,6 +317,11 @@ def _run_google_scan(options: dict): meta["_email_count"] = len(_em) meta["_phone_count"] = len(_ph) _broadcast_card(meta, cprs, pii_counts) + _g_scanned_ids.add(_item_id) + _g_items_since_save += 1 + if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY: + _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix) + _g_items_since_save = 0 except GoogleError as e: broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)}) except Exception as e: @@ -327,6 +361,10 @@ def _run_google_scan(options: dict): for meta, data in drive_items: if _check_abort(): return + _item_id = meta.get("id", "") + if _item_id in _g_scanned_ids: + total_scanned += 1 + continue total_scanned += 1 broadcast("scan_file", {"file": meta.get("name", "")}) broadcast("scan_progress", { @@ -341,6 +379,7 @@ def _run_google_scan(options: dict): result = _scan_bytes(data, meta.get("name", "file")) except Exception as e: broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)}) + _g_scanned_ids.add(_item_id) continue cprs = result.get("cprs", []) pii_counts = result.get("pii_counts") @@ -350,6 +389,11 @@ def _run_google_scan(options: dict): meta["_email_count"] = len(_em) meta["_phone_count"] = len(_ph) _broadcast_card(meta, cprs, pii_counts) + _g_scanned_ids.add(_item_id) + _g_items_since_save += 1 + if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY: + _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix) + _g_items_since_save = 0 except GoogleError as e: broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)}) except Exception as e: @@ -362,6 +406,10 @@ def _run_google_scan(options: dict): except Exception as e: logger.warning("[gdrive delta] token save failed: %s", e) + from gdpr_scanner import _scan_abort as _gsa + if not _gsa.is_set(): + _clear_checkpoint(prefix=_gck_prefix) + elapsed = _time.monotonic() - t_start broadcast("google_scan_done", { "flagged_count": total_flagged, diff --git a/routes/scan.py b/routes/scan.py index 2b6c129..a1660c1 100644 --- a/routes/scan.py +++ b/routes/scan.py @@ -13,7 +13,7 @@ from app_config import ( ) from checkpoint import ( _checkpoint_key, _load_checkpoint, _clear_checkpoint, - _load_delta_tokens, _DELTA_PATH, + _load_delta_tokens, _DELTA_PATH, _cp_path, ) bp = Blueprint("scan", __name__) @@ -121,28 +121,80 @@ def scan_stop(): def scan_checkpoint_info(): """Return info about any saved checkpoint for the given scan options. If check_only=true, just reports whether a scan is currently running.""" + import hashlib, json as _json options = request.get_json() or {} if options.get("check_only"): acquired = state._scan_lock.acquire(blocking=False) if acquired: state._scan_lock.release() return jsonify({"running": not acquired}) - key = _checkpoint_key(options) - cp = _load_checkpoint(key) - if not cp: + + engines = {} + + # M365 + if options.get("sources"): + key = _checkpoint_key(options) + cp = _load_checkpoint(key, prefix="m365") + if cp: + engines["m365"] = { + "exists": True, + "scanned_count": len(cp.get("scanned_ids", [])), + "flagged_count": len(cp.get("flagged", [])), + "started_at": cp.get("meta", {}).get("started_at"), + } + + # Google + google_emails = options.get("googleUserEmails", []) + google_sources = options.get("googleSources", []) + if google_emails and google_sources: + gkey = hashlib.sha256(_json.dumps({ + "emails": sorted(google_emails), + "sources": sorted(google_sources), + "older_than_days": options.get("options", {}).get("older_than_days", 0), + }, sort_keys=True).encode()).hexdigest()[:16] + cp = _load_checkpoint(gkey, prefix="google") + if cp: + engines["google"] = { + "exists": True, + "scanned_count": len(cp.get("scanned_ids", [])), + "flagged_count": len(cp.get("flagged", [])), + "started_at": cp.get("meta", {}).get("started_at"), + } + + # File sources (one checkpoint per source ID) + for src_id in options.get("fileSources", []): + fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}}) + cp = _load_checkpoint(fkey, prefix=f"file_{src_id}") + if cp: + fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None}) + fe["scanned_count"] += len(cp.get("scanned_ids", [])) + fe["flagged_count"] += len(cp.get("flagged", [])) + if not fe["started_at"]: + fe["started_at"] = cp.get("meta", {}).get("started_at") + + if not engines: return jsonify({"exists": False}) + + started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")] return jsonify({ "exists": True, - "scanned_count": len(cp.get("scanned_ids", [])), - "flagged_count": len(cp.get("flagged", [])), - "started_at": cp.get("meta", {}).get("started_at"), + "scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()), + "flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()), + "started_at": min(started_ats) if started_ats else None, + "engines": engines, }) @bp.route("/api/scan/clear_checkpoint", methods=["POST"]) def scan_clear_checkpoint(): - """Discard any saved checkpoint so the next scan starts fresh.""" - _clear_checkpoint() + """Discard all saved checkpoints so the next scan starts fresh.""" + from pathlib import Path + data_dir = Path.home() / ".gdprscanner" + for f in data_dir.glob("checkpoint_*.json"): + try: + f.unlink() + except Exception: + pass return jsonify({"status": "cleared"}) diff --git a/scan_engine.py b/scan_engine.py index 080b61f..64b79c3 100644 --- a/scan_engine.py +++ b/scan_engine.py @@ -125,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc] # checkpoint helpers — injected by gdpr_scanner.py def _checkpoint_key(opts): return "" # type: ignore[misc] def _save_checkpoint(*a, **kw): pass # type: ignore[misc] -def _load_checkpoint(key): return None # type: ignore[misc] -def _clear_checkpoint(): pass # type: ignore[misc] +def _load_checkpoint(key, **kw): return None # type: ignore[misc] +def _clear_checkpoint(**kw): pass # type: ignore[misc] def _load_delta_tokens(): return {} # type: ignore[misc] def _save_delta_tokens(t): pass # type: ignore[misc] @@ -209,6 +209,23 @@ def run_file_scan(source: dict): except Exception as e: logger.error("[db] start_scan failed: %s", e) + # \u2500\u2500 Checkpoint: resume from a previous interrupted file scan \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 + _ck_prefix = f"file_{source.get('id', 'local')}" + _ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}}) + _ck = _load_checkpoint(_ck_key, prefix=_ck_prefix) + _file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set() + _file_flagged: list = [] # items found by this file scan run (for checkpoint) + _ck_resumed = len(_file_scanned_ids) + if _ck: + _file_flagged = list(_ck.get("flagged", [])) + for card in _file_flagged: + _state.flagged_items.append(card) + broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")}) + for card in _file_flagged: + broadcast("scan_file_flagged", _with_disposition(card, _db)) + _CHECKPOINT_SAVE_EVERY_FILE = 25 + _file_items_since_save = 0 + total_scanned = 0 total_flagged = 0 @@ -247,6 +264,10 @@ def run_file_scan(source: dict): if _state._scan_abort.is_set(): break + if rel_path in _file_scanned_ids: + total_scanned += 1 + continue + total_scanned += 1 broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"}) @@ -353,6 +374,7 @@ def run_file_scan(source: dict): } _state.flagged_items.append(card) + _file_flagged.append(card) total_flagged += 1 broadcast("scan_file_flagged", _with_disposition(card, _db)) @@ -362,10 +384,19 @@ def run_file_scan(source: dict): except Exception as e: logger.error("[db] save_item failed: %s", e) + _file_scanned_ids.add(rel_path) + _file_items_since_save += 1 + if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE: + _save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix) + _file_items_since_save = 0 + except Exception as e: import traceback broadcast("scan_error", {"file": label, "error": str(e)}) logger.error("[file_scan] error:\n%s", traceback.format_exc()) + else: + if not _state._scan_abort.is_set(): + _clear_checkpoint(prefix=_ck_prefix) finally: if _db and _db_scan_id: try: diff --git a/static/js/scan.js b/static/js/scan.js index 092ada3..e3ef247 100644 --- a/static/js/scan.js +++ b/static/js/scan.js @@ -136,26 +136,39 @@ function buildScanPayload() { return { sources, fileSources, allSources, googleSources, user_ids, options }; } -async function checkCheckpoint() { +async function checkCheckpoint(onNoCheckpoint) { const payload = buildScanPayload(); - if (!payload.sources.length && !payload.fileSources.length) return; - if (payload.sources.length && !payload.user_ids.length) return; + const banner = document.getElementById('resumeBanner'); + const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0; + if (!hasSources) { + if (banner) banner.style.display = 'none'; + onNoCheckpoint?.(); return; + } + // M365 sources without users — scan button will handle the alert + if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) { + if (banner) banner.style.display = 'none'; + onNoCheckpoint?.(); return; + } + // Collect Google user emails for server-side checkpoint key computation + const googleUserEmails = payload.googleSources.length > 0 + ? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean) + : []; try { const r = await fetch('/api/scan/checkpoint', { method: 'POST', headers: {'Content-Type':'application/json'}, - body: JSON.stringify(payload) + body: JSON.stringify({...payload, googleUserEmails}) }); const d = await r.json(); - const banner = document.getElementById('resumeBanner'); if (d.exists) { const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : ''; document.getElementById('resumeBannerText').textContent = t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`); - banner.style.display = 'flex'; + if (banner) banner.style.display = 'flex'; } else { - banner.style.display = 'none'; + if (banner) banner.style.display = 'none'; + onNoCheckpoint?.(); } - } catch(e) { /* ignore */ } + } catch(e) { onNoCheckpoint?.(); } } async function clearCheckpointAndScan() { diff --git a/templates/index.html b/templates/index.html index 87fa578..c88c908 100644 --- a/templates/index.html +++ b/templates/index.html @@ -302,7 +302,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);