Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items.

2026-04-25 20:30:59 +02:00 · 8b55e9d933
commit 8b55e9d933
parent 2254e00481
12 changed files with 268 additions and 40 deletions
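
The mechanism described in the commit message can be sketched as a small loop. This is a minimal illustration, not the project's code: `cp_path`, `save_checkpoint`, `scan`, and the `stop_after` parameter are hypothetical names, and only the file-name pattern and the 25-item interval are taken from the commit message.

```python
import json
from pathlib import Path
from typing import Optional

SAVE_EVERY = 25  # interval taken from the commit message


def cp_path(data_dir: Path, prefix: str) -> Path:
    """One checkpoint file per engine, e.g. checkpoint_google.json."""
    return data_dir / f"checkpoint_{prefix}.json"


def save_checkpoint(data_dir: Path, prefix: str, scanned_ids: set) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    path = cp_path(data_dir, prefix)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"scanned_ids": sorted(scanned_ids)}), encoding="utf-8")
    tmp.replace(path)


def scan(items, data_dir: Path, prefix: str, stop_after: Optional[int] = None) -> int:
    """Hypothetical scan loop; returns how many items were newly processed."""
    path = cp_path(data_dir, prefix)
    done = set()
    if path.exists():
        done = set(json.loads(path.read_text(encoding="utf-8"))["scanned_ids"])
    processed = since_save = 0
    for item in items:
        if item in done:
            continue  # already handled by an earlier, interrupted run
        if stop_after is not None and processed >= stop_after:
            return processed  # simulated interruption; checkpoint stays on disk
        done.add(item)
        processed += 1
        since_save += 1
        if since_save >= SAVE_EVERY:
            save_checkpoint(data_dir, prefix, done)
            since_save = 0
    path.unlink(missing_ok=True)  # clean finish: nothing left to resume
    return processed
```

Under these assumptions, a run interrupted after 55 of 60 items leaves a checkpoint covering the first 50, so the next run re-processes at most the items handled since the last save.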
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html

 ### Added

+- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
+
 - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).

 - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -30,7 +30,9 @@ python -m pytest tests/ -q

 **Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.

-**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
+**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
+
+**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`

 ## Non-obvious files

--- a/OSS_LANDSCAPE.md
+++ b/OSS_LANDSCAPE.md
@@ -0,0 +1,67 @@
+# Open Source Landscape — GDPR / PII Document Scanners
+
+An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
+
+---
+
+## Summary
+
+No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
+
+---
+
+## Existing open source tools
+
+### [Microsoft Presidio](https://github.com/microsoft/presidio)
+A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
+
+### [Octopii](https://github.com/redhuntlabs/Octopii)
+Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
+
+### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
+CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
+
+### "GDPR scanners" on GitHub
+Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
+
+### CPR libraries
+Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
+
+---
+
+## Commercial products that do cover it
+
+| Product | M365 | GWS | CPR | Article 30 | Open source |
+|---|---|---|---|---|---|
+| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
+| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
+| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
+| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
+
+PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
+
+---
+
+## Capability comparison
+
+| Capability | GDPRScanner | Presidio | Octopii | Commercial |
+|---|---|---|---|---|
+| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
+| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
+| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
+| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
+| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
+| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
+| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
+| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
+| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
+| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
+| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
+| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
+| Open source | ✅ | ✅ | ✅ | ❌ |
+
+---
+
+## What makes GDPRScanner unique
+
+The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
--- a/TODO.md
+++ b/TODO.md
@@ -119,6 +119,12 @@ Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud s

 ---

+### Checkpoint / resume for Google and File scans ✅
+
+Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
+
+---
+
 ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
 The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.

--- a/checkpoint.py
+++ b/checkpoint.py
@@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)

 _DATA_DIR = Path.home() / ".gdprscanner"
 _DATA_DIR.mkdir(exist_ok=True)
-_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
+
+def _cp_path(prefix: str) -> Path:
+    return _DATA_DIR / f"checkpoint_{prefix}.json"

 def _checkpoint_key(options: dict) -> str:
    """Stable hash of the scan options — used to detect when a checkpoint
@@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
    }, sort_keys=True)
    return hashlib.sha256(sig.encode()).hexdigest()[:16]

-def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
+def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
    """Write checkpoint to disk. Called periodically during scanning."""
    try:
        payload = {
@@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
            "flagged":     flagged,
            "meta":        {k: v for k, v in meta.items() if k != "options"},
        }
-        tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
+        path = _cp_path(prefix)
+        tmp  = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
-        tmp.replace(_CHECKPOINT_PATH)
+        tmp.replace(path)
    except Exception as e:
        logger.error("[checkpoint] save failed: %s", e)

-def _load_checkpoint(key: str) -> dict | None:
+def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
    """Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
    try:
-        if not _CHECKPOINT_PATH.exists():
+        path = _cp_path(prefix)
+        if not path.exists():
            return None
-        payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
+        payload = json.loads(path.read_text(encoding="utf-8"))
        if payload.get("key") != key:
            return None
        return payload
    except Exception:
        return None

-def _clear_checkpoint() -> None:
+def _clear_checkpoint(*, prefix: str = "m365") -> None:
    try:
-        if _CHECKPOINT_PATH.exists():
-            _CHECKPOINT_PATH.unlink()
+        path = _cp_path(prefix)
+        if path.exists():
+            path.unlink()
    except Exception:
        pass
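
The keyword-only `prefix` argument keeps old call sites working while giving each engine its own file. The sketch below is a simplified stand-in for the functions in this diff, assuming a temporary directory in place of `~/.gdprscanner`; the un-underscored names and the reduced payload are illustrative only.

```python
import json
import tempfile
from pathlib import Path

DATA_DIR = Path(tempfile.mkdtemp())  # stand-in for ~/.gdprscanner (assumption)


def cp_path(prefix: str) -> Path:
    return DATA_DIR / f"checkpoint_{prefix}.json"


def save_checkpoint(key, scanned_ids, flagged, *, prefix="m365"):
    """Simplified save: same temp-file-then-rename pattern as the diff."""
    payload = {"key": key, "scanned_ids": sorted(scanned_ids), "flagged": flagged}
    path = cp_path(prefix)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)


def load_checkpoint(key, *, prefix="m365"):
    """Return the payload only when the stored key matches the current scan key."""
    path = cp_path(prefix)
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload if payload.get("key") == key else None


# A call without prefix lands in checkpoint_m365.json, as before the change.
save_checkpoint("abc", {"id1"}, [])
# A Google scan passes its own prefix and gets a separate file.
save_checkpoint("xyz", {"g1", "g2"}, [], prefix="google")
```

Because `prefix` sits after the bare `*`, it cannot be passed positionally by accident, which is presumably why existing M365 callers needed no edits.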

--- a/gdpr_scanner.py
+++ b/gdpr_scanner.py
@@ -251,7 +251,7 @@ from app_config import (
 from checkpoint import (
    _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
    _load_delta_tokens, _save_delta_tokens,
-    _CHECKPOINT_PATH, _DELTA_PATH,
+    _cp_path, _DELTA_PATH,
 )

 from sse import broadcast, _sse_queues, _sse_buffer
@@ -1842,7 +1842,7 @@ Example --settings file with SMTP:
            (_SETTINGS_PATH,                                        "Headless scan settings"),
            (_ROLE_OVERRIDES_PATH,                                  "Manual role overrides"),
            (_FILE_SOURCES_PATH,                                    "File source definitions"),
-            (_CHECKPOINT_PATH,                                      "Scan checkpoint (resume state)"),
+            (_cp_path("m365"),                                      "Scan checkpoint (resume state)"),
            (_DELTA_PATH,                                           "Delta scan tokens"),
            (_LANG_OVERRIDE_FILE,                                   "Language preference"),
            (Path.home() / ".gdprscanner" / "schedule.json",           "Scheduler configuration"),
@@ -1929,10 +1929,12 @@ Example --settings file with SMTP:
            print("  ✖ m365_db not available — cannot reset")
            _sys.exit(1)

-        # Also clear the JSON checkpoint so the UI starts with no cached results
-        _clear_checkpoint()
-        if not _CHECKPOINT_PATH.exists():
-            print(f"  ✔ Checkpoint cleared")
+        # Also clear all checkpoints so the UI starts with no cached results
+        from pathlib import Path as _Path
+        for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
+            try: _cpf.unlink()
+            except Exception: pass
+        print(f"  ✔ Checkpoints cleared")

        # Clear delta tokens too — stale after a full DB reset
        if _DELTA_PATH.exists():
--- a/routes/google_scan.py
+++ b/routes/google_scan.py
@@ -144,7 +144,8 @@ def _run_google_scan(options: dict):
    scan_emails   = bool(scan_opts.get("scan_emails",  False))
    scan_phones   = bool(scan_opts.get("scan_phones",  False))

-    from checkpoint import _load_delta_tokens, _save_delta_tokens
+    from checkpoint import (_load_delta_tokens, _save_delta_tokens,
+                            _save_checkpoint, _load_checkpoint, _clear_checkpoint)
    _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
    _new_drive_tokens:   dict = {}

@@ -195,6 +196,28 @@ def _run_google_scan(options: dict):
        except Exception as e:
            logger.error("[google_scan] begin_scan failed: %s", e)

+    # ── Checkpoint: resume from a previous interrupted Google scan ────────────
+    import hashlib as _hl, json as _js
+    _gck_prefix = "google"
+    _gck_key    = _hl.sha256(_js.dumps({
+        "emails":  sorted(user_emails),
+        "sources": sorted(sources),
+        "older_than_days": scan_opts.get("older_than_days", 0),
+    }, sort_keys=True).encode()).hexdigest()[:16]
+    _gck             = _load_checkpoint(_gck_key, prefix=_gck_prefix)
+    _g_scanned_ids:  set  = set(_gck["scanned_ids"]) if _gck else set()
+    _google_flagged: list = []  # items found by this Google scan (for checkpoint)
+    _gck_resumed = len(_g_scanned_ids)
+    if _gck:
+        from scan_engine import _with_disposition as _wd_ck
+        _google_flagged = list(_gck.get("flagged", []))
+        flagged_items.extend(_google_flagged)
+        broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
+        for _card in _google_flagged:
+            broadcast("scan_file_flagged", _wd_ck(_card, _db))
+    _GCHECKPOINT_SAVE_EVERY = 25
+    _g_items_since_save = 0
+
    total_flagged = 0
    total_scanned = 0
    t_start = _time.monotonic()
@@ -234,6 +257,7 @@ def _run_google_scan(options: dict):
            "exif":             {},
        }
        flagged_items.append(card)
+        _google_flagged.append(card)
        broadcast("scan_file_flagged", _with_disposition(card, _db))
        total_flagged += 1
        if _db and _db_scan_id:
@@ -265,6 +289,10 @@ def _run_google_scan(options: dict):
                ):
                    if _check_abort():
                        return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                    total_scanned += 1
                    broadcast("scan_file", {"file": meta.get("name", "")})
                    broadcast("scan_progress", {
@@ -279,6 +307,7 @@ def _run_google_scan(options: dict):
                        result = _scan_bytes(data, meta.get("name", "msg.txt"))
                    except Exception as e:
                        broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                        continue
                    cprs       = result.get("cprs", [])
                    pii_counts = result.get("pii_counts")
@@ -288,6 +317,11 @@ def _run_google_scan(options: dict):
                        meta["_email_count"] = len(_em)
                        meta["_phone_count"] = len(_ph)
                        _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
            except GoogleError as e:
                broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
            except Exception as e:
@@ -327,6 +361,10 @@ def _run_google_scan(options: dict):
                for meta, data in drive_items:
                    if _check_abort():
                        return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                    total_scanned += 1
                    broadcast("scan_file", {"file": meta.get("name", "")})
                    broadcast("scan_progress", {
@@ -341,6 +379,7 @@ def _run_google_scan(options: dict):
                        result = _scan_bytes(data, meta.get("name", "file"))
                    except Exception as e:
                        broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                        continue
                    cprs       = result.get("cprs", [])
                    pii_counts = result.get("pii_counts")
@@ -350,6 +389,11 @@ def _run_google_scan(options: dict):
                        meta["_email_count"] = len(_em)
                        meta["_phone_count"] = len(_ph)
                        _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
            except GoogleError as e:
                broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
            except Exception as e:
@@ -362,6 +406,10 @@ def _run_google_scan(options: dict):
        except Exception as e:
            logger.warning("[gdrive delta] token save failed: %s", e)

+    from gdpr_scanner import _scan_abort as _gsa
+    if not _gsa.is_set():
+        _clear_checkpoint(prefix=_gck_prefix)
+
    elapsed = _time.monotonic() - t_start
    broadcast("google_scan_done", {
        "flagged_count":   total_flagged,
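
The Google checkpoint key in this file is a truncated SHA-256 over a canonical JSON signature. A standalone sketch of that computation follows; the function name `google_checkpoint_key` and the sample source names are hypothetical, while the hashed fields mirror the ones shown in the diff.

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0) -> str:
    """Stable 16-hex-char key: sorting makes it independent of selection order."""
    sig = json.dumps({
        "emails": sorted(emails),
        "sources": sorted(sources),
        "older_than_days": older_than_days,
    }, sort_keys=True)
    return hashlib.sha256(sig.encode()).hexdigest()[:16]


# Same selection in a different order yields the same key ...
k1 = google_checkpoint_key(["b@example.com", "a@example.com"], ["gmail", "gdrive"])
k2 = google_checkpoint_key(["a@example.com", "b@example.com"], ["gdrive", "gmail"])
# ... while changing any scan option yields a different one.
k3 = google_checkpoint_key(["a@example.com", "b@example.com"], ["gdrive", "gmail"], 30)
```

The same signature is recomputed in `routes/scan.py` from the frontend payload, so the two computations have to stay field-for-field identical or the resume banner will never find the Google checkpoint.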
--- a/routes/scan.py
+++ b/routes/scan.py
@@ -13,7 +13,7 @@ from app_config import (
 )
 from checkpoint import (
    _checkpoint_key, _load_checkpoint, _clear_checkpoint,
-    _load_delta_tokens, _DELTA_PATH,
+    _load_delta_tokens, _DELTA_PATH, _cp_path,
 )

 bp = Blueprint("scan", __name__)
@@ -121,28 +121,80 @@ def scan_stop():
 def scan_checkpoint_info():
    """Return info about any saved checkpoint for the given scan options.
    If check_only=true, just reports whether a scan is currently running."""
+    import hashlib, json as _json
    options = request.get_json() or {}
    if options.get("check_only"):
        acquired = state._scan_lock.acquire(blocking=False)
        if acquired:
            state._scan_lock.release()
        return jsonify({"running": not acquired})
-    key = _checkpoint_key(options)
-    cp  = _load_checkpoint(key)
-    if not cp:
+
+    engines = {}
+
+    # M365
+    if options.get("sources"):
+        key = _checkpoint_key(options)
+        cp  = _load_checkpoint(key, prefix="m365")
+        if cp:
+            engines["m365"] = {
+                "exists":        True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at":    cp.get("meta", {}).get("started_at"),
+            }
+
+    # Google
+    google_emails  = options.get("googleUserEmails", [])
+    google_sources = options.get("googleSources", [])
+    if google_emails and google_sources:
+        gkey = hashlib.sha256(_json.dumps({
+            "emails":  sorted(google_emails),
+            "sources": sorted(google_sources),
+            "older_than_days": options.get("options", {}).get("older_than_days", 0),
+        }, sort_keys=True).encode()).hexdigest()[:16]
+        cp = _load_checkpoint(gkey, prefix="google")
+        if cp:
+            engines["google"] = {
+                "exists":        True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at":    cp.get("meta", {}).get("started_at"),
+            }
+
+    # File sources (one checkpoint per source ID)
+    for src_id in options.get("fileSources", []):
+        fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
+        cp   = _load_checkpoint(fkey, prefix=f"file_{src_id}")
+        if cp:
+            fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
+            fe["scanned_count"] += len(cp.get("scanned_ids", []))
+            fe["flagged_count"]  += len(cp.get("flagged", []))
+            if not fe["started_at"]:
+                fe["started_at"] = cp.get("meta", {}).get("started_at")
+
+    if not engines:
        return jsonify({"exists": False})
+
+    started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
    return jsonify({
        "exists":        True,
-        "scanned_count": len(cp.get("scanned_ids", [])),
-        "flagged_count": len(cp.get("flagged", [])),
-        "started_at":    cp.get("meta", {}).get("started_at"),
+        "scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
+        "flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
+        "started_at":    min(started_ats) if started_ats else None,
+        "engines":       engines,
    })


@bp.route("/api/scan/clear_checkpoint", methods=["POST"])
 def scan_clear_checkpoint():
-    """Discard any saved checkpoint so the next scan starts fresh."""
-    _clear_checkpoint()
+    """Discard all saved checkpoints so the next scan starts fresh."""
+    from pathlib import Path
+    data_dir = Path.home() / ".gdprscanner"
+    for f in data_dir.glob("checkpoint_*.json"):
+        try:
+            f.unlink()
+        except Exception:
+            pass
    return jsonify({"status": "cleared"})
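
The totals that `POST /api/scan/checkpoint` reports are folded from the per-engine entries. A reduced sketch of that aggregation, assuming the `engines` dict has the shape built in the diff; `summarise` is a hypothetical helper name and the numbers are made up for illustration.

```python
def summarise(engines: dict) -> dict:
    """Combine per-engine checkpoint info into one response body."""
    if not engines:
        return {"exists": False}
    started = [e["started_at"] for e in engines.values() if e.get("started_at")]
    return {
        "exists": True,
        "scanned_count": sum(e.get("scanned_count", 0) for e in engines.values()),
        "flagged_count": sum(e.get("flagged_count", 0) for e in engines.values()),
        # The earliest start time across engines is the one shown in the banner.
        "started_at": min(started) if started else None,
        "engines": engines,
    }


example = summarise({
    "m365":   {"exists": True, "scanned_count": 50, "flagged_count": 3, "started_at": 1000.0},
    "google": {"exists": True, "scanned_count": 25, "flagged_count": 1, "started_at": 900.0},
})
```

Keeping the flat `scanned_count` / `flagged_count` / `started_at` fields alongside `engines` means a client that only reads the old top-level fields keeps working.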


--- a/scan_engine.py
+++ b/scan_engine.py
@@ -125,8 +125,8 @@ def _html_esc(s): return str(s)  # type: ignore[misc]
 # checkpoint helpers — injected by gdpr_scanner.py
 def _checkpoint_key(opts): return ""  # type: ignore[misc]
 def _save_checkpoint(*a, **kw): pass  # type: ignore[misc]
-def _load_checkpoint(key): return None  # type: ignore[misc]
-def _clear_checkpoint(): pass  # type: ignore[misc]
+def _load_checkpoint(key, **kw): return None  # type: ignore[misc]
+def _clear_checkpoint(**kw): pass  # type: ignore[misc]
 def _load_delta_tokens(): return {}  # type: ignore[misc]
 def _save_delta_tokens(t): pass  # type: ignore[misc]

@@ -209,6 +209,23 @@ def run_file_scan(source: dict):
        except Exception as e:
            logger.error("[db] start_scan failed: %s", e)

+    # \u2500\u2500 Checkpoint: resume from a previous interrupted file scan \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500
+    _ck_prefix = f"file_{source.get('id', 'local')}"
+    _ck_key    = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
+    _ck        = _load_checkpoint(_ck_key, prefix=_ck_prefix)
+    _file_scanned_ids: set  = set(_ck["scanned_ids"]) if _ck else set()
+    _file_flagged:     list = []  # items found by this file scan run (for checkpoint)
+    _ck_resumed = len(_file_scanned_ids)
+    if _ck:
+        _file_flagged = list(_ck.get("flagged", []))
+        for card in _file_flagged:
+            _state.flagged_items.append(card)
+        broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
+        for card in _file_flagged:
+            broadcast("scan_file_flagged", _with_disposition(card, _db))
+    _CHECKPOINT_SAVE_EVERY_FILE = 25
+    _file_items_since_save = 0
+
    total_scanned = 0
    total_flagged = 0

@@ -247,6 +264,10 @@ def run_file_scan(source: dict):
            if _state._scan_abort.is_set():
                break

+            if rel_path in _file_scanned_ids:
+                total_scanned += 1
+                continue
+
            total_scanned += 1
            broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})

@@ -353,6 +374,7 @@ def run_file_scan(source: dict):
            }

            _state.flagged_items.append(card)
+            _file_flagged.append(card)
            total_flagged += 1
            broadcast("scan_file_flagged", _with_disposition(card, _db))

@@ -362,10 +384,19 @@ def run_file_scan(source: dict):
                except Exception as e:
                    logger.error("[db] save_item failed: %s", e)

+            _file_scanned_ids.add(rel_path)
+            _file_items_since_save += 1
+            if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
+                _save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
+                _file_items_since_save = 0
+
    except Exception as e:
        import traceback
        broadcast("scan_error", {"file": label, "error": str(e)})
        logger.error("[file_scan] error:\n%s", traceback.format_exc())
+    else:
+        if not _state._scan_abort.is_set():
+            _clear_checkpoint(prefix=_ck_prefix)
    finally:
        if _db and _db_scan_id:
            try:
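
The `else` branch added to `run_file_scan()` relies on Python's `try` / `except` / `else` / `finally` ordering: `else` runs only when the `try` body raised nothing, and the abort flag is checked again inside it. A toy model of that control flow, with a hypothetical `run` function that records events instead of touching checkpoints:

```python
import threading


def run(items, abort: threading.Event, fail_on=None) -> list:
    """Record which branches execute for a clean, failed, or aborted scan."""
    events = []
    try:
        for item in items:
            if abort.is_set():
                break  # leaving the loop is not an exception, so `else` still runs
            if item == fail_on:
                raise RuntimeError(f"cannot read {item}")
            events.append(f"scanned:{item}")
    except Exception:
        events.append("error")  # checkpoint is kept for a later resume
    else:
        if not abort.is_set():
            events.append("checkpoint_cleared")  # only after a clean, complete run
    finally:
        events.append("finalised")  # always runs, e.g. to close the DB scan row
    return events


clean = run(["a", "b"], threading.Event())
failed = run(["a", "b"], threading.Event(), fail_on="b")
stopped = threading.Event()
stopped.set()
aborted = run(["a", "b"], stopped)
```

In this model the checkpoint survives both an exception and a user-initiated stop, which matches the behaviour the changelog describes.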
--- a/static/js/scan.js
+++ b/static/js/scan.js
@@ -136,26 +136,39 @@ function buildScanPayload() {
  return { sources, fileSources, allSources, googleSources, user_ids, options };
 }

-async function checkCheckpoint() {
+async function checkCheckpoint(onNoCheckpoint) {
  const payload = buildScanPayload();
-  if (!payload.sources.length && !payload.fileSources.length) return;
-  if (payload.sources.length && !payload.user_ids.length) return;
+  const banner  = document.getElementById('resumeBanner');
+  const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
+  if (!hasSources) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // M365 sources without users — scan button will handle the alert
+  if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // Collect Google user emails for server-side checkpoint key computation
+  const googleUserEmails = payload.googleSources.length > 0
+    ? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
+    : [];
  try {
    const r = await fetch('/api/scan/checkpoint', {
      method: 'POST', headers: {'Content-Type':'application/json'},
-      body: JSON.stringify(payload)
+      body: JSON.stringify({...payload, googleUserEmails})
    });
    const d = await r.json();
-    const banner = document.getElementById('resumeBanner');
    if (d.exists) {
      const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
      document.getElementById('resumeBannerText').textContent =
        t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
-      banner.style.display = 'flex';
+      if (banner) banner.style.display = 'flex';
    } else {
-      banner.style.display = 'none';
+      if (banner) banner.style.display = 'none';
+      onNoCheckpoint?.();
    }
-  } catch(e) { /* ignore */ }
+  } catch(e) { onNoCheckpoint?.(); }
 }

 async function clearCheckpointAndScan() {
--- a/templates/index.html
+++ b/templates/index.html
@@ -302,7 +302,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
      <!-- Topbar -->
      <div class="topbar">
        <span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
-        <button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
+        <button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
        <button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>

        <!-- Profile selector (15c) -->
--- a/tests/test_checkpoint.py
+++ b/tests/test_checkpoint.py
@@ -22,8 +22,8 @@ import checkpoint
@pytest.fixture(autouse=True)
 def _isolate(tmp_path, monkeypatch):
    """Redirect all disk writes to a temp dir for each test."""
-    monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
-    monkeypatch.setattr(checkpoint, "_DELTA_PATH",      tmp_path / "delta.json")
+    monkeypatch.setattr(checkpoint, "_DATA_DIR",   tmp_path)
+    monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")


 _OPTS = {