Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items.

StyxX65 2026-04-25 20:30:59 +02:00
parent 2254e00481
commit 8b55e9d933
12 changed files with 268 additions and 40 deletions


@@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 ### Added
+- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
 - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
 - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
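
The email and Danish phone detection described in the changelog entry above could look roughly like the sketch below. The patterns and the helper name `count_contacts` are assumptions made for illustration; the scanner's actual regexes are not part of this diff.

```python
import re

# Hypothetical patterns, written to match the changelog description only.
# Email: a structurally valid local@domain.tld address.
EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b"
)
# Phone: 8 digits, optionally grouped in pairs by spaces, with an optional
# +45 / 0045 prefix. The lookarounds keep longer digit runs (such as a
# 10-digit CPR number) from being counted as a phone number.
PHONE_RE = re.compile(
    r"(?<![\d+])(?:(?:\+45|0045)[ ]?)?(\d{2}(?:[ ]?\d{2}){3})(?!\d)"
)


def count_contacts(text: str) -> dict:
    """Count distinct email addresses and Danish phone numbers in text."""
    # Deduplicate before counting: emails case-insensitively,
    # phones by their bare 8 digits (prefix and spacing ignored).
    emails = {m.group(0).lower() for m in EMAIL_RE.finditer(text)}
    phones = {re.sub(r"\D", "", m.group(1)) for m in PHONE_RE.finditer(text)}
    return {"email_count": len(emails), "phone_count": len(phones)}
```

Normalising before deduplication means `12 34 56 78` and `+45 12345678` count as one number, which is one plausible reading of "deduplicated before counting".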


@@ -30,7 +30,9 @@ python -m pytest tests/ -q
 **Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
-**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
+**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
+**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
 ## Non-obvious files

OSS_LANDSCAPE.md (new file, 67 lines)

@@ -0,0 +1,67 @@
+# Open Source Landscape — GDPR / PII Document Scanners
+
+An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
+
+---
+
+## Summary
+
+No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
+
+---
+
+## Existing open source tools
+
+### [Microsoft Presidio](https://github.com/microsoft/presidio)
+A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
+
+### [Octopii](https://github.com/redhuntlabs/Octopii)
+Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
+
+### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
+CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
+
+### "GDPR scanners" on GitHub
+Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
+
+### CPR libraries
+Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
+
+---
+
+## Commercial products that do cover it
+
+| Product | M365 | GWS | CPR | Article 30 | Open source |
+|---|---|---|---|---|---|
+| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
+| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
+| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
+| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
+
+PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
+
+---
+
+## Capability comparison
+
+| Capability | GDPRScanner | Presidio | Octopii | Commercial |
+|---|---|---|---|---|
+| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
+| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
+| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
+| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
+| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
+| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
+| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
+| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
+| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
+| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
+| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
+| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
+| Open source | ✅ | ✅ | ✅ | ❌ |
+
+---
+
+## What makes GDPRScanner unique
+
+The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
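
For readers unfamiliar with the "modulus-11 validation, date sanity checks" mentioned above, here is a minimal sketch of what such a check involves. The function names are hypothetical and this is not GDPRScanner's implementation; the weights are the standard CPR check weights, and the sample number is synthetic.

```python
from datetime import date

# Standard modulus-11 weights for the ten CPR digits (DDMMYY-SSSS).
_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def _digits(cpr: str) -> str:
    """Strip the optional hyphen; return '' if the result is not 10 digits."""
    d = cpr.replace("-", "")
    return d if len(d) == 10 and d.isdigit() else ""


def cpr_has_valid_date(cpr: str) -> bool:
    """Date sanity check: DDMMYY must be a real calendar date."""
    d = _digits(cpr)
    if not d:
        return False
    day, month, yy = int(d[0:2]), int(d[2:4]), int(d[4:6])
    # Simplification (assumption): try both centuries rather than
    # decoding the century from the seventh digit.
    for century in (1900, 2000):
        try:
            date(century + yy, month, day)
            return True
        except ValueError:
            continue
    return False


def cpr_passes_mod11(cpr: str) -> bool:
    """Weighted digit sum must be divisible by 11."""
    d = _digits(cpr)
    return bool(d) and sum(int(c) * w for c, w in zip(d, _WEIGHTS)) % 11 == 0


def looks_like_cpr(cpr: str) -> bool:
    return cpr_has_valid_date(cpr) and cpr_passes_mod11(cpr)
```

Combining both checks is what separates a CPR candidate from an arbitrary 10-digit number, which is why a plain regex recogniser tends to produce many false positives.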


@@ -119,6 +119,12 @@ Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud s
 ---
+### Checkpoint / resume for Google and File scans ✅
+Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
+---
 ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
 The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.


@@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)
 _DATA_DIR = Path.home() / ".gdprscanner"
 _DATA_DIR.mkdir(exist_ok=True)
-_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
+
+def _cp_path(prefix: str) -> Path:
+    return _DATA_DIR / f"checkpoint_{prefix}.json"

 def _checkpoint_key(options: dict) -> str:
     """Stable hash of the scan options — used to detect when a checkpoint
@@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
     }, sort_keys=True)
     return hashlib.sha256(sig.encode()).hexdigest()[:16]

-def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
+def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
     """Write checkpoint to disk. Called periodically during scanning."""
     try:
         payload = {
@@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
             "flagged": flagged,
             "meta": {k: v for k, v in meta.items() if k != "options"},
         }
-        tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
+        path = _cp_path(prefix)
+        tmp = path.with_suffix(".tmp")
         tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
-        tmp.replace(_CHECKPOINT_PATH)
+        tmp.replace(path)
     except Exception as e:
         logger.error("[checkpoint] save failed: %s", e)

-def _load_checkpoint(key: str) -> dict | None:
+def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
     """Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
     try:
-        if not _CHECKPOINT_PATH.exists():
+        path = _cp_path(prefix)
+        if not path.exists():
             return None
-        payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
+        payload = json.loads(path.read_text(encoding="utf-8"))
         if payload.get("key") != key:
             return None
         return payload
     except Exception:
         return None

-def _clear_checkpoint() -> None:
+def _clear_checkpoint(*, prefix: str = "m365") -> None:
     try:
-        if _CHECKPOINT_PATH.exists():
-            _CHECKPOINT_PATH.unlink()
+        path = _cp_path(prefix)
+        if path.exists():
+            path.unlink()
     except Exception:
         pass
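
The per-prefix save/load pattern in the hunks above can be tried in isolation with the sketch below. It uses a temporary directory in place of `~/.gdprscanner`, and the un-prefixed function names are stand-ins for this example rather than the module's API.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for ~/.gdprscanner so the example never touches real data.
DATA_DIR = Path(tempfile.mkdtemp())


def cp_path(prefix: str) -> Path:
    return DATA_DIR / f"checkpoint_{prefix}.json"


def save_checkpoint(key: str, scanned_ids: set, flagged: list, *, prefix: str = "m365") -> None:
    path = cp_path(prefix)
    tmp = path.with_suffix(".tmp")
    payload = {"key": key, "scanned_ids": sorted(scanned_ids), "flagged": flagged}
    # Write to a sibling temp file, then rename over the target, so a crash
    # mid-write cannot leave a truncated checkpoint behind.
    tmp.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)


def load_checkpoint(key: str, *, prefix: str = "m365"):
    path = cp_path(prefix)
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    # A checkpoint written for different scan options is ignored.
    return payload if payload.get("key") == key else None


def clear_all_checkpoints() -> int:
    """Delete every checkpoint_*.json file; return how many were removed."""
    removed = 0
    for f in DATA_DIR.glob("checkpoint_*.json"):
        f.unlink()
        removed += 1
    return removed
```

Because each engine has its own file, a Google checkpoint and an M365 checkpoint can coexist, and a single glob is enough to wipe them all.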


@@ -251,7 +251,7 @@ from app_config import (
 from checkpoint import (
     _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
     _load_delta_tokens, _save_delta_tokens,
-    _CHECKPOINT_PATH, _DELTA_PATH,
+    _cp_path, _DELTA_PATH,
 )
 from sse import broadcast, _sse_queues, _sse_buffer
@@ -1842,7 +1842,7 @@ Example --settings file with SMTP:
         (_SETTINGS_PATH, "Headless scan settings"),
         (_ROLE_OVERRIDES_PATH, "Manual role overrides"),
         (_FILE_SOURCES_PATH, "File source definitions"),
-        (_CHECKPOINT_PATH, "Scan checkpoint (resume state)"),
+        (_cp_path("m365"), "Scan checkpoint (resume state)"),
         (_DELTA_PATH, "Delta scan tokens"),
         (_LANG_OVERRIDE_FILE, "Language preference"),
         (Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@@ -1929,10 +1929,12 @@ Example --settings file with SMTP:
             print(" ✖ m365_db not available — cannot reset")
             _sys.exit(1)

-        # Also clear the JSON checkpoint so the UI starts with no cached results
-        _clear_checkpoint()
-        if not _CHECKPOINT_PATH.exists():
-            print(f" ✔ Checkpoint cleared")
+        # Also clear all checkpoints so the UI starts with no cached results
+        from pathlib import Path as _Path
+        for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
+            try: _cpf.unlink()
+            except Exception: pass
+        print(f" ✔ Checkpoints cleared")

         # Clear delta tokens too — stale after a full DB reset
         if _DELTA_PATH.exists():


@@ -144,7 +144,8 @@ def _run_google_scan(options: dict):
     scan_emails = bool(scan_opts.get("scan_emails", False))
     scan_phones = bool(scan_opts.get("scan_phones", False))

-    from checkpoint import _load_delta_tokens, _save_delta_tokens
+    from checkpoint import (_load_delta_tokens, _save_delta_tokens,
+                            _save_checkpoint, _load_checkpoint, _clear_checkpoint)
     _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
     _new_drive_tokens: dict = {}
@@ -195,6 +196,28 @@ def _run_google_scan(options: dict):
         except Exception as e:
             logger.error("[google_scan] begin_scan failed: %s", e)

+    # ── Checkpoint: resume from a previous interrupted Google scan ────────────
+    import hashlib as _hl, json as _js
+    _gck_prefix = "google"
+    _gck_key = _hl.sha256(_js.dumps({
+        "emails": sorted(user_emails),
+        "sources": sorted(sources),
+        "older_than_days": scan_opts.get("older_than_days", 0),
+    }, sort_keys=True).encode()).hexdigest()[:16]
+    _gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
+    _g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
+    _google_flagged: list = []  # items found by this Google scan (for checkpoint)
+    _gck_resumed = len(_g_scanned_ids)
+    if _gck:
+        from scan_engine import _with_disposition as _wd_ck
+        _google_flagged = list(_gck.get("flagged", []))
+        flagged_items.extend(_google_flagged)
+        broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
+        for _card in _google_flagged:
+            broadcast("scan_file_flagged", _wd_ck(_card, _db))
+    _GCHECKPOINT_SAVE_EVERY = 25
+    _g_items_since_save = 0
+
     total_flagged = 0
     total_scanned = 0
     t_start = _time.monotonic()
@@ -234,6 +257,7 @@ def _run_google_scan(options: dict):
             "exif": {},
         }
         flagged_items.append(card)
+        _google_flagged.append(card)
         broadcast("scan_file_flagged", _with_disposition(card, _db))
         total_flagged += 1
         if _db and _db_scan_id:
@@ -265,6 +289,10 @@ def _run_google_scan(options: dict):
                 ):
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -279,6 +307,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "msg.txt"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -288,6 +317,11 @@ def _run_google_scan(options: dict):
                     meta["_email_count"] = len(_em)
                     meta["_phone_count"] = len(_ph)
                     _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
             except Exception as e:
@@ -327,6 +361,10 @@ def _run_google_scan(options: dict):
                 for meta, data in drive_items:
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -341,6 +379,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "file"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -350,6 +389,11 @@ def _run_google_scan(options: dict):
                     meta["_email_count"] = len(_em)
                     meta["_phone_count"] = len(_ph)
                     _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
             except Exception as e:
@@ -362,6 +406,10 @@ def _run_google_scan(options: dict):
         except Exception as e:
             logger.warning("[gdrive delta] token save failed: %s", e)

+    from gdpr_scanner import _scan_abort as _gsa
+    if not _gsa.is_set():
+        _clear_checkpoint(prefix=_gck_prefix)
+
     elapsed = _time.monotonic() - t_start
     broadcast("google_scan_done", {
         "flagged_count": total_flagged,

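
The skip-and-save loop that the Google scan hunks add in two places (Gmail and Drive) reduces to the pattern sketched below. `scan_with_resume` and its `save` callback are hypothetical names for this illustration; the real code inlines the logic and calls `_save_checkpoint` directly.

```python
from typing import Callable, Iterable, List, Set


def scan_with_resume(
    item_ids: Iterable[str],
    scanned_ids: Set[str],
    save: Callable[[Set[str]], None],
    save_every: int = 25,
) -> List[str]:
    """Process items not yet in scanned_ids; checkpoint every save_every items."""
    processed: List[str] = []
    since_save = 0
    for item_id in item_ids:
        if item_id in scanned_ids:
            continue  # already handled before the interruption
        processed.append(item_id)  # stand-in for the real scanning work
        scanned_ids.add(item_id)
        since_save += 1
        if since_save >= save_every:
            save(set(scanned_ids))  # snapshot, so later adds do not mutate it
            since_save = 0
    return processed
```

With a save interval of 25, an interrupted scan repeats at most the 24 items processed since the last save, which is the trade-off between disk writes and lost work.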

@@ -13,7 +13,7 @@ from app_config import (
 )
 from checkpoint import (
     _checkpoint_key, _load_checkpoint, _clear_checkpoint,
-    _load_delta_tokens, _DELTA_PATH,
+    _load_delta_tokens, _DELTA_PATH, _cp_path,
 )

 bp = Blueprint("scan", __name__)
@@ -121,28 +121,80 @@ def scan_stop():
 def scan_checkpoint_info():
     """Return info about any saved checkpoint for the given scan options.
     If check_only=true, just reports whether a scan is currently running."""
+    import hashlib, json as _json
     options = request.get_json() or {}
     if options.get("check_only"):
         acquired = state._scan_lock.acquire(blocking=False)
         if acquired:
             state._scan_lock.release()
         return jsonify({"running": not acquired})
-    key = _checkpoint_key(options)
-    cp = _load_checkpoint(key)
-    if not cp:
+
+    engines = {}
+
+    # M365
+    if options.get("sources"):
+        key = _checkpoint_key(options)
+        cp = _load_checkpoint(key, prefix="m365")
+        if cp:
+            engines["m365"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+
+    # Google
+    google_emails = options.get("googleUserEmails", [])
+    google_sources = options.get("googleSources", [])
+    if google_emails and google_sources:
+        gkey = hashlib.sha256(_json.dumps({
+            "emails": sorted(google_emails),
+            "sources": sorted(google_sources),
+            "older_than_days": options.get("options", {}).get("older_than_days", 0),
+        }, sort_keys=True).encode()).hexdigest()[:16]
+        cp = _load_checkpoint(gkey, prefix="google")
+        if cp:
+            engines["google"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+
+    # File sources (one checkpoint per source ID)
+    for src_id in options.get("fileSources", []):
+        fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
+        cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
+        if cp:
+            fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
+            fe["scanned_count"] += len(cp.get("scanned_ids", []))
+            fe["flagged_count"] += len(cp.get("flagged", []))
+            if not fe["started_at"]:
+                fe["started_at"] = cp.get("meta", {}).get("started_at")
+
+    if not engines:
         return jsonify({"exists": False})
+
+    started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
     return jsonify({
         "exists": True,
-        "scanned_count": len(cp.get("scanned_ids", [])),
-        "flagged_count": len(cp.get("flagged", [])),
-        "started_at": cp.get("meta", {}).get("started_at"),
+        "scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
+        "flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
+        "started_at": min(started_ats) if started_ats else None,
+        "engines": engines,
     })


 @bp.route("/api/scan/clear_checkpoint", methods=["POST"])
 def scan_clear_checkpoint():
-    """Discard any saved checkpoint so the next scan starts fresh."""
-    _clear_checkpoint()
+    """Discard all saved checkpoints so the next scan starts fresh."""
+    from pathlib import Path
+    data_dir = Path.home() / ".gdprscanner"
+    for f in data_dir.glob("checkpoint_*.json"):
+        try:
+            f.unlink()
+        except Exception:
+            pass
     return jsonify({"status": "cleared"})
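
Two ideas carry the route above: the server recomputes the same order-independent key the scan engine used, and the per-engine results are folded into one summary. A minimal sketch of both, with hypothetical function names (`google_scan_key`, `summarise`) and made-up sample values:

```python
import hashlib
import json


def google_scan_key(emails, sources, older_than_days=0) -> str:
    """Order-independent 16-hex-digit key for a Google scan configuration."""
    sig = json.dumps(
        {
            "emails": sorted(emails),
            "sources": sorted(sources),
            "older_than_days": older_than_days,
        },
        sort_keys=True,
    )
    return hashlib.sha256(sig.encode()).hexdigest()[:16]


def summarise(engines: dict) -> dict:
    """Fold per-engine checkpoint info into the combined response shape."""
    if not engines:
        return {"exists": False}
    started = [e["started_at"] for e in engines.values() if e.get("started_at")]
    return {
        "exists": True,
        "scanned_count": sum(e.get("scanned_count", 0) for e in engines.values()),
        "flagged_count": sum(e.get("flagged_count", 0) for e in engines.values()),
        # The earliest start time across engines is reported.
        "started_at": min(started) if started else None,
        "engines": engines,
    }
```

A checkpoint is only found when the route and the engine hash exactly the same fields, so any change to one key recipe has to be mirrored in the other.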


@@ -125,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
 # checkpoint helpers — injected by gdpr_scanner.py
 def _checkpoint_key(opts): return "" # type: ignore[misc]
 def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
-def _load_checkpoint(key): return None # type: ignore[misc]
-def _clear_checkpoint(): pass # type: ignore[misc]
+def _load_checkpoint(key, **kw): return None # type: ignore[misc]
+def _clear_checkpoint(**kw): pass # type: ignore[misc]
 def _load_delta_tokens(): return {} # type: ignore[misc]
 def _save_delta_tokens(t): pass # type: ignore[misc]
@@ -209,6 +209,23 @@ def run_file_scan(source: dict):
         except Exception as e:
             logger.error("[db] start_scan failed: %s", e)

+    # ── Checkpoint: resume from a previous interrupted file scan ─────────────
+    _ck_prefix = f"file_{source.get('id', 'local')}"
+    _ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
+    _ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
+    _file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
+    _file_flagged: list = []  # items found by this file scan run (for checkpoint)
+    _ck_resumed = len(_file_scanned_ids)
+    if _ck:
+        _file_flagged = list(_ck.get("flagged", []))
+        for card in _file_flagged:
+            _state.flagged_items.append(card)
+        broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
+        for card in _file_flagged:
+            broadcast("scan_file_flagged", _with_disposition(card, _db))
+    _CHECKPOINT_SAVE_EVERY_FILE = 25
+    _file_items_since_save = 0
+
     total_scanned = 0
     total_flagged = 0
@@ -247,6 +264,10 @@ def run_file_scan(source: dict):
             if _state._scan_abort.is_set():
                 break

+            if rel_path in _file_scanned_ids:
+                total_scanned += 1
+                continue
+
             total_scanned += 1
             broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
@@ -353,6 +374,7 @@ def run_file_scan(source: dict):
                 }
                 _state.flagged_items.append(card)
+                _file_flagged.append(card)
                 total_flagged += 1
                 broadcast("scan_file_flagged", _with_disposition(card, _db))
@@ -362,10 +384,19 @@ def run_file_scan(source: dict):
                     except Exception as e:
                         logger.error("[db] save_item failed: %s", e)

+            _file_scanned_ids.add(rel_path)
+            _file_items_since_save += 1
+            if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
+                _save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
+                _file_items_since_save = 0
+
     except Exception as e:
         import traceback
         broadcast("scan_error", {"file": label, "error": str(e)})
         logger.error("[file_scan] error:\n%s", traceback.format_exc())
+    else:
+        if not _state._scan_abort.is_set():
+            _clear_checkpoint(prefix=_ck_prefix)
     finally:
         if _db and _db_scan_id:
             try:
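
The `else:` branch added above relies on a Python detail: a `try` block's `else` clause runs whenever no exception was raised, even if the loop inside it ended with `break`. That is why the abort flag is checked again before the checkpoint is cleared. A small sketch of the control flow, with hypothetical names (`run_scan`, `clear`):

```python
import threading
from typing import Callable, Iterable


def run_scan(
    items: Iterable[str],
    abort: threading.Event,
    clear: Callable[[], None],
    fail_on: str = "",
) -> str:
    """Return 'done', 'aborted' or 'error'; clear the checkpoint only on 'done'."""
    outcome = "done"
    try:
        for item in items:
            if abort.is_set():
                outcome = "aborted"
                break
            if item == fail_on:
                raise RuntimeError(f"cannot read {item}")
    except Exception:
        outcome = "error"  # the checkpoint is kept so the scan can resume
    else:
        # Reached after a normal finish AND after `break`, hence the re-check.
        if not abort.is_set():
            clear()
    return outcome
```

Keeping the checkpoint on both error and abort is what lets the next run offer the resume banner.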


@@ -136,26 +136,39 @@ function buildScanPayload() {
   return { sources, fileSources, allSources, googleSources, user_ids, options };
 }

-async function checkCheckpoint() {
+async function checkCheckpoint(onNoCheckpoint) {
   const payload = buildScanPayload();
-  if (!payload.sources.length && !payload.fileSources.length) return;
-  if (payload.sources.length && !payload.user_ids.length) return;
+  const banner = document.getElementById('resumeBanner');
+  const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
+  if (!hasSources) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // M365 sources without users — scan button will handle the alert
+  if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // Collect Google user emails for server-side checkpoint key computation
+  const googleUserEmails = payload.googleSources.length > 0
+    ? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
+    : [];
   try {
     const r = await fetch('/api/scan/checkpoint', {
       method: 'POST', headers: {'Content-Type':'application/json'},
-      body: JSON.stringify(payload)
+      body: JSON.stringify({...payload, googleUserEmails})
     });
     const d = await r.json();
-    const banner = document.getElementById('resumeBanner');
     if (d.exists) {
       const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
       document.getElementById('resumeBannerText').textContent =
         t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
-      banner.style.display = 'flex';
+      if (banner) banner.style.display = 'flex';
     } else {
-      banner.style.display = 'none';
+      if (banner) banner.style.display = 'none';
+      onNoCheckpoint?.();
     }
-  } catch(e) { /* ignore */ }
+  } catch(e) { onNoCheckpoint?.(); }
 }

 async function clearCheckpointAndScan() {


@@ -302,7 +302,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
 <!-- Topbar -->
 <div class="topbar">
   <span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
-  <button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
+  <button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
   <button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>

   <!-- Profile selector (15c) -->


@@ -22,8 +22,8 @@ import checkpoint
 @pytest.fixture(autouse=True)
 def _isolate(tmp_path, monkeypatch):
     """Redirect all disk writes to a temp dir for each test."""
-    monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
-    monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")
+    monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
+    monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")

 _OPTS = {