Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items.

StyxX65 2026-04-25 20:30:59 +02:00
parent 2254e00481
commit 8b55e9d933
12 changed files with 268 additions and 40 deletions


@@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 ### Added
+- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
 - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
 - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
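The email and phone detection entry in this hunk describes its patterns only in words. The sketch below shows one way patterns with those properties could look. The two regexes, the `count_matches` helper, and the last-eight-digits normalisation are illustrative assumptions; the project's real patterns are not part of this diff.

```python
import re

# Assumed pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# Assumed pattern: optional +45 / 0045 prefix, then eight digits written
# either contiguously or in space-separated pairs. The lookarounds stop the
# pattern from matching inside a longer digit run (such as a CPR number).
PHONE_RE = re.compile(
    r"(?<![\d+])(?:(?:\+45|0045)[ ]?)?\d{2}(?:[ ]?\d{2}){3}(?!\d)"
)


def count_matches(text: str) -> tuple[int, int]:
    """Return (unique emails, unique phone numbers) found in text."""
    # Emails are compared case-insensitively.
    emails = {m.group(0).lower() for m in EMAIL_RE.finditer(text)}
    # Phones are reduced to their last eight digits so that "+45 12 34 56 78"
    # and "12345678" count as the same number.
    phones = {re.sub(r"\D", "", m.group(0))[-8:] for m in PHONE_RE.finditer(text)}
    return len(emails), len(phones)
```

Normalising before counting is what makes the "deduplicated before counting" behaviour independent of how a number or address happens to be written.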


@@ -30,7 +30,9 @@ python -m pytest tests/ -q
 **Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
-**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
+**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
+**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
 ## Non-obvious files

OSS_LANDSCAPE.md (new file, 67 lines)

@@ -0,0 +1,67 @@
# Open Source Landscape — GDPR / PII Document Scanners
An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
---
## Summary
No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
---
## Existing open source tools
### [Microsoft Presidio](https://github.com/microsoft/presidio)
A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
### [Octopii](https://github.com/redhuntlabs/Octopii)
Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
### "GDPR scanners" on GitHub
Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
### CPR libraries
Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
---
## Commercial products that do cover it
| Product | M365 | GWS | CPR | Article 30 | Open source |
|---|---|---|---|---|---|
| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
---
## Capability comparison
| Capability | GDPRScanner | Presidio | Octopii | Commercial |
|---|---|---|---|---|
| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ | ✅ | ✅ | ❌ |
---
## What makes GDPRScanner unique
The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
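The modulus-11 validation and date sanity checks mentioned in this file can be sketched as below. The weights are the standard CPR check weights; the function names are hypothetical, and GDPRScanner's own validator is not part of this diff. Note that CPR numbers issued since 2007 are not guaranteed to satisfy the check, so a failure is better read as lower confidence than as proof of a non-match.

```python
import re
from datetime import date

# Standard modulus-11 weights for the ten CPR digits (DDMMYY-SSSS).
_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def _digits(cpr: str) -> str:
    """Strip the optional hyphen and any other non-digit characters."""
    return re.sub(r"\D", "", cpr)


def cpr_passes_mod11(cpr: str) -> bool:
    """True if the weighted digit sum is divisible by 11."""
    digits = _digits(cpr)
    if len(digits) != 10:
        return False
    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
    return total % 11 == 0


def cpr_has_plausible_date(cpr: str) -> bool:
    """True if the first four digits form a real day/month combination."""
    digits = _digits(cpr)
    if len(digits) != 10:
        return False
    day, month = int(digits[0:2]), int(digits[2:4])
    try:
        date(2000, month, day)  # 2000 is a leap year, so 29 February passes
    except ValueError:
        return False
    return True
```

Running the date check first is cheap and filters out most random ten-digit strings before the checksum is considered.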


@@ -119,6 +119,12 @@ Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud s
 ---
+### Checkpoint / resume for Google and File scans ✅
+Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
+---
 ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
 The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.


@@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)
 _DATA_DIR = Path.home() / ".gdprscanner"
 _DATA_DIR.mkdir(exist_ok=True)
-_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
+def _cp_path(prefix: str) -> Path:
+    return _DATA_DIR / f"checkpoint_{prefix}.json"
 def _checkpoint_key(options: dict) -> str:
     """Stable hash of the scan options — used to detect when a checkpoint
@@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
     }, sort_keys=True)
     return hashlib.sha256(sig.encode()).hexdigest()[:16]
-def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
+def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
     """Write checkpoint to disk. Called periodically during scanning."""
     try:
         payload = {
@@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
             "flagged": flagged,
             "meta": {k: v for k, v in meta.items() if k != "options"},
         }
-        tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
+        path = _cp_path(prefix)
+        tmp = path.with_suffix(".tmp")
         tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
-        tmp.replace(_CHECKPOINT_PATH)
+        tmp.replace(path)
     except Exception as e:
         logger.error("[checkpoint] save failed: %s", e)
-def _load_checkpoint(key: str) -> dict | None:
+def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
     """Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
     try:
-        if not _CHECKPOINT_PATH.exists():
+        path = _cp_path(prefix)
+        if not path.exists():
             return None
-        payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
+        payload = json.loads(path.read_text(encoding="utf-8"))
         if payload.get("key") != key:
             return None
         return payload
     except Exception:
         return None
-def _clear_checkpoint() -> None:
+def _clear_checkpoint(*, prefix: str = "m365") -> None:
     try:
-        if _CHECKPOINT_PATH.exists():
-            _CHECKPOINT_PATH.unlink()
+        path = _cp_path(prefix)
+        if path.exists():
+            path.unlink()
     except Exception:
         pass
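The per-prefix save and load path in this hunk can be exercised in isolation with a sketch like the one below. It mirrors the write-to-temp-then-rename pattern from the diff, but takes the data directory as an explicit argument so it never touches `~/.gdprscanner`. The function names, the `data_dir` parameter, and the reduced payload shape are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def cp_path(data_dir: Path, prefix: str) -> Path:
    """One checkpoint file per engine prefix, as in the diff's _cp_path."""
    return data_dir / f"checkpoint_{prefix}.json"


def save_checkpoint(data_dir: Path, key: str, scanned_ids: set, flagged: list,
                    *, prefix: str = "m365") -> None:
    path = cp_path(data_dir, prefix)
    tmp = path.with_suffix(".tmp")
    # Assumed payload shape: the diff does not show every field.
    payload = {"key": key, "scanned_ids": sorted(scanned_ids), "flagged": flagged}
    tmp.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)  # rename over the target so readers never see a partial file


def load_checkpoint(data_dir: Path, key: str, *, prefix: str = "m365"):
    path = cp_path(data_dir, prefix)
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    # A checkpoint written for different scan options is treated as absent.
    return payload if payload.get("key") == key else None


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        save_checkpoint(data_dir, "abc", {"id1", "id2"}, [], prefix="google")
        save_checkpoint(data_dir, "xyz", {"a.txt"}, [], prefix="file_src1")
        return {
            "files": sorted(p.name for p in data_dir.iterdir()),
            "google": load_checkpoint(data_dir, "abc", prefix="google"),
            "stale": load_checkpoint(data_dir, "other", prefix="google"),
            "m365": load_checkpoint(data_dir, "abc"),  # default prefix, no file
        }
```

Because the prefix only selects the filename, the engines cannot overwrite each other's progress, and a call that omits `prefix` keeps addressing the M365 file.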


@@ -251,7 +251,7 @@ from app_config import (
 from checkpoint import (
     _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
     _load_delta_tokens, _save_delta_tokens,
-    _CHECKPOINT_PATH, _DELTA_PATH,
+    _cp_path, _DELTA_PATH,
 )
 from sse import broadcast, _sse_queues, _sse_buffer
@@ -1842,7 +1842,7 @@ Example --settings file with SMTP:
         (_SETTINGS_PATH, "Headless scan settings"),
         (_ROLE_OVERRIDES_PATH, "Manual role overrides"),
         (_FILE_SOURCES_PATH, "File source definitions"),
-        (_CHECKPOINT_PATH, "Scan checkpoint (resume state)"),
+        (_cp_path("m365"), "Scan checkpoint (resume state)"),
         (_DELTA_PATH, "Delta scan tokens"),
         (_LANG_OVERRIDE_FILE, "Language preference"),
         (Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@@ -1929,10 +1929,12 @@ Example --settings file with SMTP:
             print(" ✖ m365_db not available — cannot reset")
             _sys.exit(1)
-        # Also clear the JSON checkpoint so the UI starts with no cached results
-        _clear_checkpoint()
-        if not _CHECKPOINT_PATH.exists():
-            print(f" ✔ Checkpoint cleared")
+        # Also clear all checkpoints so the UI starts with no cached results
+        from pathlib import Path as _Path
+        for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
+            try: _cpf.unlink()
+            except Exception: pass
+        print(f" ✔ Checkpoints cleared")
         # Clear delta tokens too — stale after a full DB reset
         if _DELTA_PATH.exists():


@@ -144,7 +144,8 @@ def _run_google_scan(options: dict):
     scan_emails = bool(scan_opts.get("scan_emails", False))
     scan_phones = bool(scan_opts.get("scan_phones", False))
-    from checkpoint import _load_delta_tokens, _save_delta_tokens
+    from checkpoint import (_load_delta_tokens, _save_delta_tokens,
+                            _save_checkpoint, _load_checkpoint, _clear_checkpoint)
     _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
     _new_drive_tokens: dict = {}
@@ -195,6 +196,28 @@ def _run_google_scan(options: dict):
         except Exception as e:
             logger.error("[google_scan] begin_scan failed: %s", e)
+    # ── Checkpoint: resume from a previous interrupted Google scan ────────────
+    import hashlib as _hl, json as _js
+    _gck_prefix = "google"
+    _gck_key = _hl.sha256(_js.dumps({
+        "emails": sorted(user_emails),
+        "sources": sorted(sources),
+        "older_than_days": scan_opts.get("older_than_days", 0),
+    }, sort_keys=True).encode()).hexdigest()[:16]
+    _gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
+    _g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
+    _google_flagged: list = [] # items found by this Google scan (for checkpoint)
+    _gck_resumed = len(_g_scanned_ids)
+    if _gck:
+        from scan_engine import _with_disposition as _wd_ck
+        _google_flagged = list(_gck.get("flagged", []))
+        flagged_items.extend(_google_flagged)
+        broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
+        for _card in _google_flagged:
+            broadcast("scan_file_flagged", _wd_ck(_card, _db))
+    _GCHECKPOINT_SAVE_EVERY = 25
+    _g_items_since_save = 0
     total_flagged = 0
     total_scanned = 0
     t_start = _time.monotonic()
@@ -234,6 +257,7 @@ def _run_google_scan(options: dict):
             "exif": {},
         }
         flagged_items.append(card)
+        _google_flagged.append(card)
         broadcast("scan_file_flagged", _with_disposition(card, _db))
         total_flagged += 1
         if _db and _db_scan_id:
@@ -265,6 +289,10 @@ def _run_google_scan(options: dict):
                 ):
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -279,6 +307,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "msg.txt"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -288,6 +317,11 @@ def _run_google_scan(options: dict):
                         meta["_email_count"] = len(_em)
                         meta["_phone_count"] = len(_ph)
                         _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
             except Exception as e:
@@ -327,6 +361,10 @@ def _run_google_scan(options: dict):
                 for meta, data in drive_items:
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -341,6 +379,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "file"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -350,6 +389,11 @@ def _run_google_scan(options: dict):
                         meta["_email_count"] = len(_em)
                         meta["_phone_count"] = len(_ph)
                         _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
             except Exception as e:
@@ -362,6 +406,10 @@ def _run_google_scan(options: dict):
         except Exception as e:
             logger.warning("[gdrive delta] token save failed: %s", e)
+    from gdpr_scanner import _scan_abort as _gsa
+    if not _gsa.is_set():
+        _clear_checkpoint(prefix=_gck_prefix)
     elapsed = _time.monotonic() - t_start
     broadcast("google_scan_done", {
         "flagged_count": total_flagged,

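The Google checkpoint key is computed inline in the hunk above and again in the `/api/scan/checkpoint` route. Both sites must hash exactly the same fields, or a saved checkpoint will never be found. One way to keep them aligned is a shared helper, sketched here under the hypothetical name `google_checkpoint_key`; the hashing itself follows the inline code in the diff.

```python
import hashlib
import json


def google_checkpoint_key(emails, sources, older_than_days=0) -> str:
    """Order-insensitive 16-hex-character key for a Google scan configuration."""
    sig = json.dumps(
        {
            # Sorting makes the key independent of selection order in the UI.
            "emails": sorted(emails),
            "sources": sorted(sources),
            "older_than_days": older_than_days,
        },
        sort_keys=True,
    )
    return hashlib.sha256(sig.encode()).hexdigest()[:16]
```

The key is sensitive to value types as well as values: `0` and `"0"` serialise differently, so the frontend and the scan engine need to agree on how `older_than_days` is represented. The file-source path has the same requirement. The route builds its key from `"sources": ["file"]` while `run_file_scan` passes the source's `source_type`; whether that yields different keys depends on which fields `_checkpoint_key` hashes, which this diff does not show.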

@@ -13,7 +13,7 @@ from app_config import (
 )
 from checkpoint import (
     _checkpoint_key, _load_checkpoint, _clear_checkpoint,
-    _load_delta_tokens, _DELTA_PATH,
+    _load_delta_tokens, _DELTA_PATH, _cp_path,
 )
 bp = Blueprint("scan", __name__)
@@ -121,28 +121,80 @@ def scan_stop():
 def scan_checkpoint_info():
     """Return info about any saved checkpoint for the given scan options.
     If check_only=true, just reports whether a scan is currently running."""
+    import hashlib, json as _json
     options = request.get_json() or {}
     if options.get("check_only"):
         acquired = state._scan_lock.acquire(blocking=False)
         if acquired:
             state._scan_lock.release()
         return jsonify({"running": not acquired})
-    key = _checkpoint_key(options)
-    cp = _load_checkpoint(key)
-    if not cp:
-        return jsonify({"exists": False})
-    return jsonify({
-        "exists": True,
-        "scanned_count": len(cp.get("scanned_ids", [])),
-        "flagged_count": len(cp.get("flagged", [])),
-        "started_at": cp.get("meta", {}).get("started_at"),
+    engines = {}
+    # M365
+    if options.get("sources"):
+        key = _checkpoint_key(options)
+        cp = _load_checkpoint(key, prefix="m365")
+        if cp:
+            engines["m365"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+    # Google
+    google_emails = options.get("googleUserEmails", [])
+    google_sources = options.get("googleSources", [])
+    if google_emails and google_sources:
+        gkey = hashlib.sha256(_json.dumps({
+            "emails": sorted(google_emails),
+            "sources": sorted(google_sources),
+            "older_than_days": options.get("options", {}).get("older_than_days", 0),
+        }, sort_keys=True).encode()).hexdigest()[:16]
+        cp = _load_checkpoint(gkey, prefix="google")
+        if cp:
+            engines["google"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+    # File sources (one checkpoint per source ID)
+    for src_id in options.get("fileSources", []):
+        fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
+        cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
+        if cp:
+            fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
+            fe["scanned_count"] += len(cp.get("scanned_ids", []))
+            fe["flagged_count"] += len(cp.get("flagged", []))
+            if not fe["started_at"]:
+                fe["started_at"] = cp.get("meta", {}).get("started_at")
+    if not engines:
+        return jsonify({"exists": False})
+    started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
+    return jsonify({
+        "exists": True,
+        "scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
+        "flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
+        "started_at": min(started_ats) if started_ats else None,
+        "engines": engines,
     })
 @bp.route("/api/scan/clear_checkpoint", methods=["POST"])
 def scan_clear_checkpoint():
-    """Discard any saved checkpoint so the next scan starts fresh."""
-    _clear_checkpoint()
+    """Discard all saved checkpoints so the next scan starts fresh."""
+    from pathlib import Path
+    data_dir = Path.home() / ".gdprscanner"
+    for f in data_dir.glob("checkpoint_*.json"):
+        try:
+            f.unlink()
+        except Exception:
+            pass
     return jsonify({"status": "cleared"})
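The response shape produced by the checkpoint route can be illustrated without Flask. The sketch below reproduces only the final aggregation step from the hunk above as a plain function; `summarise_checkpoints` is a hypothetical name and the sample numbers are invented for the example.

```python
def summarise_checkpoints(engines: dict) -> dict:
    """Combine per-engine checkpoint summaries into one response body."""
    if not engines:
        return {"exists": False}
    # Engines that saved no start time (None) are ignored when picking
    # the earliest timestamp.
    started = [e["started_at"] for e in engines.values() if e.get("started_at")]
    return {
        "exists": True,
        "scanned_count": sum(e.get("scanned_count", 0) for e in engines.values()),
        "flagged_count": sum(e.get("flagged_count", 0) for e in engines.values()),
        "started_at": min(started) if started else None,
        "engines": engines,
    }


# Invented sample: an M365 checkpoint with a start time and a Google
# checkpoint without one.
sample = summarise_checkpoints({
    "m365": {"exists": True, "scanned_count": 50, "flagged_count": 3,
             "started_at": 1714000000},
    "google": {"exists": True, "scanned_count": 25, "flagged_count": 1,
               "started_at": None},
})
```

The top-level `scanned_count`, `flagged_count`, and `started_at` keep the keys the resume banner already reads, while `engines` carries the new per-engine breakdown.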


@@ -125,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
 # checkpoint helpers — injected by gdpr_scanner.py
 def _checkpoint_key(opts): return "" # type: ignore[misc]
 def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
-def _load_checkpoint(key): return None # type: ignore[misc]
-def _clear_checkpoint(): pass # type: ignore[misc]
+def _load_checkpoint(key, **kw): return None # type: ignore[misc]
+def _clear_checkpoint(**kw): pass # type: ignore[misc]
 def _load_delta_tokens(): return {} # type: ignore[misc]
 def _save_delta_tokens(t): pass # type: ignore[misc]
@@ -209,6 +209,23 @@ def run_file_scan(source: dict):
         except Exception as e:
             logger.error("[db] start_scan failed: %s", e)
+    # \u2500\u2500 Checkpoint: resume from a previous interrupted file scan \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500
+    _ck_prefix = f"file_{source.get('id', 'local')}"
+    _ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
+    _ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
+    _file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
+    _file_flagged: list = [] # items found by this file scan run (for checkpoint)
+    _ck_resumed = len(_file_scanned_ids)
+    if _ck:
+        _file_flagged = list(_ck.get("flagged", []))
+        for card in _file_flagged:
+            _state.flagged_items.append(card)
+        broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
+        for card in _file_flagged:
+            broadcast("scan_file_flagged", _with_disposition(card, _db))
+    _CHECKPOINT_SAVE_EVERY_FILE = 25
+    _file_items_since_save = 0
     total_scanned = 0
     total_flagged = 0
@@ -247,6 +264,10 @@ def run_file_scan(source: dict):
             if _state._scan_abort.is_set():
                 break
+            if rel_path in _file_scanned_ids:
+                total_scanned += 1
+                continue
             total_scanned += 1
             broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
@@ -353,6 +374,7 @@ def run_file_scan(source: dict):
                 }
                 _state.flagged_items.append(card)
+                _file_flagged.append(card)
                 total_flagged += 1
                 broadcast("scan_file_flagged", _with_disposition(card, _db))
@@ -362,10 +384,19 @@ def run_file_scan(source: dict):
                     except Exception as e:
                         logger.error("[db] save_item failed: %s", e)
+            _file_scanned_ids.add(rel_path)
+            _file_items_since_save += 1
+            if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
+                _save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
+                _file_items_since_save = 0
     except Exception as e:
         import traceback
         broadcast("scan_error", {"file": label, "error": str(e)})
         logger.error("[file_scan] error:\n%s", traceback.format_exc())
+    else:
+        if not _state._scan_abort.is_set():
+            _clear_checkpoint(prefix=_ck_prefix)
     finally:
         if _db and _db_scan_id:
             try:
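The control flow added to `run_file_scan` (skip known items, save every N items, clear only after an uninterrupted run) can be modelled with a toy loop. In the sketch below `run_scan`, `abort_after`, and the in-memory `saved` set are stand-ins invented for illustration; no files are written.

```python
def run_scan(items, checkpoint=None, abort_after=None, save_every=25):
    """Toy scan loop. Returns (items processed this run, checkpoint left behind)."""
    scanned = set(checkpoint or ())
    saved = set(scanned) if checkpoint else None  # models the file on disk
    processed, since_save, aborted = [], 0, False
    try:
        for item in items:
            if abort_after is not None and len(processed) >= abort_after:
                aborted = True  # stands in for the abort event being set
                break
            if item in scanned:  # resumed run: skip work already done
                continue
            processed.append(item)
            scanned.add(item)
            since_save += 1
            if since_save >= save_every:
                saved = set(scanned)  # stands in for _save_checkpoint(...)
                since_save = 0
    except Exception as exc:
        # A failure keeps the last saved checkpoint for a later resume.
        print(f"scan failed, checkpoint kept: {exc}")
    else:
        # Reached when no exception was raised, including after `break`,
        # so the abort flag decides whether the checkpoint is cleared.
        if not aborted:
            saved = None
    return processed, saved


items = [f"f{i}" for i in range(60)]
first_run, cp = run_scan(items, abort_after=30)  # interrupted after 30 items
second_run, cp_after = run_scan(items, checkpoint=cp)
```

In this model the first run leaves a checkpoint holding 25 items even though 30 were processed, so the resumed run repeats the last five. With no save at abort time, up to `save_every - 1` items can be rescanned after an interruption.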


@@ -136,26 +136,39 @@ function buildScanPayload() {
   return { sources, fileSources, allSources, googleSources, user_ids, options };
 }
-async function checkCheckpoint() {
+async function checkCheckpoint(onNoCheckpoint) {
   const payload = buildScanPayload();
-  if (!payload.sources.length && !payload.fileSources.length) return;
-  if (payload.sources.length && !payload.user_ids.length) return;
+  const banner = document.getElementById('resumeBanner');
+  const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
+  if (!hasSources) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // M365 sources without users — scan button will handle the alert
+  if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
+    if (banner) banner.style.display = 'none';
+    onNoCheckpoint?.(); return;
+  }
+  // Collect Google user emails for server-side checkpoint key computation
+  const googleUserEmails = payload.googleSources.length > 0
+    ? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
+    : [];
   try {
     const r = await fetch('/api/scan/checkpoint', {
       method: 'POST', headers: {'Content-Type':'application/json'},
-      body: JSON.stringify(payload)
+      body: JSON.stringify({...payload, googleUserEmails})
     });
     const d = await r.json();
-    const banner = document.getElementById('resumeBanner');
     if (d.exists) {
       const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
       document.getElementById('resumeBannerText').textContent =
         t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
-      banner.style.display = 'flex';
+      if (banner) banner.style.display = 'flex';
     } else {
-      banner.style.display = 'none';
+      if (banner) banner.style.display = 'none';
+      onNoCheckpoint?.();
     }
-  } catch(e) { /* ignore */ }
+  } catch(e) { onNoCheckpoint?.(); }
 }
 async function clearCheckpointAndScan() {


@@ -302,7 +302,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
 <!-- Topbar -->
 <div class="topbar">
   <span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
-  <button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
+  <button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
   <button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
   <!-- Profile selector (15c) -->


@@ -22,7 +22,7 @@ import checkpoint
 @pytest.fixture(autouse=True)
 def _isolate(tmp_path, monkeypatch):
     """Redirect all disk writes to a temp dir for each test."""
-    monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
+    monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
     monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")