Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (checkpoint_m365.json, checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items.
parent 2254e00481
commit 8b55e9d933
@@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html

 ### Added

+- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.
+
 - **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).

 - **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
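The email and phone detection described in the changelog entries above could be sketched roughly as follows. This is only an illustration: the regexes and the `count_contacts` helper are assumptions written for this note, not the patterns the scanner actually ships. It shows a structural email match, 8-digit Danish numbers with an optional `+45`/`0045` prefix, and deduplication before counting.

```python
import re

# Hypothetical patterns for illustration; the project's real regexes are not shown in this diff.
EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b"
)
PHONE_RE = re.compile(
    r"(?<!\d)(?:(?:\+45|0045)[ ]?)?"        # optional country prefix
    r"(\d{2}[ ]?\d{2}[ ]?\d{2}[ ]?\d{2})"   # 8 digits, optional spaces between pairs
    r"(?!\d)"                               # must not run into further digits
)


def count_contacts(text: str) -> tuple[int, int]:
    """Return (unique email count, unique phone count) for a block of text."""
    # Lower-case emails so "A@B.dk" and "a@b.dk" count once.
    emails = {m.group(0).lower() for m in EMAIL_RE.finditer(text)}
    # Strip spaces so "12 34 56 78" and "12345678" count once.
    phones = {re.sub(r"\s", "", m.group(1)) for m in PHONE_RE.finditer(text)}
    return len(emails), len(phones)


sample = (
    "Contact a@b.dk or A@B.dk, tlf. +45 12 34 56 78 / 12345678 / "
    "0045 1234 5678, alt. 87 65 43 21. CPR-like 0101011234 is not a phone."
)
counts = count_contacts(sample)
```

The digit lookarounds keep a 10-digit run such as a CPR number from being counted as a phone number, which matters when both detectors run over the same text.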
@@ -30,7 +30,9 @@ python -m pytest tests/ -q

 **Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.

-**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
+**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
+
+**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`

 ## Non-obvious files

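The per-prefix naming and the glob-based clearing described in the note above can be tried out in isolation. The sketch below is a stand-alone approximation, assuming a temporary directory in place of `~/.gdprscanner/`; `cp_path`, `save`, and `clear_all` are illustrative names, not the project's functions.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a throwaway directory stands in for ~/.gdprscanner/.
DATA_DIR = Path(tempfile.mkdtemp())


def cp_path(prefix: str) -> Path:
    """Build the checkpoint path for one engine: checkpoint_{prefix}.json."""
    return DATA_DIR / f"checkpoint_{prefix}.json"


def save(prefix: str, payload: dict) -> None:
    """Write to a .tmp sibling first, then rename over the real file."""
    path = cp_path(prefix)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)  # rename, so readers never see a half-written file


def clear_all() -> int:
    """Delete every checkpoint_*.json file and report how many were removed."""
    removed = 0
    for f in DATA_DIR.glob("checkpoint_*.json"):
        f.unlink()
        removed += 1
    return removed


for prefix in ("m365", "google", "file_42"):
    save(prefix, {"key": "demo", "scanned_ids": [], "flagged": []})

names = sorted(p.name for p in DATA_DIR.iterdir())
```

Deriving every filename from one helper is what lets a single glob pattern find all engines' checkpoints, including one file per file source.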
OSS_LANDSCAPE.md (new file, 67 lines)
@@ -0,0 +1,67 @@
+# Open Source Landscape — GDPR / PII Document Scanners
+
+An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
+
+---
+
+## Summary
+
+No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
+
+---
+
+## Existing open source tools
+
+### [Microsoft Presidio](https://github.com/microsoft/presidio)
+A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
+
+### [Octopii](https://github.com/redhuntlabs/Octopii)
+Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
+
+### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
+CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
+
+### "GDPR scanners" on GitHub
+Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
+
+### CPR libraries
+Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
+
+---
+
+## Commercial products that do cover it
+
+| Product | M365 | GWS | CPR | Article 30 | Open source |
+|---|---|---|---|---|---|
+| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
+| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
+| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
+| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
+
+PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
+
+---
+
+## Capability comparison
+
+| Capability | GDPRScanner | Presidio | Octopii | Commercial |
+|---|---|---|---|---|
+| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
+| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
+| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
+| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
+| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
+| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
+| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
+| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
+| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
+| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
+| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
+| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
+| Open source | ✅ | ✅ | ✅ | ❌ |
+
+---
+
+## What makes GDPRScanner unique
+
+The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
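The "modulus-11 validation, date sanity checks" named in OSS_LANDSCAPE.md refer to the standard CPR check: the ten digits are weighted by 4, 3, 2, 7, 6, 5, 4, 3, 2, 1, and the weighted sum must be divisible by 11. A minimal sketch follows, assuming nothing about how GDPRScanner implements it; the function names and the synthetic test numbers are made up for this illustration.

```python
from datetime import date

# Standard CPR modulus-11 weights, one per digit.
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def cpr_date_ok(digits: str) -> bool:
    """Check that the first six digits (DDMMYY) form a real calendar date."""
    day, month, yy = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    # The century is not recoverable from six digits alone in this sketch,
    # so accept the date if it is valid in either the 1900s or the 2000s.
    for century in (1900, 2000):
        try:
            date(century + yy, month, day)
            return True
        except ValueError:
            continue
    return False


def cpr_mod11_ok(digits: str) -> bool:
    """Weighted digit sum must be divisible by 11."""
    return sum(int(d) * w for d, w in zip(digits, WEIGHTS)) % 11 == 0


def looks_like_cpr(candidate: str) -> bool:
    """Accept 'DDMMYY-SSSS' or 'DDMMYYSSSS' that passes both checks."""
    digits = candidate.replace("-", "")
    if len(digits) != 10 or not digits.isdigit():
        return False
    return cpr_date_ok(digits) and cpr_mod11_ok(digits)


# Synthetic numbers constructed for this example, not real identities.
valid = looks_like_cpr("010101-1100")      # weighted sum 22, divisible by 11
bad_checksum = looks_like_cpr("0101011234")  # weighted sum 35
bad_date = looks_like_cpr("321301-1100")   # day 32, month 13
```

One design caveat: numbers issued since 2007 are not guaranteed to satisfy the modulus-11 rule, so a strict check of this kind trades some recall for fewer false positives.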
TODO.md (6 lines changed)
@@ -119,6 +119,12 @@ Scan SFTP servers (SSH File Transfer Protocol) alongside local, SMB, and cloud s

 ---

+### Checkpoint / resume for Google and File scans ✅
+
+Extended the M365 checkpoint/resume mechanism to all three scan engines. Each engine writes its own file (`checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. Previously found cards are re-emitted via SSE on resume so the grid repopulates before new items arrive. The Scan button now checks for a checkpoint before clearing the grid, so the resume banner appears even without a page reload. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. `checkpoint.py` functions gained a `prefix` keyword (default `"m365"`); M365 call sites are unchanged.
+
+---
+
 ### #32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
 The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.

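The resume behaviour summarised in the TODO entry above comes down to a small loop: skip IDs restored from the checkpoint, and flush progress every 25 newly processed items. The sketch below is a simplified model under that reading; `scan_with_resume` and its `save` callback are hypothetical names, and the real engines also broadcast SSE events and persist flagged cards.

```python
SAVE_EVERY = 25  # matches the "every 25 items" interval in the description


def scan_with_resume(items, scanned_ids, save):
    """Process items not yet in scanned_ids; call save() every SAVE_EVERY items.

    items       -- iterable of (item_id, payload) pairs
    scanned_ids -- set of IDs restored from a checkpoint (mutated in place)
    save        -- callback receiving a snapshot of the scanned-ID set
    """
    processed = 0
    since_save = 0
    for item_id, _payload in items:
        if item_id in scanned_ids:
            continue  # already handled before the interruption
        # ... the actual content scan would run here ...
        scanned_ids.add(item_id)
        processed += 1
        since_save += 1
        if since_save >= SAVE_EVERY:
            save(set(scanned_ids))  # snapshot, so later mutation is not shared
            since_save = 0
    return processed


# 60 items, of which the first 10 were scanned before the interruption.
items = [(f"id{i}", None) for i in range(60)]
restored = {f"id{i}" for i in range(10)}
snapshots = []
processed = scan_with_resume(items, restored, snapshots.append)
```

With a fixed interval, at most 24 items are re-scanned after a crash, which is the trade-off between write frequency and repeated work.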
@@ -15,7 +15,9 @@ logger = logging.getLogger(__name__)

 _DATA_DIR = Path.home() / ".gdprscanner"
 _DATA_DIR.mkdir(exist_ok=True)
-_CHECKPOINT_PATH = _DATA_DIR / "checkpoint.json"
+
+def _cp_path(prefix: str) -> Path:
+    return _DATA_DIR / f"checkpoint_{prefix}.json"

 def _checkpoint_key(options: dict) -> str:
     """Stable hash of the scan options — used to detect when a checkpoint
@@ -27,7 +29,7 @@ def _checkpoint_key(options: dict) -> str:
     }, sort_keys=True)
     return hashlib.sha256(sig.encode()).hexdigest()[:16]

-def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> None:
+def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict, *, prefix: str = "m365") -> None:
     """Write checkpoint to disk. Called periodically during scanning."""
     try:
         payload = {
@@ -36,28 +38,31 @@ def _save_checkpoint(key: str, scanned_ids: set, flagged: list, meta: dict) -> N
             "flagged": flagged,
             "meta": {k: v for k, v in meta.items() if k != "options"},
         }
-        tmp = _CHECKPOINT_PATH.with_suffix(".tmp")
+        path = _cp_path(prefix)
+        tmp = path.with_suffix(".tmp")
         tmp.write_text(json.dumps(payload, ensure_ascii=False, default=str), encoding="utf-8")
-        tmp.replace(_CHECKPOINT_PATH)
+        tmp.replace(path)
     except Exception as e:
         logger.error("[checkpoint] save failed: %s", e)

-def _load_checkpoint(key: str) -> dict | None:
+def _load_checkpoint(key: str, *, prefix: str = "m365") -> dict | None:
     """Load checkpoint if it matches the current scan key. Returns None on mismatch or error."""
     try:
-        if not _CHECKPOINT_PATH.exists():
+        path = _cp_path(prefix)
+        if not path.exists():
             return None
-        payload = json.loads(_CHECKPOINT_PATH.read_text(encoding="utf-8"))
+        payload = json.loads(path.read_text(encoding="utf-8"))
         if payload.get("key") != key:
             return None
         return payload
     except Exception:
         return None

-def _clear_checkpoint() -> None:
+def _clear_checkpoint(*, prefix: str = "m365") -> None:
     try:
-        if _CHECKPOINT_PATH.exists():
-            _CHECKPOINT_PATH.unlink()
+        path = _cp_path(prefix)
+        if path.exists():
+            path.unlink()
     except Exception:
         pass

@@ -251,7 +251,7 @@ from app_config import (
 from checkpoint import (
     _checkpoint_key, _save_checkpoint, _load_checkpoint, _clear_checkpoint,
     _load_delta_tokens, _save_delta_tokens,
-    _CHECKPOINT_PATH, _DELTA_PATH,
+    _cp_path, _DELTA_PATH,
 )

 from sse import broadcast, _sse_queues, _sse_buffer
@@ -1842,7 +1842,7 @@ Example --settings file with SMTP:
         (_SETTINGS_PATH, "Headless scan settings"),
         (_ROLE_OVERRIDES_PATH, "Manual role overrides"),
         (_FILE_SOURCES_PATH, "File source definitions"),
-        (_CHECKPOINT_PATH, "Scan checkpoint (resume state)"),
+        (_cp_path("m365"), "Scan checkpoint (resume state)"),
         (_DELTA_PATH, "Delta scan tokens"),
         (_LANG_OVERRIDE_FILE, "Language preference"),
         (Path.home() / ".gdprscanner" / "schedule.json", "Scheduler configuration"),
@@ -1929,10 +1929,12 @@ Example --settings file with SMTP:
             print(" ✖ m365_db not available — cannot reset")
             _sys.exit(1)

-        # Also clear the JSON checkpoint so the UI starts with no cached results
-        _clear_checkpoint()
-        if not _CHECKPOINT_PATH.exists():
-            print(f" ✔ Checkpoint cleared")
+        # Also clear all checkpoints so the UI starts with no cached results
+        from pathlib import Path as _Path
+        for _cpf in (_Path.home() / ".gdprscanner").glob("checkpoint_*.json"):
+            try: _cpf.unlink()
+            except Exception: pass
+        print(f" ✔ Checkpoints cleared")

         # Clear delta tokens too — stale after a full DB reset
         if _DELTA_PATH.exists():
@@ -144,7 +144,8 @@ def _run_google_scan(options: dict):
     scan_emails = bool(scan_opts.get("scan_emails", False))
     scan_phones = bool(scan_opts.get("scan_phones", False))

-    from checkpoint import _load_delta_tokens, _save_delta_tokens
+    from checkpoint import (_load_delta_tokens, _save_delta_tokens,
+                            _save_checkpoint, _load_checkpoint, _clear_checkpoint)
     _drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
     _new_drive_tokens: dict = {}

@@ -195,6 +196,28 @@ def _run_google_scan(options: dict):
     except Exception as e:
         logger.error("[google_scan] begin_scan failed: %s", e)

+    # ── Checkpoint: resume from a previous interrupted Google scan ────────────
+    import hashlib as _hl, json as _js
+    _gck_prefix = "google"
+    _gck_key = _hl.sha256(_js.dumps({
+        "emails": sorted(user_emails),
+        "sources": sorted(sources),
+        "older_than_days": scan_opts.get("older_than_days", 0),
+    }, sort_keys=True).encode()).hexdigest()[:16]
+    _gck = _load_checkpoint(_gck_key, prefix=_gck_prefix)
+    _g_scanned_ids: set = set(_gck["scanned_ids"]) if _gck else set()
+    _google_flagged: list = [] # items found by this Google scan (for checkpoint)
+    _gck_resumed = len(_g_scanned_ids)
+    if _gck:
+        from scan_engine import _with_disposition as _wd_ck
+        _google_flagged = list(_gck.get("flagged", []))
+        flagged_items.extend(_google_flagged)
+        broadcast("scan_phase", {"phase": f"Resuming — skipping {_gck_resumed} already-scanned items…"})
+        for _card in _google_flagged:
+            broadcast("scan_file_flagged", _wd_ck(_card, _db))
+    _GCHECKPOINT_SAVE_EVERY = 25
+    _g_items_since_save = 0
+
     total_flagged = 0
     total_scanned = 0
     t_start = _time.monotonic()
@@ -234,6 +257,7 @@ def _run_google_scan(options: dict):
             "exif": {},
         }
         flagged_items.append(card)
+        _google_flagged.append(card)
         broadcast("scan_file_flagged", _with_disposition(card, _db))
         total_flagged += 1
         if _db and _db_scan_id:
@@ -265,6 +289,10 @@ def _run_google_scan(options: dict):
                 ):
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -279,6 +307,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "msg.txt"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -288,6 +317,11 @@ def _run_google_scan(options: dict):
                     meta["_email_count"] = len(_em)
                     meta["_phone_count"] = len(_ph)
                     _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
             except Exception as e:
@@ -327,6 +361,10 @@ def _run_google_scan(options: dict):
                 for meta, data in drive_items:
                     if _check_abort():
                         return
+                    _item_id = meta.get("id", "")
+                    if _item_id in _g_scanned_ids:
+                        total_scanned += 1
+                        continue
                     total_scanned += 1
                     broadcast("scan_file", {"file": meta.get("name", "")})
                     broadcast("scan_progress", {
@@ -341,6 +379,7 @@ def _run_google_scan(options: dict):
                         result = _scan_bytes(data, meta.get("name", "file"))
                     except Exception as e:
                         broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
+                        _g_scanned_ids.add(_item_id)
                         continue
                     cprs = result.get("cprs", [])
                     pii_counts = result.get("pii_counts")
@@ -350,6 +389,11 @@ def _run_google_scan(options: dict):
                     meta["_email_count"] = len(_em)
                     meta["_phone_count"] = len(_ph)
                     _broadcast_card(meta, cprs, pii_counts)
+                    _g_scanned_ids.add(_item_id)
+                    _g_items_since_save += 1
+                    if _g_items_since_save >= _GCHECKPOINT_SAVE_EVERY:
+                        _save_checkpoint(_gck_key, _g_scanned_ids, _google_flagged, {}, prefix=_gck_prefix)
+                        _g_items_since_save = 0
             except GoogleError as e:
                 broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})
             except Exception as e:
@@ -362,6 +406,10 @@ def _run_google_scan(options: dict):
         except Exception as e:
             logger.warning("[gdrive delta] token save failed: %s", e)

+    from gdpr_scanner import _scan_abort as _gsa
+    if not _gsa.is_set():
+        _clear_checkpoint(prefix=_gck_prefix)
+
     elapsed = _time.monotonic() - t_start
     broadcast("google_scan_done", {
         "flagged_count": total_flagged,
@@ -13,7 +13,7 @@ from app_config import (
 )
 from checkpoint import (
     _checkpoint_key, _load_checkpoint, _clear_checkpoint,
-    _load_delta_tokens, _DELTA_PATH,
+    _load_delta_tokens, _DELTA_PATH, _cp_path,
 )

 bp = Blueprint("scan", __name__)
@@ -121,28 +121,80 @@ def scan_stop():
 def scan_checkpoint_info():
     """Return info about any saved checkpoint for the given scan options.
     If check_only=true, just reports whether a scan is currently running."""
+    import hashlib, json as _json
     options = request.get_json() or {}
     if options.get("check_only"):
         acquired = state._scan_lock.acquire(blocking=False)
         if acquired:
             state._scan_lock.release()
         return jsonify({"running": not acquired})
-    key = _checkpoint_key(options)
-    cp = _load_checkpoint(key)
-    if not cp:
-        return jsonify({"exists": False})
-    return jsonify({
-        "exists": True,
-        "scanned_count": len(cp.get("scanned_ids", [])),
-        "flagged_count": len(cp.get("flagged", [])),
-        "started_at": cp.get("meta", {}).get("started_at"),
+
+    engines = {}
+
+    # M365
+    if options.get("sources"):
+        key = _checkpoint_key(options)
+        cp = _load_checkpoint(key, prefix="m365")
+        if cp:
+            engines["m365"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+
+    # Google
+    google_emails = options.get("googleUserEmails", [])
+    google_sources = options.get("googleSources", [])
+    if google_emails and google_sources:
+        gkey = hashlib.sha256(_json.dumps({
+            "emails": sorted(google_emails),
+            "sources": sorted(google_sources),
+            "older_than_days": options.get("options", {}).get("older_than_days", 0),
+        }, sort_keys=True).encode()).hexdigest()[:16]
+        cp = _load_checkpoint(gkey, prefix="google")
+        if cp:
+            engines["google"] = {
+                "exists": True,
+                "scanned_count": len(cp.get("scanned_ids", [])),
+                "flagged_count": len(cp.get("flagged", [])),
+                "started_at": cp.get("meta", {}).get("started_at"),
+            }
+
+    # File sources (one checkpoint per source ID)
+    for src_id in options.get("fileSources", []):
+        fkey = _checkpoint_key({"sources": ["file"], "user_ids": [src_id], "options": {}})
+        cp = _load_checkpoint(fkey, prefix=f"file_{src_id}")
+        if cp:
+            fe = engines.setdefault("file", {"exists": True, "scanned_count": 0, "flagged_count": 0, "started_at": None})
+            fe["scanned_count"] += len(cp.get("scanned_ids", []))
+            fe["flagged_count"] += len(cp.get("flagged", []))
+            if not fe["started_at"]:
+                fe["started_at"] = cp.get("meta", {}).get("started_at")
+
+    if not engines:
+        return jsonify({"exists": False})
+
+    started_ats = [v["started_at"] for v in engines.values() if v.get("started_at")]
+    return jsonify({
+        "exists": True,
+        "scanned_count": sum(v.get("scanned_count", 0) for v in engines.values()),
+        "flagged_count": sum(v.get("flagged_count", 0) for v in engines.values()),
+        "started_at": min(started_ats) if started_ats else None,
+        "engines": engines,
     })


 @bp.route("/api/scan/clear_checkpoint", methods=["POST"])
 def scan_clear_checkpoint():
-    """Discard any saved checkpoint so the next scan starts fresh."""
-    _clear_checkpoint()
+    """Discard all saved checkpoints so the next scan starts fresh."""
+    from pathlib import Path
+    data_dir = Path.home() / ".gdprscanner"
+    for f in data_dir.glob("checkpoint_*.json"):
+        try:
+            f.unlink()
+        except Exception:
+            pass
     return jsonify({"status": "cleared"})


@ -125,8 +125,8 @@ def _html_esc(s): return str(s) # type: ignore[misc]
|
|||||||
# checkpoint helpers — injected by gdpr_scanner.py
|
# checkpoint helpers — injected by gdpr_scanner.py
|
||||||
def _checkpoint_key(opts): return "" # type: ignore[misc]
|
def _checkpoint_key(opts): return "" # type: ignore[misc]
|
||||||
def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
|
def _save_checkpoint(*a, **kw): pass # type: ignore[misc]
|
||||||
def _load_checkpoint(key): return None # type: ignore[misc]
|
def _load_checkpoint(key, **kw): return None # type: ignore[misc]
|
||||||
def _clear_checkpoint(): pass # type: ignore[misc]
|
def _clear_checkpoint(**kw): pass # type: ignore[misc]
|
||||||
def _load_delta_tokens(): return {} # type: ignore[misc]
|
def _load_delta_tokens(): return {} # type: ignore[misc]
|
||||||
def _save_delta_tokens(t): pass # type: ignore[misc]
|
def _save_delta_tokens(t): pass # type: ignore[misc]
|
||||||
|
|
||||||
@ -209,6 +209,23 @@ def run_file_scan(source: dict):
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("[db] start_scan failed: %s", e)
|
logger.error("[db] start_scan failed: %s", e)
|
||||||
|
|
||||||
|
# ── Checkpoint: resume from a previous interrupted file scan ─────────────
|
||||||
|
_ck_prefix = f"file_{source.get('id', 'local')}"
|
||||||
|
_ck_key = _checkpoint_key({"sources": [source.get("source_type", "local")], "user_ids": [source.get("id", path)], "options": {}})
|
||||||
|
_ck = _load_checkpoint(_ck_key, prefix=_ck_prefix)
|
||||||
|
_file_scanned_ids: set = set(_ck["scanned_ids"]) if _ck else set()
|
||||||
|
_file_flagged: list = [] # items found by this file scan run (for checkpoint)
|
||||||
|
_ck_resumed = len(_file_scanned_ids)
|
||||||
|
if _ck:
|
||||||
|
_file_flagged = list(_ck.get("flagged", []))
|
||||||
|
for card in _file_flagged:
|
||||||
|
_state.flagged_items.append(card)
|
||||||
|
broadcast("scan_phase", {"phase": LANG.get("m365_resuming", f"Resuming \u2014 skipping {_ck_resumed} already-scanned items\u2026")})
|
||||||
|
for card in _file_flagged:
|
||||||
|
broadcast("scan_file_flagged", _with_disposition(card, _db))
|
||||||
|
_CHECKPOINT_SAVE_EVERY_FILE = 25
|
||||||
|
_file_items_since_save = 0
|
||||||
|
|
||||||
total_scanned = 0
|
total_scanned = 0
|
||||||
total_flagged = 0
|
total_flagged = 0
|
||||||
|
|
||||||
@ -247,6 +264,10 @@ def run_file_scan(source: dict):
|
|||||||
if _state._scan_abort.is_set():
|
if _state._scan_abort.is_set():
|
||||||
break
|
break
|
||||||
|
|
||||||
|
if rel_path in _file_scanned_ids:
|
||||||
|
total_scanned += 1  # count items restored from the checkpoint so progress continues from the resume point
|
||||||
|
continue
|
||||||
|
|
||||||
total_scanned += 1
|
total_scanned += 1
|
||||||
broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
|
broadcast("scan_progress", {"scanned": total_scanned, "flagged": total_flagged, "file": rel_path, "pct": min(90, 10 + total_scanned // 10), "source": "file"})
|
||||||
|
|
||||||
@ -353,6 +374,7 @@ def run_file_scan(source: dict):
|
|||||||
}
|
}
|
||||||
|
|
||||||
_state.flagged_items.append(card)
|
_state.flagged_items.append(card)
|
||||||
|
_file_flagged.append(card)
|
||||||
total_flagged += 1
|
total_flagged += 1
|
||||||
broadcast("scan_file_flagged", _with_disposition(card, _db))
|
broadcast("scan_file_flagged", _with_disposition(card, _db))
|
||||||
|
|
||||||
@ -362,10 +384,19 @@ def run_file_scan(source: dict):
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("[db] save_item failed: %s", e)
|
logger.error("[db] save_item failed: %s", e)
|
||||||
|
|
||||||
|
_file_scanned_ids.add(rel_path)
|
||||||
|
_file_items_since_save += 1
|
||||||
|
if _file_items_since_save >= _CHECKPOINT_SAVE_EVERY_FILE:
|
||||||
|
_save_checkpoint(_ck_key, _file_scanned_ids, _file_flagged, _state.scan_meta, prefix=_ck_prefix)
|
||||||
|
_file_items_since_save = 0
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
import traceback
|
import traceback
|
||||||
broadcast("scan_error", {"file": label, "error": str(e)})
|
broadcast("scan_error", {"file": label, "error": str(e)})
|
||||||
logger.error("[file_scan] error:\n%s", traceback.format_exc())
|
logger.error("[file_scan] error:\n%s", traceback.format_exc())
|
||||||
|
else:
|
||||||
|
if not _state._scan_abort.is_set():
|
||||||
|
_clear_checkpoint(prefix=_ck_prefix)
|
||||||
finally:
|
finally:
|
||||||
if _db and _db_scan_id:
|
if _db and _db_scan_id:
|
||||||
try:
|
try:
|
||||||
|
|||||||
@ -136,26 +136,39 @@ function buildScanPayload() {
|
|||||||
return { sources, fileSources, allSources, googleSources, user_ids, options };
|
return { sources, fileSources, allSources, googleSources, user_ids, options };
|
||||||
}
|
}
|
||||||
|
|
||||||
async function checkCheckpoint() {
|
async function checkCheckpoint(onNoCheckpoint) {
|
||||||
const payload = buildScanPayload();
|
const payload = buildScanPayload();
|
||||||
if (!payload.sources.length && !payload.fileSources.length) return;
|
const banner = document.getElementById('resumeBanner');
|
||||||
if (payload.sources.length && !payload.user_ids.length) return;
|
const hasSources = payload.sources.length > 0 || payload.fileSources.length > 0 || payload.googleSources.length > 0;
|
||||||
|
if (!hasSources) {
|
||||||
|
if (banner) banner.style.display = 'none';
|
||||||
|
onNoCheckpoint?.(); return;
|
||||||
|
}
|
||||||
|
// M365 sources without users — scan button will handle the alert
|
||||||
|
if (payload.sources.length && !payload.user_ids.length && !payload.googleSources.length) {
|
||||||
|
if (banner) banner.style.display = 'none';
|
||||||
|
onNoCheckpoint?.(); return;
|
||||||
|
}
|
||||||
|
// Collect Google user emails for server-side checkpoint key computation
|
||||||
|
const googleUserEmails = payload.googleSources.length > 0
|
||||||
|
? (S._allUsers || []).filter(u => u.selected !== false && (u.platform === 'google' || u.platform === 'both')).map(u => u.email || u.id).filter(Boolean)
|
||||||
|
: [];
|
||||||
try {
|
try {
|
||||||
const r = await fetch('/api/scan/checkpoint', {
|
const r = await fetch('/api/scan/checkpoint', {
|
||||||
method: 'POST', headers: {'Content-Type':'application/json'},
|
method: 'POST', headers: {'Content-Type':'application/json'},
|
||||||
body: JSON.stringify(payload)
|
body: JSON.stringify({...payload, googleUserEmails})
|
||||||
});
|
});
|
||||||
const d = await r.json();
|
const d = await r.json();
|
||||||
const banner = document.getElementById('resumeBanner');
|
|
||||||
if (d.exists) {
|
if (d.exists) {
|
||||||
const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
|
const ts = d.started_at ? new Date(d.started_at * 1000).toLocaleString([], {dateStyle:'short', timeStyle:'short'}) : '';
|
||||||
document.getElementById('resumeBannerText').textContent =
|
document.getElementById('resumeBannerText').textContent =
|
||||||
t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
|
t('m365_resume_banner', `Previous scan interrupted (${d.scanned_count} scanned, ${d.flagged_count} found${ts ? ' — ' + ts : ''})`);
|
||||||
banner.style.display = 'flex';
|
if (banner) banner.style.display = 'flex';
|
||||||
} else {
|
} else {
|
||||||
banner.style.display = 'none';
|
if (banner) banner.style.display = 'none';
|
||||||
|
onNoCheckpoint?.();
|
||||||
}
|
}
|
||||||
} catch(e) { /* ignore */ }
|
} catch(e) { onNoCheckpoint?.(); }
|
||||||
}
|
}
|
||||||
|
|
||||||
async function clearCheckpointAndScan() {
|
async function clearCheckpointAndScan() {
|
||||||
|
|||||||
@ -302,7 +302,7 @@ document.addEventListener('DOMContentLoaded', applyI18n);
|
|||||||
<!-- Topbar -->
|
<!-- Topbar -->
|
||||||
<div class="topbar">
|
<div class="topbar">
|
||||||
<span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
|
<span id="viewerBrand" style="display:none;font-size:15px;font-weight:600;color:var(--text);white-space:nowrap;margin-right:6px">🔍 GDPRScanner</span>
|
||||||
<button class="scan-btn" id="scanBtn" onclick="startScan()" data-i18n="m365_btn_scan">Scan</button>
|
<button class="scan-btn" id="scanBtn" onclick="checkCheckpoint(() => startScan(false))" data-i18n="m365_btn_scan">Scan</button>
|
||||||
<button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
|
<button class="stop-btn" id="stopBtn" style="display:none" onclick="stopScan()" data-i18n="m365_btn_stop">Stop</button>
|
||||||
|
|
||||||
<!-- Profile selector (15c) -->
|
<!-- Profile selector (15c) -->
|
||||||
|
|||||||
@ -22,7 +22,7 @@ import checkpoint
|
|||||||
@pytest.fixture(autouse=True)
|
@pytest.fixture(autouse=True)
|
||||||
def _isolate(tmp_path, monkeypatch):
|
def _isolate(tmp_path, monkeypatch):
|
||||||
"""Redirect all disk writes to a temp dir for each test."""
|
"""Redirect all disk writes to a temp dir for each test."""
|
||||||
monkeypatch.setattr(checkpoint, "_CHECKPOINT_PATH", tmp_path / "checkpoint.json")
|
monkeypatch.setattr(checkpoint, "_DATA_DIR", tmp_path)
|
||||||
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")
|
monkeypatch.setattr(checkpoint, "_DELTA_PATH", tmp_path / "delta.json")
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
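The `checkpoint.py` changes themselves are not shown in this diff, so the following is only a minimal sketch of how prefix-aware helpers could behave, based on the call sites above (`prefix=` keyword, default `"m365"`, one `checkpoint_{prefix}.json` file per engine). `DATA_DIR`, the `"key"` field, and the function names without the leading underscore are assumptions for illustration, not the project's actual implementation.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for checkpoint._DATA_DIR (the tests above monkeypatch that attribute).
DATA_DIR = Path(tempfile.mkdtemp())


def checkpoint_path(prefix: str = "m365") -> Path:
    """Return the per-engine checkpoint file, e.g. checkpoint_google.json."""
    return DATA_DIR / f"checkpoint_{prefix}.json"


def save_checkpoint(key, scanned_ids, flagged, meta, *, prefix="m365"):
    """Write the checkpoint via a temp file so a crash cannot leave a half-written file."""
    payload = {
        "key": key,  # assumed field: lets load_checkpoint reject a checkpoint from a different scan
        "scanned_ids": sorted(scanned_ids),
        "flagged": list(flagged),
        "meta": dict(meta),
    }
    path = checkpoint_path(prefix)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(path)


def load_checkpoint(key, *, prefix="m365"):
    """Return the saved payload, or None if the file is missing, unreadable, or has another key."""
    try:
        data = json.loads(checkpoint_path(prefix).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if data.get("key") == key else None


def clear_checkpoint(*, prefix="m365"):
    """Remove one engine's checkpoint; other engines' files are left untouched."""
    checkpoint_path(prefix).unlink(missing_ok=True)
```

Under these assumptions, a file scan for source `src1` would call `save_checkpoint(key, ids, cards, meta, prefix="file_src1")`, while existing M365 call sites that omit `prefix` keep reading and writing `checkpoint_m365.json`.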