Built-in file redaction for local files

StyxX65 2026-05-27 14:49:06 +02:00
parent c490b3d76a
commit 23b9555dcf
11 changed files with 576 additions and 20 deletions

@ -7,6 +7,28 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
---
## [Unreleased]
### Added
- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
- **`POST /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route is now registered with `methods=["POST"]` and works correctly.
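
The temp-file-then-move pattern described in the redaction entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `routes/export.py`: `atomic_rewrite` and `transform` are hypothetical names, and the sketch uses `os.replace` where the route itself uses `shutil.move`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_rewrite(path: Path, transform: Callable[[str], str]) -> None:
    """Rewrite `path` through a sibling temp file (hypothetical helper)."""
    text = path.read_text(encoding="utf-8")
    # Create the temp file next to the original so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=path.suffix, dir=path.parent)
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(transform(text))
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    finally:
        # The temp file only still exists here if something failed.
        if tmp_path.exists():
            tmp_path.unlink()
```

Because the original is only replaced after the new content has been fully written, a failure inside `transform` leaves the original file untouched.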
---
## [1.6.27] — 2026-05-27
### Added
- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.
- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).
- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.
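
The excerpt fallback described in the first entry can be pictured roughly like this. It is a simplified sketch, not the route code: `render_excerpt`, `preview_with_fallback`, and the `fetch_full` callable are hypothetical stand-ins for `_excerpt_page()` and the Graph request.

```python
import html
from typing import Callable, Optional


def render_excerpt(excerpt: str) -> str:
    """Escape the stored excerpt so it is safe to embed in HTML."""
    return "<pre>" + html.escape(excerpt) + "</pre>"


def preview_with_fallback(fetch_full: Optional[Callable[[], str]], excerpt: str) -> dict:
    """Prefer the live message; fall back to the stored excerpt."""
    if fetch_full is not None:
        try:
            return {"type": "html", "html": fetch_full()}
        except Exception:
            pass  # live fetch failed, fall through to the excerpt
    if excerpt:
        return {"type": "html", "html": render_excerpt(excerpt)}
    return {"error": "preview unavailable"}
```

Passing `None` for `fetch_full` models the "not authenticated" case; a callable that raises models an expired token or a failed request.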
---
## [1.6.26] — 2026-04-29
### Fixed

@ -50,6 +50,8 @@ python -m pytest tests/ -q
182 tests in `tests/`. No integration tests for live M365/Google connections.
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`.
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen` — all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag.
@ -111,6 +113,7 @@ Exception hierarchy (all inherit `M365Error(Exception)`):
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- **`body_excerpt` — 500-char plain-text preview stored per flagged email** — just before `del body_text` in M365 email processing, `meta["_body_excerpt"] = body_text[:500].strip()`. In `google_scan.py`, a regex HTML-strip of the first 3000 bytes of Gmail body data is stored the same way. `_broadcast_card` in both engines includes `"body_excerpt"` in the card dict so the excerpt flows into `flagged_items`, the checkpoint JSON, and the DB (`body_excerpt TEXT`, migration #10). The M365 email preview route falls back to `_excerpt_page()` when Graph raises or the connector is absent. The Gmail preview shows `_excerpt_page()` as primary content with the "Open in Gmail" link appended. Do not remove the excerpt before broadcasting — that's what makes preview work on checkpoint resume.
- **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
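
Two of the rules above, the excerpt capture and the `deque` drain, can be sketched together. This is an illustration only: `gmail_excerpt` and `drain` are hypothetical names, and the card is reduced to two fields. In the real loop the list name is rebound to the deque so that nothing else keeps the items alive; here the caller's list still holds them.

```python
import re
from collections import deque


def gmail_excerpt(raw: bytes, limit: int = 500) -> str:
    """Strip tags from the first 3000 bytes and collapse whitespace."""
    text = raw[:3000].decode("utf-8", errors="replace")
    plain = re.sub(r"<[^>]+>", " ", text)
    return " ".join(plain.split())[:limit]


def drain(work_items: list) -> list:
    """Process items one at a time, capturing an excerpt before the body is dropped."""
    queue = deque(work_items)
    cards = []
    while queue:
        meta = queue.popleft()  # the deque no longer references this item
        body_text = meta.pop("_precomputed_body", "")
        # Take the excerpt first; the full body is released right after.
        cards.append({"id": meta["id"], "body_excerpt": body_text[:500].strip()})
        del body_text
    return cards
```

For example, `gmail_excerpt(b"<p>Hi <b>there</b></p>")` yields `"Hi there"`.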
@ -124,6 +127,7 @@ Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them.
- **`POST /api/redact_item`** — rewrites a local file in-place with CPR numbers replaced by `██████-████` / `█` blocks, then removes the card from the grid and logs a `"redacted"` disposition. Supported extensions: `.docx`, `.xlsx`, `.csv`, `.txt` (`_REDACT_EXTS`). The file is written to a temp path in the **same directory** as the original before `shutil.move` — this avoids cross-device rename failures on mounted volumes. Uses existing `document_scanner` functions (`redact_docx`, `redact_xlsx`, `redact_csv`, `find_pii_spans_in_text`). Only works for `source_type == "local"` — SMB/cloud files are not supported (button is hidden on those cards). The button (`✂`, class `card-redact-btn`) appears in `appendCard` when the `_redactable` flag is true; hidden in viewer mode and for resolved items.
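
The character-level replacement used for plain-text files can be sketched as below. The regex is an assumption made for illustration: the project detects spans with `find_pii_spans_in_text`, which is not reproduced here, and `CPR_LIKE`, `find_cpr_like_spans`, and `redact_spans` are hypothetical names.

```python
import re

# Assumption: a loose DDMMYY-SSSS shape check, with no date or checksum validation.
CPR_LIKE = re.compile(r"\b\d{6}-\d{4}\b")


def find_cpr_like_spans(text: str) -> list:
    """Return (start, end) offsets of CPR-shaped substrings."""
    return [m.span() for m in CPR_LIKE.finditer(text)]


def redact_spans(text: str, spans: list) -> str:
    """Overwrite every character in each span with a block character."""
    chars = list(text)
    # Walking from the end keeps earlier offsets valid even if a
    # replacement ever changed the length of the text.
    for start, end in sorted(spans, reverse=True):
        chars[start:end] = ["█"] * (end - start)
    return "".join(chars)
```

Replacing one character with one block keeps the file length and layout stable, which helps when the text is column-aligned.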
## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py
@ -137,6 +141,7 @@ Allows reviewing results from any past scan session without running a new scan.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation** — `_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion.
- **Re-scan diff** — `loadHistorySession` fetches the immediately preceding session's items after rendering the current session. Items present in the previous session but absent from the current one (compared by `id`) are tagged `_resolved: true` and appended after a `.resolved-divider` separator. `appendCard` in `results.js` adds `.card-resolved` (opacity 0.6), a green `✓ Resolved` badge, and hides the delete button for resolved items. `_setHistoryBanner` accepts an optional `resolvedCount` parameter and appends it to the banner label. Resolved items are NOT added to `S.flaggedData` — they are grid-only and cannot be bulk-selected or exported.
- **Auto-load on page load** — `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
- **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
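
The re-scan diff rule above comes down to a set difference on item IDs. The project computes it client-side in `history.js`; the sketch below restates the same comparison in Python for illustration, with `resolved_items` as a hypothetical name.

```python
def resolved_items(previous: list, current: list) -> list:
    """Return items from the previous session that are absent from the current one."""
    current_ids = {item["id"] for item in current}
    resolved = []
    for item in previous:
        if item["id"] not in current_ids:
            # Copy the item and tag it, so the caller's data is not mutated.
            resolved.append({**item, "_resolved": True})
    return resolved
```

Items that appear only in the current session are ignored here, since they are already rendered as ordinary flagged cards.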

@ -202,6 +202,7 @@ _MIGRATIONS: list[tuple[int, str]] = [
(6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
(8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
(9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
(10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
(7, """CREATE TABLE IF NOT EXISTS schedule_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at REAL NOT NULL,
@ -314,8 +315,8 @@ class ScanDB:
url, drive_id, size_kb, modified, cpr_count, risk,
thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
special_category, face_count, exif_json, full_path,
email_count, phone_count, scanned_at) email_count, phone_count, body_excerpt, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""", VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
card.get("id", ""),
scan_id,
@ -341,6 +342,7 @@ class ScanDB:
card.get("full_path", ""),
card.get("email_count", 0),
card.get("phone_count", 0),
card.get("body_excerpt", ""),
now,
),
)

@ -344,6 +344,29 @@ def db_import():
return jsonify({"error": str(e)}), 500
def _excerpt_page(excerpt: str, item_meta: dict) -> str:
    """Minimal HTML page showing a stored body excerpt as a preview fallback."""
    import html as _html
    subject = _html.escape(item_meta.get("name", ""))
    modified = item_meta.get("modified", "")
    account = _html.escape(item_meta.get("account_name", ""))
    body = "<pre style='white-space:pre-wrap;font-family:sans-serif;margin:0'>" + _html.escape(excerpt) + "</pre>"
    note = "<p style='font-size:11px;color:#888;margin-top:12px'>Stored excerpt — connect to reload the full message.</p>"
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>body{font-family:-apple-system,sans-serif;font-size:13px;"
        "padding:12px 16px;background:#fff;color:#111;word-break:break-word}"
        ".hdr{border-bottom:1px solid #eee;margin-bottom:12px;padding-bottom:10px}"
        ".hdr-row{color:#555;font-size:12px;margin-bottom:3px}"
        ".hdr-row b{color:#111}</style></head><body>"
        f"<div class='hdr'>"
        + (f"<div class='hdr-row'><b>From:</b> {account}</div>" if account else "")
        + (f"<div class='hdr-row'><b>Date:</b> {_html.escape(modified)}</div>" if modified else "")
        + (f"<div class='hdr-row'><b>Subject:</b> {subject}</div>" if subject else "")
        + f"</div>{body}{note}</body></html>"
    )
@bp.route("/api/preview/<item_id>")
def get_preview(item_id):
    """Return a preview URL or HTML for a flagged item."""
@ -541,7 +564,11 @@ def get_preview(item_id):
try:
    if source_type == "email":
        excerpt = item_meta.get("body_excerpt", "")
        if not state.connector:
            if excerpt:
                # _excerpt_page() does its own escaping, so no extra import is needed here
                return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
            return jsonify({"error": "not authenticated"}), 401
        uid = account_id
        try:
@ -550,6 +577,8 @ def get_preview(item_id):
                {"$select": "subject,from,receivedDateTime,body"}
            )
        except Exception as e:
            if excerpt:
                return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
            return jsonify({"error": f"Could not load email: {e}"})
        sender = msg.get("from", {}).get("emailAddress", {})
@ -619,23 +648,33 @@ def get_preview(item_id):
return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"})
# Fallback: generic Drive embed
return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")})
# Gmail — not embeddable; show link card + stored body excerpt if available
icon = "✉️" if source_type == "gmail" else "☁️"
label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
excerpt = item_meta.get("body_excerpt", "")
link_html = (
    f'<a href="{_html_esc(item_url)}" target="_blank" '
    f'style="display:inline-block;margin-top:12px;padding:8px 16px;'
    f'background:#3b7dd8;color:#fff;border-radius:6px;text-decoration:none;font-size:12px">'
    f'{label}</a>'
) if item_url else ""
if excerpt and source_type == "gmail":
    html_out = _excerpt_page(excerpt, item_meta)
    if item_url:
        # Inject the "Open in Gmail" link before </body>
        html_out = html_out.replace(
            "</body>",
            f'<div style="margin-top:12px">{link_html}</div></body>'
        )
else:
    html_out = (
        f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
        f'<div style="font-size:40px">{icon}</div>'
        f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
        f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
        f'{link_html}'
        f'</div>'
    )
return jsonify({"type": "html", "html": html_out})
else:

@ -1158,6 +1158,7 @@ def export_article30():
return jsonify({"error": str(e)}), 500
@bp.route("/api/delete_item", methods=["POST"])
def delete_item():
    """Delete a single flagged item. Returns {ok, error}."""
    if not state.connector:
@ -1200,6 +1201,104 @@ def delete_item():
return jsonify({"ok": False, "error": str(e)})
_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt"}


@bp.route("/api/redact_item", methods=["POST"])
def redact_item():
    """Redact CPR numbers in-place in a local file. Returns {ok, redacted}."""
    from pathlib import Path as _Path
    import tempfile as _tempfile
    import shutil as _shutil

    data = request.get_json() or {}
    item_id = data.get("id", "")
    if not item_id:
        return jsonify({"ok": False, "error": "id required"}), 400

    # Resolve item meta: in-memory first (active scan), then DB (history)
    item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None)
    if item_meta is None:
        _db = _get_db() if DB_OK else None
        if _db:
            row = _db._connect().execute(
                "SELECT * FROM flagged_items WHERE id=? LIMIT 1", (item_id,)
            ).fetchone()
            item_meta = dict(row) if row else {}
        else:
            item_meta = {}

    source_type = item_meta.get("source_type", "")
    if source_type not in ("local",):
        return jsonify({"ok": False, "error": "Redaction is only supported for local files"}), 400

    full_path = item_meta.get("full_path", "")
    if not full_path:
        return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400

    path = _Path(full_path).expanduser()
    if not path.exists():
        return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404

    ext = path.suffix.lower()
    if ext not in _REDACT_EXTS:
        return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT"}), 400

    tmp_path = None
    try:
        from document_scanner import (
            scan_docx, redact_docx,
            scan_xlsx, redact_xlsx,
            redact_csv,
            find_pii_spans_in_text,
        )

        # Reserve a temp file next to the original, then close the handle
        # before the redact helpers write to it.
        with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp:
            tmp_path = _Path(tmp.name)

        if ext == ".docx":
            results = scan_docx(path)
            redacted = redact_docx(path, tmp_path, results, use_ner=False)
        elif ext == ".xlsx":
            results = scan_xlsx(path)
            redacted = redact_xlsx(path, tmp_path, results, use_ner=False)
        elif ext == ".csv":
            redacted = redact_csv(path, tmp_path, use_ner=False)
        else:  # .txt
            text = path.read_text(encoding="utf-8", errors="replace")
            spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
            chars = list(text)
            # Overwrite each CPR span character-for-character with block characters
            for s, e, _ in sorted(spans, reverse=True):
                chars[s:e] = ["█"] * (e - s)
            tmp_path.write_text("".join(chars), encoding="utf-8")
            redacted = len(spans)

        _shutil.move(str(tmp_path), str(path))
        tmp_path = None  # moved into place; nothing left to clean up

        state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id]
        _db = _get_db() if DB_OK else None
        if _db:
            try:
                _db.log_deletion(item_meta, reason="redacted")
                _db.delete_item_record(item_id)
            except Exception:
                pass

        logger.info("[redact] %s: %d CPR span(s) redacted", path.name, redacted)
        return jsonify({"ok": True, "redacted": redacted})
    except Exception as e:
        logger.error("[redact] failed: %s", e)
        return jsonify({"ok": False, "error": str(e)})
    finally:
        # Remove the temp file if the move never happened
        if tmp_path and tmp_path.exists():
            try:
                tmp_path.unlink()
            except Exception:
                pass
@bp.route("/api/delete_bulk", methods=["POST"])
def delete_bulk():
    """Delete multiple items matching criteria. Streams progress as SSE."""

@ -255,6 +255,7 @ def _run_google_scan(options: dict):
"special_category": [],
"face_count": 0,
"exif": {},
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
flagged_items.append(card)
_google_flagged.append(card)
@ -305,6 +306,14 @@ def _run_google_scan(options: dict):
try:
    meta["_account"] = _display_name
    meta["_source_type"] = "gmail"
    # Extract a plain-text excerpt before scanning (body is discarded after)
    try:
        import re as _re
        _raw = data[:3000].decode("utf-8", errors="replace")
        _plain = _re.sub(r"<[^>]+>", " ", _raw)
        meta["_body_excerpt"] = " ".join(_plain.split())[:500]
    except Exception:
        meta["_body_excerpt"] = ""
    result = _scan_bytes(data, meta.get("name", "msg.txt"))
except Exception as e:
    broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})

@ -549,6 +549,7 @ def run_scan(options: dict):
"special_category": item_meta.get("_special_category", []),
"face_count": item_meta.get("_face_count", 0),
"exif": item_meta.get("_exif", {}),
"body_excerpt": item_meta.get("_body_excerpt", ""),
}
_state.flagged_items.append(card)
broadcast("scan_file_flagged", _with_disposition(card, _db))
@ -1153,6 +1154,8 @@ def run_scan(options: dict):
meta["_transfer_risk"] = _check_transfer_risk(meta)
meta["_special_category"] = _check_special_category(
    body_text if scan_email_body else "", all_cprs)
# Store a short excerpt so preview still works if Graph is unavailable
meta["_body_excerpt"] = body_text[:500].strip() if body_text else ""
_broadcast_card(meta, all_cprs, pii_counts=_email_pii)
del body_text # free email text — may be large for HTML-rich emails

@ -82,6 +82,31 @@ async function loadHistorySession(refScanId) {
try { window.markOverdueCards(); } catch(_) {}
try { window.loadTrend(); } catch(_) {}
_setHistoryBanner(true, resolvedRef);
// ── Re-scan diff: append items from previous session no longer present ────
const allSessions = _sessions !== null ? _sessions : await _fetchSessions();
const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef);
if (idx !== -1 && idx + 1 < allSessions.length) {
  // Sessions are ordered newest first, so idx + 1 is the preceding session
  const prevRef = allSessions[idx + 1].ref_scan_id;
  try {
    const pr = await fetch('/api/db/flagged?ref=' + encodeURIComponent(prevRef));
    const prevItems = await pr.json();
    if (Array.isArray(prevItems) && prevItems.length) {
      const currentIds = new Set(items.map(f => f.id));
      const resolved = prevItems.filter(f => !currentIds.has(f.id));
      if (resolved.length) {
        const divider = document.createElement('div');
        divider.className = 'resolved-divider';
        divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present');
        document.getElementById('grid')?.appendChild(divider);
        resolved.forEach(f => { f._resolved = true; window.appendCard(f); });
        _setHistoryBanner(true, resolvedRef, resolved.length);
      }
    }
  } catch(e) {
    console.warn('[history] diff failed:', e);
  }
}
} catch(e) {
  console.error('[history] failed to load session:', e);
}
@ -89,7 +114,7 @@ async function loadHistorySession(refScanId) {
// ── Banner ────────────────────────────────────────────────────────────────────
function _setHistoryBanner(visible, resolvedRef) { function _setHistoryBanner(visible, resolvedRef, resolvedCount) {
const banner = document.getElementById('historyBanner');
const bannerTxt = document.getElementById('historyBannerText');
const latestBtn = document.getElementById('historyLatestBtn');
@ -107,6 +132,7 @@ function _setHistoryBanner(visible, resolvedRef) {
label = date + ' ' + time
  + (srcStr ? ' · ' + srcStr : '')
  + ' · ' + sess.flagged_count + ' ' + t('history_items', 'items');
if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved');
} else {
label = S.flaggedData.length + ' ' + t('history_items', 'items');
}


@ -24,7 +24,7 @@ function appendCard(f) {
: '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type);
const card = document.createElement('div');
-card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '');
+card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + (f._resolved ? ' card-resolved' : '');
card.dataset.id = f.id;
card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } };
@ -35,7 +35,11 @@ function appendCard(f) {
cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); };
card.appendChild(cb);
-const delBtn = window.VIEWER_MODE ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
+const delBtn = (window.VIEWER_MODE || f._resolved) ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv']);
const _redactable = !window.VIEWER_MODE && !f._resolved && f.source_type === 'local' && f.cpr_count > 0
&& _redactExts.has((f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase());
const redactBtn = _redactable ? `<button class="card-redact-btn" title="${t('redact_btn','Redact CPR')}" onclick="event.stopPropagation();redactItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">✏</button>` : '';
if (S.isListView) {
card.innerHTML = `
@ -50,8 +54,8 @@ function appendCard(f) {
${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
-${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
+${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f._resolved ? '<span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
-${delBtn}`;
+${delBtn}${redactBtn}`;
} else {
card.innerHTML = `
<div class="thumb-wrap"><img src="${src}" alt="${f.name}" loading="lazy"></div>
@ -60,9 +64,9 @@ function appendCard(f) {
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
-<span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
+<span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f._resolved ? ' <span class="resolved-badge"> ' + t('history_resolved_badge', 'Resolved') + '</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
</div>
-${delBtn}`;
+${delBtn}${redactBtn}`;
}
grid.appendChild(card);
}
@ -594,6 +598,32 @@ async function deleteItem(f, cardEl) {
}
}
async function redactItem(f, cardEl) {
if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. This cannot be undone.'))) return;
if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; }
try {
const r = await fetch('/api/redact_item', {
method: 'POST', headers: {'Content-Type': 'application/json'},
body: JSON.stringify({id: f.id, source_type: f.source_type})
});
const d = await r.json();
if (d.ok) {
S.flaggedData = S.flaggedData.filter(x => x.id !== f.id);
S.filteredData = S.filteredData.filter(x => x.id !== f.id);
if (cardEl) cardEl.remove();
updateStats();
log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok');
if (_previewItemId === f.id) closePreview();
} else {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err');
}
} catch(e) {
if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err');
}
}
// ── Bulk delete modal ─────────────────────────────────────────────────────────
function openBulkDelete() {
@ -1049,6 +1079,7 @@ window.loadDisposition = loadDisposition;
window.saveDisposition = saveDisposition;
window.closePreview = closePreview;
window.deleteItem = deleteItem;
window.redactItem = redactItem;
window.openBulkDelete = openBulkDelete;
window.closeBulkDelete = closeBulkDelete;
window._bdFilters = _bdFilters;
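
The `redactItem` handler above posts to `/api/redact_item`, which the changelog describes as writing a temp file in the same directory and then moving it over the original. The server-side code is not part of this listing, so the following is only a minimal sketch of that write-then-replace idea for a plain-text file. The `CPR_RE` pattern and the `redact_text_file` helper are assumptions made for illustration; the project's `document_scanner` module has its own detection and redact functions.

```python
import os
import re
import tempfile

# Hypothetical CPR-like pattern, for illustration only.
CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def redact_text_file(path: str) -> int:
    """Mask CPR-like spans in a UTF-8 text file; return how many were masked."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding="utf-8", newline="") as fh:
        original = fh.read()
    redacted, count = CPR_RE.subn(lambda m: "█" * len(m.group()), original)

    # The temp file lives next to the target so os.replace() stays on one
    # filesystem and swaps the file in a single step.
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.write(redacted)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        os.unlink(tmp_path)
        raise
    return count
```

Because the original is only replaced after the temp file has been fully written and closed, an interruption during the write leaves the original file as it was.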


@ -253,6 +253,9 @@
.card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-delete-btn { opacity:1; }
.card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; }
.card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-redact-btn { opacity:1; }
.card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; }
/* Per-card checkbox (select mode) */
.card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2;
@ -491,6 +494,12 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; }
.card-resolved { opacity: 0.6; }
.resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px;
color: var(--muted); border-top: 1px dashed var(--border); text-align: center; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }

tests/test_google_scan.py Normal file

@ -0,0 +1,311 @@
"""
Route and engine tests for the Google Workspace scan module.

Covers:
  - GET /api/google/scan/users auth guard, user list, error propagation
  - POST /api/google/scan/start auth guard, concurrency lock, successful start, lock release
  - POST /api/google/scan/cancel abort signal
  - _run_google_scan no-connector broadcast, CPR hit flagging, source_type tagging
"""
from __future__ import annotations

import threading
import time
from unittest.mock import MagicMock

import pytest


# ── Fixtures ──────────────────────────────────────────────────────────────────

@pytest.fixture(scope="module")
def flask_app():
    import gdpr_scanner
    gdpr_scanner.app.config["TESTING"] = True
    gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
    return gdpr_scanner.app


@pytest.fixture()
def client(flask_app):
    with flask_app.test_client() as c:
        yield c


@pytest.fixture()
def mock_google_connector(monkeypatch):
    from routes import state
    conn = MagicMock()
    conn.list_users.return_value = []
    monkeypatch.setattr(state, "google_connector", conn)
    return conn


@pytest.fixture(autouse=True)
def clean_google_state():
    yield
    from routes import state
    # Release the Google scan lock if a test left it acquired
    acquired = state._google_scan_lock.acquire(blocking=False)
    if acquired:
        state._google_scan_lock.release()
    state._google_scan_abort.clear()


# ── GET /api/google/scan/users ────────────────────────────────────────────────

class TestGoogleScanUsers:

    def test_not_connected_returns_401(self, client, monkeypatch):
        from routes import state
        monkeypatch.setattr(state, "google_connector", None)
        r = client.get("/api/google/scan/users")
        assert r.status_code == 401
        assert r.json["error"] == "not connected"

    def test_returns_user_list(self, client, mock_google_connector):
        mock_google_connector.list_users.return_value = [
            {"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"},
        ]
        r = client.get("/api/google/scan/users")
        assert r.status_code == 200
        assert len(r.json["users"]) == 1
        assert r.json["users"][0]["email"] == "alice@test.dk"

    def test_returns_empty_list_when_no_users(self, client, mock_google_connector):
        mock_google_connector.list_users.return_value = []
        r = client.get("/api/google/scan/users")
        assert r.status_code == 200
        assert r.json["users"] == []

    def test_connector_error_returns_500(self, client, mock_google_connector):
        mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable")
        r = client.get("/api/google/scan/users")
        assert r.status_code == 500
        assert "error" in r.json


# ── POST /api/google/scan/start ───────────────────────────────────────────────

class TestGoogleScanStart:

    def test_not_connected_returns_401(self, client, monkeypatch):
        from routes import state
        monkeypatch.setattr(state, "google_connector", None)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 401
        assert "not connected" in r.json["error"]

    def test_already_running_returns_409(self, client, mock_google_connector):
        from routes import state
        state._google_scan_lock.acquire()
        try:
            r = client.post("/api/google/scan/start", json={})
            assert r.status_code == 409
            assert "already running" in r.json["error"]
        finally:
            state._google_scan_lock.release()

    def test_starts_successfully(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert r.json["status"] == "started"

    def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        state._google_scan_abort.set()
        monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
        client.post("/api/google/scan/start", json={})
        assert not state._google_scan_abort.is_set()

    def test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        done = threading.Event()

        def _fake_scan(opts):
            time.sleep(0.02)
            done.set()

        monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert done.wait(timeout=3), "Scan thread did not complete in time"
        time.sleep(0.05)  # allow finally block to run
        acquired = state._google_scan_lock.acquire(blocking=False)
        assert acquired, "Lock was not released after scan completed"
        state._google_scan_lock.release()

    @pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning")
    def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        done = threading.Event()

        def _failing_scan(opts):
            done.set()
            raise RuntimeError("simulated crash")

        monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert done.wait(timeout=3), "Scan thread did not complete in time"
        time.sleep(0.05)
        acquired = state._google_scan_lock.acquire(blocking=False)
        assert acquired, "Lock was not released after scan raised an exception"
        state._google_scan_lock.release()


# ── POST /api/google/scan/cancel ─────────────────────────────────────────────

class TestGoogleScanCancel:

    def test_sets_abort_event(self, client):
        from routes import state
        state._google_scan_abort.clear()
        r = client.post("/api/google/scan/cancel")
        assert r.status_code == 200
        assert r.json["status"] == "cancelling"
        assert state._google_scan_abort.is_set()

    def test_idempotent_when_not_running(self, client):
        r = client.post("/api/google/scan/cancel")
        assert r.status_code == 200
        assert r.json["status"] == "cancelling"


# ── _run_google_scan engine ───────────────────────────────────────────────────

class TestRunGoogleScan:
    """
    Unit-tests for _run_google_scan() called synchronously with all heavy
    dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O.
    """

    def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None):
        import gdpr_scanner
        import checkpoint
        import scan_engine
        import gdpr_db
        from routes import state
        events = []
        monkeypatch.setattr(state, "google_connector", conn)
        monkeypatch.setattr(gdpr_scanner, "broadcast",
                            lambda evt, data=None: events.append((evt, data or {})))
        monkeypatch.setattr(gdpr_scanner, "_scan_bytes",
                            lambda data, name: scan_bytes_result or {
                                "cprs": [], "pii_counts": None, "emails": [], "phones": []
                            })
        monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {})
        monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None)
        monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card)
        monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None)
        gdpr_scanner.flagged_items.clear()
        return events

    def _run(self, monkeypatch, conn, options, scan_bytes_result=None):
        import gdpr_scanner
        import routes.google_scan as gs
        events = self._setup_mocks(monkeypatch, conn, scan_bytes_result)
        gs._run_google_scan(options)
        gdpr_scanner.flagged_items.clear()
        return events

    def test_no_connector_broadcasts_error_and_done(self, monkeypatch):
        import gdpr_scanner
        import routes.google_scan as gs
        from routes import state
        events = []
        monkeypatch.setattr(state, "google_connector", None)
        monkeypatch.setattr(gdpr_scanner, "broadcast",
                            lambda evt, data=None: events.append((evt, data or {})))
        gs._run_google_scan({"sources": ["gmail"], "user_emails": ["a@b.dk"], "options": {}})
        assert any(evt == "scan_error" for evt, _ in events)
        assert any(evt == "google_scan_done" for evt, _ in events)

    def test_gmail_item_with_cpr_is_flagged(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"),
        ]
        cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)
        flagged = [d for evt, d in events if evt == "scan_file_flagged"]
        assert len(flagged) == 1

    def test_gmail_item_source_type_is_gmail(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"),
        ]
        cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)
        flagged = [d for evt, d in events if evt == "scan_file_flagged"]
        assert flagged[0]["source_type"] == "gmail"

    def test_gmail_item_without_pii_not_flagged(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"),
        ]
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
        assert not any(evt == "scan_file_flagged" for evt, _ in events)

    def test_gdrive_item_source_type_is_gdrive(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = []
        conn.iter_drive_files.return_value = [
            ({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"),
        ]
        cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)
        gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"]
        assert len(gdrive) == 1

    def test_scan_done_always_broadcast(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = []
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})
        done = [d for evt, d in events if evt == "google_scan_done"]
        assert len(done) == 1
        assert "flagged_count" in done[0]
        assert "total_scanned" in done[0]

    def test_scan_done_counts_are_correct(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "m1", "name": "a.txt", "size": 100}, b"x"),
            ({"id": "m2", "name": "b.txt", "size": 100}, b"y"),
        ]
        cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)
        done = next(d for evt, d in events if evt == "google_scan_done")
        assert done["total_scanned"] == 2
        assert done["flagged_count"] == 2
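
The start-route tests above expect three behaviours: a 409 while a scan is already running, the abort event cleared on start, and the lock released even when the scan raises. The route itself is not shown in this diff, so the sketch below is only one plausible shape for it. `scan_lock`, `abort_event` and `start_scan` are hypothetical stand-ins for `state._google_scan_lock`, `state._google_scan_abort` and the Flask handler.

```python
import threading

scan_lock = threading.Lock()     # hypothetical stand-in for state._google_scan_lock
abort_event = threading.Event()  # hypothetical stand-in for state._google_scan_abort


def start_scan(run_scan, options):
    """Run run_scan(options) in a background thread; return (payload, status)."""
    # Non-blocking acquire: a second caller gets 409 instead of waiting.
    if not scan_lock.acquire(blocking=False):
        return {"error": "scan already running"}, 409
    abort_event.clear()

    def _worker():
        try:
            run_scan(options)
        finally:
            # Runs on success and on exception, so the lock never sticks.
            scan_lock.release()

    threading.Thread(target=_worker, daemon=True).start()
    return {"status": "started"}, 200
```

Releasing in `finally` inside the worker, rather than in the request handler, is what lets the handler return immediately while the lock is still held for the lifetime of the scan.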