Built-in file redaction for local files
parent c490b3d76a
commit 23b9555dcf
22
CHANGELOG.md
@@ -7,6 +7,28 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
---
## [Unreleased]
### Added
- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
- **`POST /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The decorator (`@bp.route("/api/delete_item", methods=["POST"])`) has been restored and the route now resolves.
---
## [1.6.27] — 2026-05-27
### Added
- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.
- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).
- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.
---
## [1.6.26] — 2026-04-29
### Fixed
@@ -50,6 +50,8 @@ python -m pytest tests/ -q
182 tests in `tests/`. No integration tests for live M365/Google connections.
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`.
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen` — all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag.
@@ -111,6 +113,7 @@ Exception hierarchy (all inherit `M365Error(Exception)`):
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- **`body_excerpt` — 500-char plain-text preview stored per flagged email** — just before `del body_text` in M365 email processing, `meta["_body_excerpt"] = body_text[:500].strip()`. In `google_scan.py`, a regex HTML-strip of the first 3000 bytes of Gmail body data is stored the same way. `_broadcast_card` in both engines includes `"body_excerpt"` in the card dict so the excerpt flows into `flagged_items`, the checkpoint JSON, and the DB (`body_excerpt TEXT`, migration #10). The M365 email preview route falls back to `_excerpt_page()` when Graph raises or the connector is absent. The Gmail preview shows `_excerpt_page()` as primary content with the "Open in Gmail" link appended. Do not remove the excerpt before broadcasting — that's what makes preview work on checkpoint resume.
- **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
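The `deque` rule above can be illustrated with a minimal sketch. The function name `drain` and the `handle` callback are hypothetical and do not come from the scan engine; clearing the source list is also an assumption about how the caller drops its own references.

```python
from collections import deque


def drain(work_items, handle):
    """Process items front to back, releasing each one right after use."""
    queue = deque(work_items)
    work_items.clear()  # assumption: the caller no longer needs the list
    processed = 0
    while queue:
        item = queue.popleft()  # the deque drops its reference here
        handle(item)
        del item  # nothing else keeps the payload alive past this point
        processed += 1
    return processed
```

Iterating with `enumerate()` over a list would keep every item referenced until the loop ends, which is the behaviour the rule is meant to avoid.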
@@ -124,6 +127,7 @@ Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them.
- **`POST /api/redact_item`** — rewrites a local file in-place with CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█` blocks (CSV/TXT), then removes the item from `state.flagged_items` and logs a `"redacted"` disposition; the client removes the card from the grid. Supported extensions: `.docx`, `.xlsx`, `.csv`, `.txt` (`_REDACT_EXTS`). The redacted copy is written to a temp path in the **same directory** as the original and then moved over it with `shutil.move`; keeping both paths on one filesystem avoids cross-device rename failures on mounted volumes. Uses existing `document_scanner` functions (`scan_docx`, `redact_docx`, `scan_xlsx`, `redact_xlsx`, `redact_csv`, `find_pii_spans_in_text`). Only works for `source_type == "local"`; SMB/cloud files are not supported and the button is hidden on those cards. The button (`✂`, class `card-redact-btn`) is rendered by `appendCard` when the `_redactable` flag is true (local source, supported extension, `cpr_count > 0`); it is hidden in viewer mode and for resolved items.
## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py
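The temp-file-then-move pattern described for `POST /api/redact_item` can be sketched in isolation. This is a minimal illustration, not the route's code: `rewrite_in_place` and `transform` are hypothetical names, and it swaps in `os.replace` where the route uses `shutil.move`.

```python
import os
import tempfile
from pathlib import Path


def rewrite_in_place(path, transform):
    """Rewrite a text file via a sibling temp file, then swap it in."""
    path = Path(path)
    # The temp file lives next to the original so the final rename
    # never has to cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(suffix=path.suffix, dir=path.parent)
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            out.write(transform(path.read_text(encoding="utf-8")))
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    finally:
        # Only reached with a leftover temp file if something failed above.
        if tmp_path.exists():
            tmp_path.unlink()
```

If `transform` raises, the original file is untouched and the temp file is cleaned up, which mirrors the `finally` block in the route.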
@@ -137,6 +141,7 @@ Allows reviewing results from any past scan session without running a new scan.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation** — `_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion.
- **Re-scan diff** — `loadHistorySession` fetches the immediately preceding session's items after rendering the current session. Items present in the previous session but absent from the current one (compared by `id`) are tagged `_resolved: true` and appended after a `.resolved-divider` separator. `appendCard` in `results.js` adds `.card-resolved` (opacity 0.6), a green `✓ Resolved` badge, and hides the delete button for resolved items. `_setHistoryBanner` accepts an optional `resolvedCount` parameter and appends it to the banner label. Resolved items are NOT added to `S.flaggedData` — they are grid-only and cannot be bulk-selected or exported.
- **Auto-load on page load** — `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
- **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
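The re-scan diff rule (items in the previous session whose `id` is absent from the current one) reduces to a set difference. The sketch below restates it in Python for clarity; the real implementation is client-side JavaScript, and `resolved_items` is a hypothetical name.

```python
def resolved_items(previous, current):
    """Return copies of previous-session items missing from the current session."""
    current_ids = {item["id"] for item in current}
    # Tag each survivor of the filter the same way the grid does (_resolved).
    return [
        dict(item, _resolved=True)
        for item in previous
        if item["id"] not in current_ids
    ]
```

Unlike the browser code, this version returns tagged copies rather than mutating the fetched items.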
@@ -202,6 +202,7 @@ _MIGRATIONS: list[tuple[int, str]] = [
     (6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
     (8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
     (9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
+    (10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
     (7, """CREATE TABLE IF NOT EXISTS schedule_runs (
         id INTEGER PRIMARY KEY AUTOINCREMENT,
         started_at REAL NOT NULL,
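How `_MIGRATIONS` is applied is not shown in this diff. As a rough sketch of one common approach, assuming the applied version is tracked in SQLite's `PRAGMA user_version` (an assumption, as are the names `MIGRATIONS` and `apply_migrations`), numbered statements could be run like this:

```python
import sqlite3

# Hypothetical two-step history, not the project's real migration list.
MIGRATIONS = [
    (1, "CREATE TABLE flagged_items (id TEXT PRIMARY KEY)"),
    (2, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"),
]


def apply_migrations(conn, migrations):
    """Run every migration newer than the stored schema version, in order."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    applied = []
    for version, sql in sorted(migrations):
        if version > current:
            conn.execute(sql)
            # PRAGMA does not accept bound parameters, hence the int() cast.
            conn.execute(f"PRAGMA user_version = {int(version)}")
            applied.append(version)
    conn.commit()
    return applied
```

Sorting by version matters here because the list in the diff is not in numeric order (migration 7 follows migration 10).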
@@ -314,8 +315,8 @@ class ScanDB:
                     url, drive_id, size_kb, modified, cpr_count, risk,
                     thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
                     special_category, face_count, exif_json, full_path,
-                    email_count, phone_count, scanned_at)
-                   VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
+                    email_count, phone_count, body_excerpt, scanned_at)
+                   VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
                 (
                     card.get("id", ""),
                     scan_id,
@@ -341,6 +342,7 @@ class ScanDB:
                     card.get("full_path", ""),
                     card.get("email_count", 0),
                     card.get("phone_count", 0),
+                    card.get("body_excerpt", ""),
                     now,
                 ),
             )
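Adding a column means touching the column list, the `?` placeholders, and the value tuple together; a mismatch raises at runtime. One way to keep them in step, shown only as a sketch with hypothetical names (`COLUMNS`, `build_insert`, `card_row`) and a shortened column list, is to derive all three from a single sequence:

```python
import sqlite3

# Shortened, hypothetical column list; the real table has many more columns.
COLUMNS = ["id", "email_count", "phone_count", "body_excerpt", "scanned_at"]
DEFAULTS = {"email_count": 0, "phone_count": 0}


def build_insert(table, columns):
    """Build an INSERT whose placeholder count always matches the columns."""
    placeholders = ",".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def card_row(card, columns, now):
    """Pull values from a card dict in column order, filling defaults."""
    row = []
    for col in columns:
        if col == "scanned_at":
            row.append(now)
        else:
            row.append(card.get(col, DEFAULTS.get(col, "")))
    return tuple(row)
```

With this shape, a new column such as `body_excerpt` is a one-line change to the list rather than three coordinated edits.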
@@ -344,6 +344,29 @@ def db_import():
         return jsonify({"error": str(e)}), 500


+def _excerpt_page(excerpt: str, item_meta: dict) -> str:
+    """Minimal HTML page showing a stored body excerpt as a preview fallback."""
+    import html as _html
+    subject = _html.escape(item_meta.get("name", ""))
+    modified = item_meta.get("modified", "")
+    account = _html.escape(item_meta.get("account_name", ""))
+    body = "<pre style='white-space:pre-wrap;font-family:sans-serif;margin:0'>" + _html.escape(excerpt) + "</pre>"
+    note = "<p style='font-size:11px;color:#888;margin-top:12px'>Stored excerpt — connect to reload the full message.</p>"
+    return (
+        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
+        "<style>body{font-family:-apple-system,sans-serif;font-size:13px;"
+        "padding:12px 16px;background:#fff;color:#111;word-break:break-word}"
+        ".hdr{border-bottom:1px solid #eee;margin-bottom:12px;padding-bottom:10px}"
+        ".hdr-row{color:#555;font-size:12px;margin-bottom:3px}"
+        ".hdr-row b{color:#111}</style></head><body>"
+        f"<div class='hdr'>"
+        + (f"<div class='hdr-row'><b>From:</b> {account}</div>" if account else "")
+        + (f"<div class='hdr-row'><b>Date:</b> {_html.escape(modified)}</div>" if modified else "")
+        + (f"<div class='hdr-row'><b>Subject:</b> {subject}</div>" if subject else "")
+        + f"</div>{body}{note}</body></html>"
+    )
+
+
 @bp.route("/api/preview/<item_id>")
 def get_preview(item_id):
     """Return a preview URL or HTML for a flagged item."""
@@ -541,7 +564,11 @@ def get_preview(item_id):

     try:
         if source_type == "email":
+            excerpt = item_meta.get("body_excerpt", "")
             if not state.connector:
+                if excerpt:
+                    import html as _html
+                    return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
                 return jsonify({"error": "not authenticated"}), 401
             uid = account_id
             try:
@@ -550,6 +577,8 @@ def get_preview(item_id):
                     {"$select": "subject,from,receivedDateTime,body"}
                 )
             except Exception as e:
+                if excerpt:
+                    return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)})
                 return jsonify({"error": f"Could not load email: {e}"})

             sender = msg.get("from", {}).get("emailAddress", {})
@@ -619,23 +648,33 @@ def get_preview(item_id):
                     return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"})
                 # Fallback: generic Drive embed
                 return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")})
-            # Gmail — not embeddable; show link card
-            icon = "✉️" if source_type == "gmail" else "☁️"
-            label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
+            # Gmail — not embeddable; show link card + stored body excerpt if available
+            icon = "✉️" if source_type == "gmail" else "☁️"
+            label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive"
+            excerpt = item_meta.get("body_excerpt", "")
             link_html = (
                 f'<a href="{_html_esc(item_url)}" target="_blank" '
                 f'style="display:inline-block;margin-top:12px;padding:8px 16px;'
                 f'background:#3b7dd8;color:#fff;border-radius:6px;text-decoration:none;font-size:12px">'
                 f'{label}</a>'
             ) if item_url else ""
-            html_out = (
-                f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
-                f'<div style="font-size:40px">{icon}</div>'
-                f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
-                f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
-                f'{link_html}'
-                f'</div>'
-            )
+            if excerpt and source_type == "gmail":
+                html_out = _excerpt_page(excerpt, item_meta)
+                if item_url:
+                    # Inject the "Open in Gmail" link before </body>
+                    html_out = html_out.replace(
+                        "</body>",
+                        f'<div style="margin-top:12px">{link_html}</div></body>'
+                    )
+            else:
+                html_out = (
+                    f'<div style="padding:24px;text-align:center;font-family:sans-serif">'
+                    f'<div style="font-size:40px">{icon}</div>'
+                    f'<div style="font-size:13px;font-weight:600;margin:8px 0">{_html_esc(name)}</div>'
+                    f'<div style="font-size:11px;color:var(--muted)">No inline preview available for this item</div>'
+                    f'{link_html}'
+                    f'</div>'
+                )
             return jsonify({"type": "html", "html": html_out})

         else:
@@ -1158,6 +1158,7 @@ def export_article30():
         return jsonify({"error": str(e)}), 500


+@bp.route("/api/delete_item", methods=["POST"])
 def delete_item():
     """Delete a single flagged item. Returns {ok, error}."""
     if not state.connector:
@@ -1200,6 +1201,104 @@ def delete_item():
         return jsonify({"ok": False, "error": str(e)})


+_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt"}
+
+
+@bp.route("/api/redact_item", methods=["POST"])
+def redact_item():
+    """Redact CPR numbers in-place in a local file. Returns {ok, redacted}."""
+    from pathlib import Path as _Path
+    import tempfile as _tempfile
+    import shutil as _shutil
+
+    data = request.get_json() or {}
+    item_id = data.get("id", "")
+    if not item_id:
+        return jsonify({"ok": False, "error": "id required"}), 400
+
+    # Resolve item meta: in-memory first (active scan), then DB (history)
+    item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None)
+    if item_meta is None:
+        _db = _get_db() if DB_OK else None
+        if _db:
+            row = _db._connect().execute(
+                "SELECT * FROM flagged_items WHERE id=? LIMIT 1", (item_id,)
+            ).fetchone()
+            item_meta = dict(row) if row else {}
+        else:
+            item_meta = {}
+
+    source_type = item_meta.get("source_type", "")
+    if source_type not in ("local",):
+        return jsonify({"ok": False, "error": "Redaction is only supported for local files"}), 400
+
+    full_path = item_meta.get("full_path", "")
+    if not full_path:
+        return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400
+
+    path = _Path(full_path).expanduser()
+    if not path.exists():
+        return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404
+
+    ext = path.suffix.lower()
+    if ext not in _REDACT_EXTS:
+        return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT"}), 400
+
+    tmp_path = None
+    try:
+        from document_scanner import (
+            scan_docx, redact_docx,
+            scan_xlsx, redact_xlsx,
+            redact_csv,
+            find_pii_spans_in_text,
+        )
+
+        with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp:
+            tmp_path = _Path(tmp.name)
+
+        if ext == ".docx":
+            results = scan_docx(path)
+            redacted = redact_docx(path, tmp_path, results, use_ner=False)
+        elif ext == ".xlsx":
+            results = scan_xlsx(path)
+            redacted = redact_xlsx(path, tmp_path, results, use_ner=False)
+        elif ext == ".csv":
+            redacted = redact_csv(path, tmp_path, use_ner=False)
+        else:  # .txt
+            text = path.read_text(encoding="utf-8", errors="replace")
+            spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"]
+            chars = list(text)
+            for s, e, _ in sorted(spans, reverse=True):
+                chars[s:e] = ["█"] * (e - s)
+            tmp_path.write_text("".join(chars), encoding="utf-8")
+            redacted = len(spans)
+
+        _shutil.move(str(tmp_path), str(path))
+        tmp_path = None
+
+        state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id]
+        _db = _get_db() if DB_OK else None
+        if _db:
+            try:
+                _db.log_deletion(item_meta, reason="redacted")
+                _db.delete_item_record(item_id)
+            except Exception:
+                pass
+
+        logger.info("[redact] %s — %d CPR span(s) redacted", path.name, redacted)
+        return jsonify({"ok": True, "redacted": redacted})
+
+    except Exception as e:
+        logger.error("[redact] failed: %s", e)
+        return jsonify({"ok": False, "error": str(e)})
+    finally:
+        if tmp_path and tmp_path.exists():
+            try:
+                tmp_path.unlink()
+            except Exception:
+                pass
+
+
 @bp.route("/api/delete_bulk", methods=["POST"])
 def delete_bulk():
     """Delete multiple items matching criteria. Streams progress as SSE."""
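The `.txt` branch above replaces each CPR span with same-length `█` runs, so character offsets stay valid while spans are processed. A self-contained sketch of that step follows. The regex is a deliberately naive, hypothetical stand-in for `find_pii_spans_in_text` (it does no date or checksum validation), and `find_cpr_spans` / `redact_text` are illustrative names.

```python
import re

# Hypothetical stand-in matcher: six digits, optional hyphen, four digits.
_CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def find_cpr_spans(text):
    """Return (start, end, label) tuples for CPR-looking substrings."""
    return [(m.start(), m.end(), "CPR") for m in _CPR_RE.finditer(text)]


def redact_text(text, spans):
    """Overwrite every span with block characters, preserving text length."""
    chars = list(text)
    for start, end, _label in spans:
        chars[start:end] = ["█"] * (end - start)
    return "".join(chars)
```

Because each replacement has the same length as the span it covers, the hyphen is blocked out too, which is why TXT output shows a solid run rather than the `██████-████` form used for DOCX/XLSX.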
@@ -255,6 +255,7 @@ def _run_google_scan(options: dict):
             "special_category": [],
             "face_count": 0,
             "exif": {},
+            "body_excerpt": item_meta.get("_body_excerpt", ""),
         }
         flagged_items.append(card)
         _google_flagged.append(card)
@@ -305,6 +306,14 @@ def _run_google_scan(options: dict):
                 try:
                     meta["_account"] = _display_name
                     meta["_source_type"] = "gmail"
+                    # Extract a plain-text excerpt before scanning (body is discarded after)
+                    try:
+                        import re as _re
+                        _raw = data[:3000].decode("utf-8", errors="replace")
+                        _plain = _re.sub(r"<[^>]+>", " ", _raw)
+                        meta["_body_excerpt"] = " ".join(_plain.split())[:500]
+                    except Exception:
+                        meta["_body_excerpt"] = ""
                     result = _scan_bytes(data, meta.get("name", "msg.txt"))
                 except Exception as e:
                     broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)})
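The Gmail excerpt logic in this hunk (decode a bounded prefix, strip tags with a regex, collapse whitespace, truncate) can be pulled out as a small function for experimentation. `body_excerpt` and its parameter names are hypothetical; the 3000-byte and 500-character limits are the ones used in the hunk.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")


def body_excerpt(data: bytes, scan_bytes: int = 3000, limit: int = 500) -> str:
    """Build a short plain-text preview from raw (possibly HTML) body bytes."""
    # errors="replace" absorbs a multi-byte character cut by the byte slice.
    raw = data[:scan_bytes].decode("utf-8", errors="replace")
    plain = _TAG_RE.sub(" ", raw)  # crude tag strip, entities are left as-is
    return " ".join(plain.split())[:limit]
```

This is a preview heuristic rather than an HTML parser: entities such as `&amp;` stay encoded and a tag split at the byte boundary may leak into the excerpt.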
@@ -549,6 +549,7 @@ def run_scan(options: dict):
             "special_category": item_meta.get("_special_category", []),
             "face_count": item_meta.get("_face_count", 0),
             "exif": item_meta.get("_exif", {}),
+            "body_excerpt": item_meta.get("_body_excerpt", ""),
         }
         _state.flagged_items.append(card)
         broadcast("scan_file_flagged", _with_disposition(card, _db))
@@ -1153,6 +1154,8 @@ def run_scan(options: dict):
                     meta["_transfer_risk"] = _check_transfer_risk(meta)
                     meta["_special_category"] = _check_special_category(
                         body_text if scan_email_body else "", all_cprs)
+                    # Store a short excerpt so preview still works if Graph is unavailable
+                    meta["_body_excerpt"] = body_text[:500].strip() if body_text else ""
                     _broadcast_card(meta, all_cprs, pii_counts=_email_pii)
                     del body_text  # free email text — may be large for HTML-rich emails

@@ -82,6 +82,31 @@ async function loadHistorySession(refScanId) {
     try { window.markOverdueCards(); } catch(_) {}
     try { window.loadTrend(); } catch(_) {}
     _setHistoryBanner(true, resolvedRef);
+
+    // ── Re-scan diff: append items from previous session no longer present ────
+    const allSessions = _sessions !== null ? _sessions : await _fetchSessions();
+    const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef);
+    if (idx !== -1 && idx + 1 < allSessions.length) {
+      const prevRef = allSessions[idx + 1].ref_scan_id;
+      try {
+        const pr = await fetch('/api/db/flagged?ref=' + prevRef);
+        const prevItems = await pr.json();
+        if (Array.isArray(prevItems) && prevItems.length) {
+          const currentIds = new Set(items.map(f => f.id));
+          const resolved = prevItems.filter(f => !currentIds.has(f.id));
+          if (resolved.length) {
+            const divider = document.createElement('div');
+            divider.className = 'resolved-divider';
+            divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present');
+            document.getElementById('grid')?.appendChild(divider);
+            resolved.forEach(f => { f._resolved = true; window.appendCard(f); });
+            _setHistoryBanner(true, resolvedRef, resolved.length);
+          }
+        }
+      } catch(e) {
+        console.warn('[history] diff failed:', e);
+      }
+    }
   } catch(e) {
     console.error('[history] failed to load session:', e);
   }
@@ -89,7 +114,7 @@ async function loadHistorySession(refScanId) {

 // ── Banner ────────────────────────────────────────────────────────────────────

-function _setHistoryBanner(visible, resolvedRef) {
+function _setHistoryBanner(visible, resolvedRef, resolvedCount) {
   const banner = document.getElementById('historyBanner');
   const bannerTxt = document.getElementById('historyBannerText');
   const latestBtn = document.getElementById('historyLatestBtn');
@@ -107,6 +132,7 @@ function _setHistoryBanner(visible, resolvedRef) {
     label = date + ' ' + time
       + (srcStr ? ' · ' + srcStr : '')
       + ' · ' + sess.flagged_count + ' ' + t('history_items', 'items');
+    if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved');
   } else {
     label = S.flaggedData.length + ' ' + t('history_items', 'items');
   }
@@ -24,7 +24,7 @@ function appendCard(f) {
     : '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type);

   const card = document.createElement('div');
-  card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '');
+  card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + (f._resolved ? ' card-resolved' : '');
   card.dataset.id = f.id;
   card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } };

@@ -35,7 +35,11 @@ function appendCard(f) {
   cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); };
   card.appendChild(cb);

-  const delBtn = window.VIEWER_MODE ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
+  const delBtn = (window.VIEWER_MODE || f._resolved) ? '' : `<button class="card-delete-btn" title="${t('m365_delete_confirm','Delete')}" onclick="event.stopPropagation();deleteItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">🗑</button>`;
+  const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv']);
+  const _redactable = !window.VIEWER_MODE && !f._resolved && f.source_type === 'local' && f.cpr_count > 0
+    && _redactExts.has((f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase());
+  const redactBtn = _redactable ? `<button class="card-redact-btn" title="${t('redact_btn','Redact CPR')}" onclick="event.stopPropagation();redactItem(${JSON.stringify(f).replace(/"/g,'&quot;')},this.closest('.card'))">✏</button>` : '';

   if (S.isListView) {
     card.innerHTML = `
@@ -50,8 +54,8 @@ function appendCard(f) {
       ${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
       ${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
       ${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
-      ${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
-      ${delBtn}`;
+      ${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f._resolved ? '<span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
+      ${delBtn}${redactBtn}`;
   } else {
     card.innerHTML = `
       <div class="thumb-wrap"><img src="${src}" alt="${f.name}" loading="lazy"></div>
@@ -60,9 +64,9 @@ function appendCard(f) {
         <div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div>
         ${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''}
         <div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0">⚠ Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
-        <span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
+        <span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f._resolved ? ' <span class="resolved-badge">✓ ' + t('history_resolved_badge', 'Resolved') + '</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
       </div>
-      ${delBtn}`;
+      ${delBtn}${redactBtn}`;
   }
   grid.appendChild(card);
 }
@@ -594,6 +598,32 @@ async function deleteItem(f, cardEl) {
   }
 }

+async function redactItem(f, cardEl) {
+  if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. This cannot be undone.'))) return;
+  if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; }
+  try {
+    const r = await fetch('/api/redact_item', {
+      method: 'POST', headers: {'Content-Type': 'application/json'},
+      body: JSON.stringify({id: f.id, source_type: f.source_type})
+    });
+    const d = await r.json();
+    if (d.ok) {
+      S.flaggedData = S.flaggedData.filter(x => x.id !== f.id);
+      S.filteredData = S.filteredData.filter(x => x.id !== f.id);
+      if (cardEl) cardEl.remove();
+      updateStats();
+      log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok');
+      if (_previewItemId === f.id) closePreview();
+    } else {
+      if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
+      log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err');
+    }
+  } catch(e) {
+    if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; }
+    log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err');
+  }
+}
+
 // ── Bulk delete modal ─────────────────────────────────────────────────────────

 function openBulkDelete() {
@@ -1049,6 +1079,7 @@ window.loadDisposition = loadDisposition;
window.saveDisposition = saveDisposition;
window.closePreview = closePreview;
window.deleteItem = deleteItem;
window.redactItem = redactItem;
window.openBulkDelete = openBulkDelete;
window.closeBulkDelete = closeBulkDelete;
window._bdFilters = _bdFilters;

@@ -253,6 +253,9 @@
.card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-delete-btn { opacity:1; }
.card.list-view .card-delete-btn { position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; }
.card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0; transition:opacity .15s; padding:0; z-index:1; }
.card:hover .card-redact-btn { opacity:1; }
.card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; }

/* Per-card checkbox (select mode) */
.card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2;
@@ -491,6 +494,12 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
    background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
    background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; }
.card-resolved { opacity: 0.6; }
.resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px;
    color: var(--muted); border-top: 1px dashed var(--border); text-align: center; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
    background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }

311
tests/test_google_scan.py
Normal file
@@ -0,0 +1,311 @@
"""
Route and engine tests for the Google Workspace scan module.

Covers:
  - GET /api/google/scan/users: auth guard, user list, error propagation
  - POST /api/google/scan/start: auth guard, concurrency lock, successful start, lock release
  - POST /api/google/scan/cancel: abort signal
  - _run_google_scan: no-connector broadcast, CPR hit flagging, source_type tagging
"""
from __future__ import annotations
import threading
import time
from unittest.mock import MagicMock

import pytest


# ── Fixtures ──────────────────────────────────────────────────────────────────

@pytest.fixture(scope="module")
def flask_app():
    import gdpr_scanner
    gdpr_scanner.app.config["TESTING"] = True
    gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False
    return gdpr_scanner.app


@pytest.fixture()
def client(flask_app):
    with flask_app.test_client() as c:
        yield c


@pytest.fixture()
def mock_google_connector(monkeypatch):
    from routes import state
    conn = MagicMock()
    conn.list_users.return_value = []
    monkeypatch.setattr(state, "google_connector", conn)
    return conn


@pytest.fixture(autouse=True)
def clean_google_state():
    yield
    from routes import state
    # Probe the Google scan lock without blocking. If the probe succeeds the
    # lock was free, so release it again at once; a lock that is still held
    # elsewhere is left untouched (this does not force-release it).
    acquired = state._google_scan_lock.acquire(blocking=False)
    if acquired:
        state._google_scan_lock.release()
    state._google_scan_abort.clear()


# ── GET /api/google/scan/users ────────────────────────────────────────────────

class TestGoogleScanUsers:
    def test_not_connected_returns_401(self, client, monkeypatch):
        from routes import state
        monkeypatch.setattr(state, "google_connector", None)
        r = client.get("/api/google/scan/users")
        assert r.status_code == 401
        assert r.json["error"] == "not connected"

    def test_returns_user_list(self, client, mock_google_connector):
        mock_google_connector.list_users.return_value = [
            {"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"},
        ]
        r = client.get("/api/google/scan/users")
        assert r.status_code == 200
        assert len(r.json["users"]) == 1
        assert r.json["users"][0]["email"] == "alice@test.dk"

    def test_returns_empty_list_when_no_users(self, client, mock_google_connector):
        mock_google_connector.list_users.return_value = []
        r = client.get("/api/google/scan/users")
        assert r.status_code == 200
        assert r.json["users"] == []

    def test_connector_error_returns_500(self, client, mock_google_connector):
        mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable")
        r = client.get("/api/google/scan/users")
        assert r.status_code == 500
        assert "error" in r.json


# ── POST /api/google/scan/start ───────────────────────────────────────────────

class TestGoogleScanStart:
    def test_not_connected_returns_401(self, client, monkeypatch):
        from routes import state
        monkeypatch.setattr(state, "google_connector", None)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 401
        assert "not connected" in r.json["error"]

    def test_already_running_returns_409(self, client, mock_google_connector):
        from routes import state
        state._google_scan_lock.acquire()
        try:
            r = client.post("/api/google/scan/start", json={})
            assert r.status_code == 409
            assert "already running" in r.json["error"]
        finally:
            state._google_scan_lock.release()

    def test_starts_successfully(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert r.json["status"] == "started"

    def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        state._google_scan_abort.set()
        monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None)
        client.post("/api/google/scan/start", json={})
        assert not state._google_scan_abort.is_set()

    def test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        done = threading.Event()

        def _fake_scan(opts):
            time.sleep(0.02)
            done.set()

        monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert done.wait(timeout=3), "Scan thread did not complete in time"
        time.sleep(0.05)  # allow finally block to run
        acquired = state._google_scan_lock.acquire(blocking=False)
        assert acquired, "Lock was not released after scan completed"
        state._google_scan_lock.release()

    @pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning")
    def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch):
        import routes.google_scan
        from routes import state
        done = threading.Event()

        def _failing_scan(opts):
            done.set()
            raise RuntimeError("simulated crash")

        monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan)
        r = client.post("/api/google/scan/start", json={})
        assert r.status_code == 200
        assert done.wait(timeout=3), "Scan thread did not complete in time"
        time.sleep(0.05)
        acquired = state._google_scan_lock.acquire(blocking=False)
        assert acquired, "Lock was not released after scan raised an exception"
        state._google_scan_lock.release()


# ── POST /api/google/scan/cancel ─────────────────────────────────────────────

class TestGoogleScanCancel:
    def test_sets_abort_event(self, client):
        from routes import state
        state._google_scan_abort.clear()
        r = client.post("/api/google/scan/cancel")
        assert r.status_code == 200
        assert r.json["status"] == "cancelling"
        assert state._google_scan_abort.is_set()

    def test_idempotent_when_not_running(self, client):
        r = client.post("/api/google/scan/cancel")
        assert r.status_code == 200
        assert r.json["status"] == "cancelling"


# ── _run_google_scan engine ───────────────────────────────────────────────────

class TestRunGoogleScan:
    """
    Unit-tests for _run_google_scan() called synchronously with all heavy
    dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O.
    """

    def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None):
        import gdpr_scanner
        import checkpoint
        import scan_engine
        import gdpr_db
        from routes import state

        events = []
        monkeypatch.setattr(state, "google_connector", conn)
        monkeypatch.setattr(gdpr_scanner, "broadcast",
                            lambda evt, data=None: events.append((evt, data or {})))
        monkeypatch.setattr(gdpr_scanner, "_scan_bytes",
                            lambda data, name: scan_bytes_result or {
                                "cprs": [], "pii_counts": None, "emails": [], "phones": []
                            })
        monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None)
        monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {})
        monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None)
        monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card)
        monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None)

        gdpr_scanner.flagged_items.clear()
        return events

    def _run(self, monkeypatch, conn, options, scan_bytes_result=None):
        import gdpr_scanner
        import routes.google_scan as gs
        events = self._setup_mocks(monkeypatch, conn, scan_bytes_result)
        gs._run_google_scan(options)
        gdpr_scanner.flagged_items.clear()
        return events

    def test_no_connector_broadcasts_error_and_done(self, monkeypatch):
        import gdpr_scanner
        import routes.google_scan as gs
        from routes import state
        events = []
        monkeypatch.setattr(state, "google_connector", None)
        monkeypatch.setattr(gdpr_scanner, "broadcast",
                            lambda evt, data=None: events.append((evt, data or {})))
        gs._run_google_scan({"sources": ["gmail"], "user_emails": ["a@b.dk"], "options": {}})

        assert any(evt == "scan_error" for evt, _ in events)
        assert any(evt == "google_scan_done" for evt, _ in events)

    def test_gmail_item_with_cpr_is_flagged(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"),
        ]
        cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)

        flagged = [d for evt, d in events if evt == "scan_file_flagged"]
        assert len(flagged) == 1

    def test_gmail_item_source_type_is_gmail(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"),
        ]
        cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)

        flagged = [d for evt, d in events if evt == "scan_file_flagged"]
        assert flagged[0]["source_type"] == "gmail"

    def test_gmail_item_without_pii_not_flagged(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"),
        ]
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})

        assert not any(evt == "scan_file_flagged" for evt, _ in events)

    def test_gdrive_item_source_type_is_gdrive(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = []
        conn.iter_drive_files.return_value = [
            ({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"),
        ]
        cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)

        gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"]
        assert len(gdrive) == 1

    def test_scan_done_always_broadcast(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = []
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}})

        done = [d for evt, d in events if evt == "google_scan_done"]
        assert len(done) == 1
        assert "flagged_count" in done[0]
        assert "total_scanned" in done[0]

    def test_scan_done_counts_are_correct(self, monkeypatch):
        conn = MagicMock()
        conn.list_users.return_value = []
        conn.iter_gmail_messages.return_value = [
            ({"id": "m1", "name": "a.txt", "size": 100}, b"x"),
            ({"id": "m2", "name": "b.txt", "size": 100}, b"y"),
        ]
        cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []}
        events = self._run(monkeypatch, conn,
                           {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}},
                           scan_bytes_result=cpr_result)

        done = next(d for evt, d in events if evt == "google_scan_done")
        assert done["total_scanned"] == 2
        assert done["flagged_count"] == 2