diff --git a/CHANGELOG.md b/CHANGELOG.md index 923d51e..871ca95 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,28 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html --- +## [Unreleased] + +### Added + +- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards. + +- **`POST /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly. + +--- + +## [1.6.27] — 2026-05-27 + +### Added + +- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. 
A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview. + +- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`). + +- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values. + +--- + ## [1.6.26] — 2026-04-29 ### Fixed diff --git a/CLAUDE.md b/CLAUDE.md index 5894c9b..ef89a9d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -50,6 +50,8 @@ python -m pytest tests/ -q 182 tests in `tests/`. No integration tests for live M365/Google connections. +**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. 
Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test. + **`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`. **Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen` — all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag. @@ -111,6 +113,7 @@ Exception hierarchy (all inherit `M365Error(Exception)`): Large M365 tenants can generate enormous memory pressure. 
Key rules to preserve: - **Email body stripped at collection time** — `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here. +- **`body_excerpt` — 500-char plain-text preview stored per flagged email** — just before `del body_text` in M365 email processing, `meta["_body_excerpt"] = body_text[:500].strip()`. In `google_scan.py`, a regex HTML-strip of the first 3000 bytes of Gmail body data is stored the same way. `_broadcast_card` in both engines includes `"body_excerpt"` in the card dict so the excerpt flows into `flagged_items`, the checkpoint JSON, and the DB (`body_excerpt TEXT`, migration #10). The M365 email preview route falls back to `_excerpt_page()` when Graph raises or the connector is absent. The Gmail preview shows `_excerpt_page()` as primary content with the "Open in Gmail" link appended. Do not remove the excerpt before broadcasting — that's what makes preview work on checkpoint resume. - **`work_items` → `deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`. - **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`. - **`del body_text` in email branch** — deleted after `_broadcast_card` call. @@ -124,6 +127,7 @@ Large M365 tenants can generate enormous memory pressure. Key rules to preserve: - **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value. 
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing. - **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them. +- **`POST /api/redact_item`** — rewrites a local file in-place with CPR numbers replaced by `██████-████` / `█` blocks, then removes the card from the grid and logs a `"redacted"` disposition. Supported extensions: `.docx`, `.xlsx`, `.csv`, `.txt` (`_REDACT_EXTS`). The file is written to a temp path in the **same directory** as the original before `shutil.move` — this avoids cross-device rename failures on mounted volumes. Uses existing `document_scanner` functions (`redact_docx`, `redact_xlsx`, `redact_csv`, `find_pii_spans_in_text`). Only works for `source_type == "local"` — SMB/cloud files are not supported (button is hidden on those cards). The button (`✂`, class `card-redact-btn`) appears in `appendCard` when `_redactable(f)` is true; hidden in viewer mode and for resolved items. ## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py @@ -137,6 +141,7 @@ Allows reviewing results from any past scan session without running a new scan. - **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. 
Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`. - **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper. - **Cache invalidation** — `_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion. +- **Re-scan diff** — `loadHistorySession` fetches the immediately preceding session's items after rendering the current session. Items present in the previous session but absent from the current one (compared by `id`) are tagged `_resolved: true` and appended after a `.resolved-divider` separator. `appendCard` in `results.js` adds `.card-resolved` (opacity 0.6), a green `✓ Resolved` badge, and hides the delete button for resolved items. `_setHistoryBanner` accepts an optional `resolvedCount` parameter and appends it to the banner label. Resolved items are NOT added to `S.flaggedData` — they are grid-only and cannot be bulk-selected or exported. - **Auto-load on page load** — `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load. - **Mode transitions** — `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving. 
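Reviewer note on the re-scan diff rule in the CLAUDE.md hunk above: items whose `id` appears in the previous session but not in the current one are tagged `_resolved: true`, which amounts to a set difference. The sketch below restates that rule in Python for illustration only. The real logic is client-side JavaScript in `history.js`; the function name `resolved_items` and the sample dicts are hypothetical, and unlike the JavaScript (which sets `_resolved` on the fetched objects) this version copies each item.

```python
def resolved_items(current: list[dict], previous: list[dict]) -> list[dict]:
    """Return previous-session items whose id is missing from the current session."""
    current_ids = {item["id"] for item in current}
    resolved = []
    for item in previous:
        if item["id"] not in current_ids:
            # Copy so the caller's previous-session data is not mutated.
            resolved.append({**item, "_resolved": True})
    return resolved


# Hypothetical sample data: "a" and "c" were flagged before but are gone now.
previous = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
current = [{"id": "b"}]
print([item["id"] for item in resolved_items(current, previous)])  # ['a', 'c']
```

Because the comparison is by `id` only, an item that still exists but was re-flagged under a different `id` would also show up as resolved under this rule.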
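Reviewer note on the `POST /api/redact_item` entry in the CLAUDE.md hunk above: the same-directory temp file followed by a move can be sketched in isolation as below. This is a minimal sketch under stated assumptions, not the route's code. The helper name `rewrite_in_place` is hypothetical, it handles plain text only, and it uses `os.replace` where the route uses `shutil.move`.

```python
import os
import tempfile
from pathlib import Path


def rewrite_in_place(path: Path, new_text: str) -> None:
    """Replace the contents of `path` via a temp file in the same directory."""
    # Same directory means same filesystem, so the final rename cannot cross devices.
    fd, tmp_name = tempfile.mkstemp(suffix=path.suffix, dir=path.parent)
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            out.write(new_text)
        # Assumption of this sketch: os.replace instead of the route's shutil.move.
        os.replace(tmp_path, path)
    finally:
        # After a successful replace the temp name no longer exists;
        # on failure this removes the leftover temp file.
        if tmp_path.exists():
            tmp_path.unlink()


# Usage with placeholder content (not a real CPR number).
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "note.txt"
    target.write_text("cpr 000000-0000", encoding="utf-8")
    rewrite_in_place(target, "cpr [REDACTED]")
    print(target.read_text(encoding="utf-8"))
```

If writing the temp file fails partway through, the original file is untouched and the `finally` block cleans up, which is the property the changelog entry describes.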
diff --git a/gdpr_db.py b/gdpr_db.py index b3776d2..ffd6cac 100644 --- a/gdpr_db.py +++ b/gdpr_db.py @@ -202,6 +202,7 @@ _MIGRATIONS: list[tuple[int, str]] = [ (6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"), (8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"), (9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"), + (10, "ALTER TABLE flagged_items ADD COLUMN body_excerpt TEXT NOT NULL DEFAULT ''"), (7, """CREATE TABLE IF NOT EXISTS schedule_runs ( id INTEGER PRIMARY KEY AUTOINCREMENT, started_at REAL NOT NULL, @@ -314,8 +315,8 @@ class ScanDB: url, drive_id, size_kb, modified, cpr_count, risk, thumb_b64, thumb_mime, attachments, user_role, transfer_risk, special_category, face_count, exif_json, full_path, - email_count, phone_count, scanned_at) - VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""", + email_count, phone_count, body_excerpt, scanned_at) + VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""", ( card.get("id", ""), scan_id, @@ -341,6 +342,7 @@ class ScanDB: card.get("full_path", ""), card.get("email_count", 0), card.get("phone_count", 0), + card.get("body_excerpt", ""), now, ), ) diff --git a/routes/database.py b/routes/database.py index f46aba5..83c153d 100644 --- a/routes/database.py +++ b/routes/database.py @@ -344,6 +344,29 @@ def db_import(): return jsonify({"error": str(e)}), 500 +def _excerpt_page(excerpt: str, item_meta: dict) -> str: + """Minimal HTML page showing a stored body excerpt as a preview fallback.""" + import html as _html + subject = _html.escape(item_meta.get("name", "")) + modified = item_meta.get("modified", "") + account = _html.escape(item_meta.get("account_name", "")) + body = "
" + _html.escape(excerpt) + "
" + note = "

Stored excerpt — connect to reload the full message.

" + return ( + "" + "" + f"
" + + (f"
From: {account}
" if account else "") + + (f"
Date: {_html.escape(modified)}
" if modified else "") + + (f"
Subject: {subject}
" if subject else "") + + f"
{body}{note}" + ) + + @bp.route("/api/preview/") def get_preview(item_id): """Return a preview URL or HTML for a flagged item.""" @@ -541,7 +564,11 @@ def get_preview(item_id): try: if source_type == "email": + excerpt = item_meta.get("body_excerpt", "") if not state.connector: + if excerpt: + import html as _html + return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)}) return jsonify({"error": "not authenticated"}), 401 uid = account_id try: @@ -550,6 +577,8 @@ def get_preview(item_id): {"$select": "subject,from,receivedDateTime,body"} ) except Exception as e: + if excerpt: + return jsonify({"type": "html", "html": _excerpt_page(excerpt, item_meta)}) return jsonify({"error": f"Could not load email: {e}"}) sender = msg.get("from", {}).get("emailAddress", {}) @@ -619,23 +648,33 @@ def get_preview(item_id): return jsonify({"type": "iframe", "url": f"https://drive.google.com/file/d/{fid}/preview"}) # Fallback: generic Drive embed return jsonify({"type": "iframe", "url": item_url.replace("/view", "/preview")}) - # Gmail — not embeddable; show link card - icon = "✉️" if source_type == "gmail" else "☁️" - label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive" + # Gmail — not embeddable; show link card + stored body excerpt if available + icon = "✉️" if source_type == "gmail" else "☁️" + label = "Open in Gmail" if source_type == "gmail" else "Open in Google Drive" + excerpt = item_meta.get("body_excerpt", "") link_html = ( f'' f'{label}' ) if item_url else "" - html_out = ( - f'
' - f'
{icon}
' - f'
{_html_esc(name)}
' - f'
No inline preview available for this item
' - f'{link_html}' - f'
' - ) + if excerpt and source_type == "gmail": + html_out = _excerpt_page(excerpt, item_meta) + if item_url: + # Inject the "Open in Gmail" link before + html_out = html_out.replace( + "", + f'
{link_html}
' + ) + else: + html_out = ( + f'
' + f'
{icon}
' + f'
{_html_esc(name)}
' + f'
No inline preview available for this item
' + f'{link_html}' + f'
' + ) return jsonify({"type": "html", "html": html_out}) else: diff --git a/routes/export.py b/routes/export.py index 77773ff..cc29b40 100644 --- a/routes/export.py +++ b/routes/export.py @@ -1158,6 +1158,7 @@ def export_article30(): return jsonify({"error": str(e)}), 500 +@bp.route("/api/delete_item", methods=["POST"]) def delete_item(): """Delete a single flagged item. Returns {ok, error}.""" if not state.connector: @@ -1200,6 +1201,104 @@ def delete_item(): return jsonify({"ok": False, "error": str(e)}) +_REDACT_EXTS = {".docx", ".xlsx", ".csv", ".txt"} + + +@bp.route("/api/redact_item", methods=["POST"]) +def redact_item(): + """Redact CPR numbers in-place in a local file. Returns {ok, redacted}.""" + from pathlib import Path as _Path + import tempfile as _tempfile + import shutil as _shutil + + data = request.get_json() or {} + item_id = data.get("id", "") + if not item_id: + return jsonify({"ok": False, "error": "id required"}), 400 + + # Resolve item meta: in-memory first (active scan), then DB (history) + item_meta = next((x for x in state.flagged_items if x.get("id") == item_id), None) + if item_meta is None: + _db = _get_db() if DB_OK else None + if _db: + row = _db._connect().execute( + "SELECT * FROM flagged_items WHERE id=? 
LIMIT 1", (item_id,) + ).fetchone() + item_meta = dict(row) if row else {} + else: + item_meta = {} + + source_type = item_meta.get("source_type", "") + if source_type not in ("local",): + return jsonify({"ok": False, "error": "Redaction is only supported for local files"}), 400 + + full_path = item_meta.get("full_path", "") + if not full_path: + return jsonify({"ok": False, "error": "File path not available — rescan to enable redaction"}), 400 + + path = _Path(full_path).expanduser() + if not path.exists(): + return jsonify({"ok": False, "error": f"File not found: {full_path}"}), 404 + + ext = path.suffix.lower() + if ext not in _REDACT_EXTS: + return jsonify({"ok": False, "error": f"Redaction not supported for {ext or 'this'} files. Supported: DOCX, XLSX, CSV, TXT"}), 400 + + tmp_path = None + try: + from document_scanner import ( + scan_docx, redact_docx, + scan_xlsx, redact_xlsx, + redact_csv, + find_pii_spans_in_text, + ) + + with _tempfile.NamedTemporaryFile(suffix=ext, delete=False, dir=path.parent) as tmp: + tmp_path = _Path(tmp.name) + + if ext == ".docx": + results = scan_docx(path) + redacted = redact_docx(path, tmp_path, results, use_ner=False) + elif ext == ".xlsx": + results = scan_xlsx(path) + redacted = redact_xlsx(path, tmp_path, results, use_ner=False) + elif ext == ".csv": + redacted = redact_csv(path, tmp_path, use_ner=False) + else: # .txt + text = path.read_text(encoding="utf-8", errors="replace") + spans = [(s, e, l) for s, e, l in find_pii_spans_in_text(text, use_ner=False) if l == "CPR"] + chars = list(text) + for s, e, _ in sorted(spans, reverse=True): + chars[s:e] = ["█"] * (e - s) + tmp_path.write_text("".join(chars), encoding="utf-8") + redacted = len(spans) + + _shutil.move(str(tmp_path), str(path)) + tmp_path = None + + state.flagged_items[:] = [x for x in state.flagged_items if x.get("id") != item_id] + _db = _get_db() if DB_OK else None + if _db: + try: + _db.log_deletion(item_meta, reason="redacted") + 
_db.delete_item_record(item_id) + except Exception: + pass + + logger.info("[redact] %s — %d CPR span(s) redacted", path.name, redacted) + return jsonify({"ok": True, "redacted": redacted}) + + except Exception as e: + logger.error("[redact] failed: %s", e) + return jsonify({"ok": False, "error": str(e)}) + finally: + if tmp_path and tmp_path.exists(): + try: + tmp_path.unlink() + except Exception: + pass + + @bp.route("/api/delete_bulk", methods=["POST"]) def delete_bulk(): """Delete multiple items matching criteria. Streams progress as SSE.""" diff --git a/routes/google_scan.py b/routes/google_scan.py index 6399a75..3c375e6 100644 --- a/routes/google_scan.py +++ b/routes/google_scan.py @@ -255,6 +255,7 @@ def _run_google_scan(options: dict): "special_category": [], "face_count": 0, "exif": {}, + "body_excerpt": item_meta.get("_body_excerpt", ""), } flagged_items.append(card) _google_flagged.append(card) @@ -305,6 +306,14 @@ def _run_google_scan(options: dict): try: meta["_account"] = _display_name meta["_source_type"] = "gmail" + # Extract a plain-text excerpt before scanning (body is discarded after) + try: + import re as _re + _raw = data[:3000].decode("utf-8", errors="replace") + _plain = _re.sub(r"<[^>]+>", " ", _raw) + meta["_body_excerpt"] = " ".join(_plain.split())[:500] + except Exception: + meta["_body_excerpt"] = "" result = _scan_bytes(data, meta.get("name", "msg.txt")) except Exception as e: broadcast("scan_error", {"file": meta.get("name", ""), "error": str(e)}) diff --git a/scan_engine.py b/scan_engine.py index 64b79c3..8e1953a 100644 --- a/scan_engine.py +++ b/scan_engine.py @@ -549,6 +549,7 @@ def run_scan(options: dict): "special_category": item_meta.get("_special_category", []), "face_count": item_meta.get("_face_count", 0), "exif": item_meta.get("_exif", {}), + "body_excerpt": item_meta.get("_body_excerpt", ""), } _state.flagged_items.append(card) broadcast("scan_file_flagged", _with_disposition(card, _db)) @@ -1153,6 +1154,8 @@ def 
run_scan(options: dict): meta["_transfer_risk"] = _check_transfer_risk(meta) meta["_special_category"] = _check_special_category( body_text if scan_email_body else "", all_cprs) + # Store a short excerpt so preview still works if Graph is unavailable + meta["_body_excerpt"] = body_text[:500].strip() if body_text else "" _broadcast_card(meta, all_cprs, pii_counts=_email_pii) del body_text # free email text — may be large for HTML-rich emails diff --git a/static/js/history.js b/static/js/history.js index 5b001f6..58bf8bb 100644 --- a/static/js/history.js +++ b/static/js/history.js @@ -82,6 +82,31 @@ async function loadHistorySession(refScanId) { try { window.markOverdueCards(); } catch(_) {} try { window.loadTrend(); } catch(_) {} _setHistoryBanner(true, resolvedRef); + + // ── Re-scan diff: append items from previous session no longer present ──── + const allSessions = _sessions !== null ? _sessions : await _fetchSessions(); + const idx = allSessions.findIndex(s => s.ref_scan_id === resolvedRef); + if (idx !== -1 && idx + 1 < allSessions.length) { + const prevRef = allSessions[idx + 1].ref_scan_id; + try { + const pr = await fetch('/api/db/flagged?ref=' + prevRef); + const prevItems = await pr.json(); + if (Array.isArray(prevItems) && prevItems.length) { + const currentIds = new Set(items.map(f => f.id)); + const resolved = prevItems.filter(f => !currentIds.has(f.id)); + if (resolved.length) { + const divider = document.createElement('div'); + divider.className = 'resolved-divider'; + divider.textContent = resolved.length + ' ' + t('history_resolved_label', 'items no longer present'); + document.getElementById('grid')?.appendChild(divider); + resolved.forEach(f => { f._resolved = true; window.appendCard(f); }); + _setHistoryBanner(true, resolvedRef, resolved.length); + } + } + } catch(e) { + console.warn('[history] diff failed:', e); + } + } } catch(e) { console.error('[history] failed to load session:', e); } @@ -89,7 +114,7 @@ async function 
loadHistorySession(refScanId) { // ── Banner ──────────────────────────────────────────────────────────────────── -function _setHistoryBanner(visible, resolvedRef) { +function _setHistoryBanner(visible, resolvedRef, resolvedCount) { const banner = document.getElementById('historyBanner'); const bannerTxt = document.getElementById('historyBannerText'); const latestBtn = document.getElementById('historyLatestBtn'); @@ -107,6 +132,7 @@ function _setHistoryBanner(visible, resolvedRef) { label = date + ' ' + time + (srcStr ? ' · ' + srcStr : '') + ' · ' + sess.flagged_count + ' ' + t('history_items', 'items'); + if (resolvedCount) label += ' · ' + resolvedCount + ' ' + t('history_resolved_badge', 'resolved'); } else { label = S.flaggedData.length + ' ' + t('history_items', 'items'); } diff --git a/static/js/results.js b/static/js/results.js index 6940022..c0f0cca 100644 --- a/static/js/results.js +++ b/static/js/results.js @@ -24,7 +24,7 @@ function appendCard(f) { : '/api/thumb?name=' + encodeURIComponent(f.name) + '&type=' + encodeURIComponent(f.source_type); const card = document.createElement('div'); - card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : ''); + card.className = 'card' + (S.isListView ? ' list-view' : '') + (S._selectedIds.has(f.id) ? ' card-selected-bulk' : '') + (f._resolved ? ' card-resolved' : ''); card.dataset.id = f.id; card.onclick = (e) => { if (S._selectMode) { toggleCardSelect(f.id, e); } else { openPreview(f); } }; @@ -35,7 +35,11 @@ function appendCard(f) { cb.onclick = (e) => { e.stopPropagation(); toggleCardSelect(f.id, e); }; card.appendChild(cb); - const delBtn = window.VIEWER_MODE ? '' : ``; + const delBtn = (window.VIEWER_MODE || f._resolved) ? 
'' : ``; + const _redactExts = new Set(['.docx', '.xlsx', '.txt', '.csv']); + const _redactable = !window.VIEWER_MODE && !f._resolved && f.source_type === 'local' && f.cpr_count > 0 + && _redactExts.has((f.name || '').substring((f.name || '').lastIndexOf('.')).toLowerCase()); + const redactBtn = _redactable ? `` : ''; if (S.isListView) { card.innerHTML = ` @@ -50,8 +54,8 @@ function appendCard(f) { ${f.phone_count > 0 ? '' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + ' ' : ''} ${f.face_count > 0 ? '' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + ' ' : ''} ${f.exif && f.exif.gps ? '🌍 GPS ' : ''} - ${f.special_category && f.special_category.length ? '⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + ' ' : ''}${f.overdue ? '🗓 Overdue' : ''} - ${delBtn}`; + ${f.special_category && f.special_category.length ? '⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + ' ' : ''}${f._resolved ? '✓ ' + t('history_resolved_badge', 'Resolved') + ' ' : ''}${f.overdue ? '🗓 Overdue' : ''} + ${delBtn}${redactBtn}`; } else { card.innerHTML = `
${f.name}
@@ -60,9 +64,9 @@ function appendCard(f) {
${f.size_kb} KB · ${f.modified || ''}
${f.folder ? `
📂 ${f.folder}
` : ''}
${label}${f.account_name ? ' ' : ''}${f.transfer_risk === "external-recipient" ? ' ⚠ Ext.' : f.transfer_risk ? ' 🔗' : ''}
- ${f.cpr_count} CPR${f.email_count > 0 ? ' ' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '' : ''}${f.phone_count > 0 ? ' ' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '' : ''}${f.face_count > 0 ? ' ' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '' : ''}${f.exif && f.exif.gps ? ' 🌍 GPS' : ''}${f.overdue ? ' 🗓 Overdue' : ''} + ${f.cpr_count} CPR${f.email_count > 0 ? ' ' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '' : ''}${f.phone_count > 0 ? ' ' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '' : ''}${f.face_count > 0 ? ' ' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '' : ''}${f.exif && f.exif.gps ? ' 🌍 GPS' : ''}${f._resolved ? ' ✓ ' + t('history_resolved_badge', 'Resolved') + '' : ''}${f.overdue ? ' 🗓 Overdue' : ''} - ${delBtn}`; + ${delBtn}${redactBtn}`; } grid.appendChild(card); } @@ -594,6 +598,32 @@ async function deleteItem(f, cardEl) { } } +async function redactItem(f, cardEl) { + if (!confirm(t('redact_confirm', 'Redact all CPR numbers in') + ' "' + f.name + '"?\n\n' + t('redact_warning', 'CPR numbers will be replaced with █ characters. 
This cannot be undone.'))) return; + if (cardEl) { cardEl.style.opacity = '0.5'; cardEl.style.pointerEvents = 'none'; } + try { + const r = await fetch('/api/redact_item', { + method: 'POST', headers: {'Content-Type': 'application/json'}, + body: JSON.stringify({id: f.id, source_type: f.source_type}) + }); + const d = await r.json(); + if (d.ok) { + S.flaggedData = S.flaggedData.filter(x => x.id !== f.id); + S.filteredData = S.filteredData.filter(x => x.id !== f.id); + if (cardEl) cardEl.remove(); + updateStats(); + log(t('redact_done', 'Redacted') + ' ' + f.name + ' (' + (d.redacted || 0) + ' ' + t('redact_spans', 'CPR spans') + ')', 'ok'); + if (_previewItemId === f.id) closePreview(); + } else { + if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; } + log(t('redact_failed', 'Redaction failed:') + ' ' + (d.error || '?'), 'err'); + } + } catch(e) { + if (cardEl) { cardEl.style.opacity = ''; cardEl.style.pointerEvents = ''; } + log(t('redact_failed', 'Redaction failed:') + ' ' + e.message, 'err'); + } +} + // ── Bulk delete modal ───────────────────────────────────────────────────────── function openBulkDelete() { @@ -1049,6 +1079,7 @@ window.loadDisposition = loadDisposition; window.saveDisposition = saveDisposition; window.closePreview = closePreview; window.deleteItem = deleteItem; +window.redactItem = redactItem; window.openBulkDelete = openBulkDelete; window.closeBulkDelete = closeBulkDelete; window._bdFilters = _bdFilters; diff --git a/static/style.css b/static/style.css index 17578b3..53d0dca 100644 --- a/static/style.css +++ b/static/style.css @@ -253,6 +253,9 @@ .card-delete-btn { position:absolute; top:6px; right:6px; background:rgba(0,0,0,0.45); color:#fff; border:none; border-radius:50%; width:22px; height:22px; font-size:13px; line-height:22px; text-align:center; cursor:pointer; opacity:0.35; transition:opacity .15s; padding:0; z-index:1; } .card:hover .card-delete-btn { opacity:1; } .card.list-view .card-delete-btn { 
position:static; opacity:1; background:transparent; color:var(--muted); flex-shrink:0; } + .card-redact-btn { position:absolute; top:6px; right:32px; background:rgba(0,80,40,0.55); color:#7effc0; border:none; border-radius:50%; width:22px; height:22px; font-size:12px; line-height:22px; text-align:center; cursor:pointer; opacity:0; transition:opacity .15s; padding:0; z-index:1; } + .card:hover .card-redact-btn { opacity:1; } + .card.list-view .card-redact-btn { position:static; opacity:1; background:transparent; color:#7effc0; flex-shrink:0; } /* Per-card checkbox (select mode) */ .card-cb { position:absolute; top:6px; left:6px; width:16px; height:16px; margin:0; cursor:pointer; z-index:2; @@ -491,6 +494,12 @@ .overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px; background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; } [data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; } + .resolved-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px; + background: #1a3a28; color: #7effc0; font-weight: 600; white-space: nowrap; } + [data-theme="light"] .resolved-badge { background: #d0f5ea; color: #005a3a; } + .card-resolved { opacity: 0.6; } + .resolved-divider { grid-column: 1 / -1; padding: 8px 2px; font-size: 11px; + color: var(--muted); border-top: 1px dashed var(--border); text-align: center; } .email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px; background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; } [data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; } diff --git a/tests/test_google_scan.py b/tests/test_google_scan.py new file mode 100644 index 0000000..55888b4 --- /dev/null +++ b/tests/test_google_scan.py @@ -0,0 +1,311 @@ +""" +Route and engine tests for the Google Workspace scan module. 
+ +Covers: + - GET /api/google/scan/users — auth guard, user list, error propagation + - POST /api/google/scan/start — auth guard, concurrency lock, successful start, lock release + - POST /api/google/scan/cancel — abort signal + - _run_google_scan — no-connector broadcast, CPR hit flagging, source_type tagging +""" +from __future__ import annotations +import threading +import time +from unittest.mock import MagicMock + +import pytest + + +# ── Fixtures ────────────────────────────────────────────────────────────────── + +@pytest.fixture(scope="module") +def flask_app(): + import gdpr_scanner + gdpr_scanner.app.config["TESTING"] = True + gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False + return gdpr_scanner.app + + +@pytest.fixture() +def client(flask_app): + with flask_app.test_client() as c: + yield c + + +@pytest.fixture() +def mock_google_connector(monkeypatch): + from routes import state + conn = MagicMock() + conn.list_users.return_value = [] + monkeypatch.setattr(state, "google_connector", conn) + return conn + + +@pytest.fixture(autouse=True) +def clean_google_state(): + yield + from routes import state + # Release the Google scan lock if a test left it acquired + acquired = state._google_scan_lock.acquire(blocking=False) + if acquired: + state._google_scan_lock.release() + state._google_scan_abort.clear() + + +# ── GET /api/google/scan/users ──────────────────────────────────────────────── + +class TestGoogleScanUsers: + def test_not_connected_returns_401(self, client, monkeypatch): + from routes import state + monkeypatch.setattr(state, "google_connector", None) + r = client.get("/api/google/scan/users") + assert r.status_code == 401 + assert r.json["error"] == "not connected" + + def test_returns_user_list(self, client, mock_google_connector): + mock_google_connector.list_users.return_value = [ + {"id": "1", "email": "alice@test.dk", "displayName": "Alice", "userRole": "student"}, + ] + r = client.get("/api/google/scan/users") + assert r.status_code 
== 200 + assert len(r.json["users"]) == 1 + assert r.json["users"][0]["email"] == "alice@test.dk" + + def test_returns_empty_list_when_no_users(self, client, mock_google_connector): + mock_google_connector.list_users.return_value = [] + r = client.get("/api/google/scan/users") + assert r.status_code == 200 + assert r.json["users"] == [] + + def test_connector_error_returns_500(self, client, mock_google_connector): + mock_google_connector.list_users.side_effect = Exception("Admin SDK unavailable") + r = client.get("/api/google/scan/users") + assert r.status_code == 500 + assert "error" in r.json + + +# ── POST /api/google/scan/start ─────────────────────────────────────────────── + +class TestGoogleScanStart: + def test_not_connected_returns_401(self, client, monkeypatch): + from routes import state + monkeypatch.setattr(state, "google_connector", None) + r = client.post("/api/google/scan/start", json={}) + assert r.status_code == 401 + assert "not connected" in r.json["error"] + + def test_already_running_returns_409(self, client, mock_google_connector): + from routes import state + state._google_scan_lock.acquire() + try: + r = client.post("/api/google/scan/start", json={}) + assert r.status_code == 409 + assert "already running" in r.json["error"] + finally: + state._google_scan_lock.release() + + def test_starts_successfully(self, client, mock_google_connector, monkeypatch): + import routes.google_scan + monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None) + r = client.post("/api/google/scan/start", json={}) + assert r.status_code == 200 + assert r.json["status"] == "started" + + def test_abort_event_cleared_on_start(self, client, mock_google_connector, monkeypatch): + import routes.google_scan + from routes import state + state._google_scan_abort.set() + monkeypatch.setattr(routes.google_scan, "_run_google_scan", lambda opts: None) + client.post("/api/google/scan/start", json={}) + assert not state._google_scan_abort.is_set() + + def 
test_lock_released_after_scan_completes(self, client, mock_google_connector, monkeypatch): + import routes.google_scan + from routes import state + done = threading.Event() + + def _fake_scan(opts): + time.sleep(0.02) + done.set() + + monkeypatch.setattr(routes.google_scan, "_run_google_scan", _fake_scan) + r = client.post("/api/google/scan/start", json={}) + assert r.status_code == 200 + assert done.wait(timeout=3), "Scan thread did not complete in time" + time.sleep(0.05) # allow finally block to run + acquired = state._google_scan_lock.acquire(blocking=False) + assert acquired, "Lock was not released after scan completed" + state._google_scan_lock.release() + + @pytest.mark.filterwarnings("ignore::pytest.PytestUnhandledThreadExceptionWarning") + def test_lock_released_on_scan_exception(self, client, mock_google_connector, monkeypatch): + import routes.google_scan + from routes import state + done = threading.Event() + + def _failing_scan(opts): + done.set() + raise RuntimeError("simulated crash") + + monkeypatch.setattr(routes.google_scan, "_run_google_scan", _failing_scan) + r = client.post("/api/google/scan/start", json={}) + assert r.status_code == 200 + assert done.wait(timeout=3), "Scan thread did not complete in time" + time.sleep(0.05) + acquired = state._google_scan_lock.acquire(blocking=False) + assert acquired, "Lock was not released after scan raised an exception" + state._google_scan_lock.release() + + +# ── POST /api/google/scan/cancel ───────────────────────────────────────────── + +class TestGoogleScanCancel: + def test_sets_abort_event(self, client): + from routes import state + state._google_scan_abort.clear() + r = client.post("/api/google/scan/cancel") + assert r.status_code == 200 + assert r.json["status"] == "cancelling" + assert state._google_scan_abort.is_set() + + def test_idempotent_when_not_running(self, client): + r = client.post("/api/google/scan/cancel") + assert r.status_code == 200 + assert r.json["status"] == "cancelling" + + +# 
── _run_google_scan engine ─────────────────────────────────────────────────── + +class TestRunGoogleScan: + """ + Unit-tests for _run_google_scan() called synchronously with all heavy + dependencies mocked: broadcast, _scan_bytes, DB, checkpoint I/O. + """ + + def _setup_mocks(self, monkeypatch, conn, scan_bytes_result=None): + import gdpr_scanner + import checkpoint + import scan_engine + import gdpr_db + from routes import state + + events = [] + monkeypatch.setattr(state, "google_connector", conn) + monkeypatch.setattr(gdpr_scanner, "broadcast", + lambda evt, data=None: events.append((evt, data or {}))) + monkeypatch.setattr(gdpr_scanner, "_scan_bytes", + lambda data, name: scan_bytes_result or { + "cprs": [], "pii_counts": None, "emails": [], "phones": [] + }) + monkeypatch.setattr(checkpoint, "_load_checkpoint", lambda *a, **kw: None) + monkeypatch.setattr(checkpoint, "_save_checkpoint", lambda *a, **kw: None) + monkeypatch.setattr(checkpoint, "_clear_checkpoint", lambda *a, **kw: None) + monkeypatch.setattr(checkpoint, "_load_delta_tokens", lambda: {}) + monkeypatch.setattr(checkpoint, "_save_delta_tokens", lambda *a: None) + monkeypatch.setattr(scan_engine, "_with_disposition", lambda card, db: card) + monkeypatch.setattr(gdpr_db, "get_db", lambda *a, **kw: None) + + gdpr_scanner.flagged_items.clear() + return events + + def _run(self, monkeypatch, conn, options, scan_bytes_result=None): + import gdpr_scanner + import routes.google_scan as gs + events = self._setup_mocks(monkeypatch, conn, scan_bytes_result) + gs._run_google_scan(options) + gdpr_scanner.flagged_items.clear() + return events + + def test_no_connector_broadcasts_error_and_done(self, monkeypatch): + import gdpr_scanner + import routes.google_scan as gs + from routes import state + events = [] + monkeypatch.setattr(state, "google_connector", None) + monkeypatch.setattr(gdpr_scanner, "broadcast", + lambda evt, data=None: events.append((evt, data or {}))) + gs._run_google_scan({"sources": 
["gmail"], "user_emails": ["a@b.dk"], "options": {}}) + + assert any(evt == "scan_error" for evt, _ in events) + assert any(evt == "google_scan_done" for evt, _ in events) + + def test_gmail_item_with_cpr_is_flagged(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + conn.iter_gmail_messages.return_value = [ + ({"id": "msg1", "name": "report.txt", "size": 1024, "lastModifiedDateTime": "2026-01-01"}, b"content"), + ] + cpr_result = {"cprs": [{"formatted": "010101-1234"}], "pii_counts": None, "emails": [], "phones": []} + events = self._run(monkeypatch, conn, + {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}}, + scan_bytes_result=cpr_result) + + flagged = [d for evt, d in events if evt == "scan_file_flagged"] + assert len(flagged) == 1 + + def test_gmail_item_source_type_is_gmail(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + conn.iter_gmail_messages.return_value = [ + ({"id": "msg2", "name": "invoice.txt", "size": 512, "lastModifiedDateTime": "2026-01-01"}, b"data"), + ] + cpr_result = {"cprs": [{"formatted": "020202-2345"}], "pii_counts": None, "emails": [], "phones": []} + events = self._run(monkeypatch, conn, + {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}}, + scan_bytes_result=cpr_result) + + flagged = [d for evt, d in events if evt == "scan_file_flagged"] + assert flagged[0]["source_type"] == "gmail" + + def test_gmail_item_without_pii_not_flagged(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + conn.iter_gmail_messages.return_value = [ + ({"id": "msg3", "name": "memo.txt", "size": 100}, b"hello world"), + ] + events = self._run(monkeypatch, conn, + {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}}) + + assert not any(evt == "scan_file_flagged" for evt, _ in events) + + def test_gdrive_item_source_type_is_gdrive(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + 
conn.iter_gmail_messages.return_value = [] + conn.iter_drive_files.return_value = [ + ({"id": "file1", "name": "doc.docx", "size": 2048, "lastModifiedDateTime": "2026-01-01"}, b"data"), + ] + cpr_result = {"cprs": [{"formatted": "030303-3456"}], "pii_counts": None, "emails": [], "phones": []} + events = self._run(monkeypatch, conn, + {"sources": ["gmail", "gdrive"], "user_emails": ["a@test.dk"], "options": {}}, + scan_bytes_result=cpr_result) + + gdrive = [d for evt, d in events if evt == "scan_file_flagged" and d.get("source_type") == "gdrive"] + assert len(gdrive) == 1 + + def test_scan_done_always_broadcast(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + conn.iter_gmail_messages.return_value = [] + events = self._run(monkeypatch, conn, + {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}}) + + done = [d for evt, d in events if evt == "google_scan_done"] + assert len(done) == 1 + assert "flagged_count" in done[0] + assert "total_scanned" in done[0] + + def test_scan_done_counts_are_correct(self, monkeypatch): + conn = MagicMock() + conn.list_users.return_value = [] + conn.iter_gmail_messages.return_value = [ + ({"id": "m1", "name": "a.txt", "size": 100}, b"x"), + ({"id": "m2", "name": "b.txt", "size": 100}, b"y"), + ] + cpr_result = {"cprs": [{"formatted": "040404-4567"}], "pii_counts": None, "emails": [], "phones": []} + events = self._run(monkeypatch, conn, + {"sources": ["gmail"], "user_emails": ["a@test.dk"], "options": {}}, + scan_bytes_result=cpr_result) + + done = next(d for evt, d in events if evt == "google_scan_done") + assert done["total_scanned"] == 2 + assert done["flagged_count"] == 2
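
Review note on the two lock-release tests in `TestGoogleScanStart`: both wait a fixed `time.sleep(0.05)` for the route's `finally` block before probing the lock. A minimal sketch of a polling alternative follows, assuming `_google_scan_lock` is a plain `threading.Lock` (its definition is not shown in this diff). `wait_for_lock_release` and `start_scan` are hypothetical names used only for illustration, and `start_scan` is a stand-in for the route's thread handling, not the actual implementation in `routes/google_scan.py`.

```python
import threading
import time


def wait_for_lock_release(lock, timeout=3.0, interval=0.01):
    """Poll until `lock` can be acquired, then release it again.

    Returns True if the lock became free within `timeout` seconds,
    False otherwise. Hypothetical helper, not part of the codebase.
    """
    deadline = time.monotonic() + timeout
    while True:
        if lock.acquire(blocking=False):
            lock.release()  # we only probed; hand the lock straight back
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def start_scan(lock, scan_fn):
    """Hypothetical stand-in for the start route (assumption).

    Takes the lock without blocking, runs `scan_fn` in a worker thread,
    and releases the lock in a `finally` block.
    """
    if not lock.acquire(blocking=False):
        return None  # a scan is already running (the route answers 409)

    def _worker():
        try:
            scan_fn()
        except Exception:
            pass  # swallowed so the demo thread exits quietly
        finally:
            lock.release()  # runs on normal completion and on exception

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread


def _crash():
    raise RuntimeError("simulated crash")


demo_lock = threading.Lock()

# Normal completion: the worker sleeps briefly, then releases the lock.
start_scan(demo_lock, lambda: time.sleep(0.02))
released_after_success = wait_for_lock_release(demo_lock)

# Exception path: the worker raises, and the finally block still releases.
start_scan(demo_lock, _crash)
released_after_crash = wait_for_lock_release(demo_lock)
```

In the tests, a helper of this kind could replace the `time.sleep(0.05)` plus manual `acquire(blocking=False)` pair, so that the assertion waits only as long as needed and fails with a clear timeout on a slow machine.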