diff --git a/CLAUDE.md b/CLAUDE.md index d958ebb..f5a5c6f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,7 +18,9 @@ python -m pytest tests/ -q **Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`. -**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. +**Google connector write-back** — `google_connector.py` exposes three methods on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector` for in-place Drive file redaction: `get_drive_file_mime(user_email, file_id) → str`, `download_drive_file_by_id(user_email, file_id) → bytes`, `update_drive_file(user_email, file_id, content, mime_type)`. Backed by module-level helpers (`_get_drive_file_mime`, `_download_drive_file_by_id`, `_update_drive_file_content`). These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403. Do not use `DRIVE_SCOPES` for write operations. + +**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. Additional single-file I/O methods: `_ssh_connect()` (shared connection builder), `read_file(remote_path) → bytes`, `write_file(remote_path, content)` — used by the redaction route; do not duplicate SSH setup outside these methods. **Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code. @@ -48,7 +50,7 @@ python -m pytest tests/ -q ## Tests -182 tests in `tests/`. No integration tests for live M365/Google connections. +201 tests in `tests/`. No integration tests for live M365/Google connections. **`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test. @@ -195,6 +197,8 @@ Allows reviewing results from any past scan session without running a new scan. - **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive. - **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`. - **Two separate abort events** — `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path. +- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — when the Google thread detects the abort signal it broadcasts `google_scan_done` (with `cancelled: True`). Do NOT change this back to `scan_cancelled`. The `scan_cancelled` handler in the frontend unconditionally closes the SSE connection; if a new M365/file scan started while the old Google thread was still winding down, that would drop all remaining events from the new scan. `google_scan_done` is safe because its handler checks `!S._m365ScanRunning && !S._fileScanRunning` before closing the SSE. +- **Google Drive uses a lazy generator, not `list()`** — `iter_drive_files()` is iterated directly (no `list()` wrapping) in both the non-delta path and the delta-fallback path. Wrapping in `list()` would block the thread — and the abort check — for the entire file enumeration. The delta start token is recorded *before* the iteration loop so it captures state at the right moment regardless of when the loop is interrupted. ## Email sending — routes/email.py + m365_connector.py diff --git a/routes/google_scan.py b/routes/google_scan.py index 3293ee3..8b973a5 100644 --- a/routes/google_scan.py +++ b/routes/google_scan.py @@ -226,7 +226,17 @@ def _run_google_scan(options: dict): def _check_abort(): if _scan_abort.is_set(): - broadcast("scan_cancelled", {"completed": total_scanned}) + # Emit google_scan_done (not scan_cancelled) so that the frontend + # google_scan_done handler can decide whether to close the SSE based + # on whether other scan types (M365, file) are still running. + # scan_cancelled would unconditionally close the SSE connection, + # dropping events from a concurrently running new scan. + broadcast("google_scan_done", { + "flagged_count": total_flagged, + "total_scanned": total_scanned, + "elapsed_seconds": round(_time.monotonic() - t_start, 1), + "cancelled": True, + }) return True return False @@ -355,19 +365,23 @@ def _run_google_scan(options: dict): except Exception as delta_err: broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"}) logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err) - drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)) + # Record start token BEFORE iterating so the next delta starts from here try: _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) except Exception: pass + # Use a lazy generator (no list()) so _check_abort() fires between items + drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb) else: broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"}) - drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)) + # Record start token BEFORE iterating so the next delta starts from here if delta_enabled: try: _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) except Exception: pass + # Use a lazy generator (no list()) so _check_abort() fires between items + drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb) for meta, data in drive_items: if _check_abort(): diff --git a/static/js/scan.js b/static/js/scan.js index b6c54ca..7d6f9da 100644 --- a/static/js/scan.js +++ b/static/js/scan.js @@ -579,6 +579,22 @@ function startScan(resume) { S._userStartedScan = true; _ensureSSE(); + // Revert to idle if every scan type that was supposed to start got rejected. + // Called after each 409 so we don't leave the UI stuck in "running" state + // while the previous scan's thread finishes winding down. + function _onScanConflict(label) { + log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err'); + if (label === 'm365') S._m365ScanRunning = false; + if (label === 'file') S._fileScanRunning = false; + if (label === 'google') S._googleScanRunning = false; + if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) { + document.getElementById('scanBtn').disabled = false; + document.getElementById('stopBtn').style.display = 'none'; + if (S.es) { S.es.close(); S.es = null; } + S._userStartedScan = false; + } + } + setTimeout(() => { // Fire M365 scan if any M365 sources are selected if (sources.length > 0) { @@ -587,7 +603,7 @@ function startScan(resume) { body: JSON.stringify({sources, user_ids, options, resume: !!resume, profile_id: S._activeProfileId || null}) }).then(r => { - if (r.status === 409) { log('Scan already running', 'err'); } + if (r.status === 409) { _onScanConflict('m365'); } }).catch(e => { log('Scan start failed: ' + e, 'err'); }); } @@ -608,6 +624,8 @@ function startScan(resume) { scan_emails: options.scan_emails || false, scan_phones: options.scan_phones || false, })) + }).then(r => { + if (r.status === 409) { _onScanConflict('file'); } }).catch(e => { log('File scan error: ' + e, 'err'); }); }); @@ -630,7 +648,7 @@ function startScan(resume) { options: options }) }).then(r => { - if (r.status === 409) { log('Google scan already running', 'err'); } + if (r.status === 409) { _onScanConflict('google'); } }).catch(e => { log('Google scan error: ' + e, 'err'); }); }