Fix Google scan not stopping cleanly before a new scan starts

This commit is contained in:
StyxX65 2026-05-29 04:53:42 +02:00
parent 66986a16f9
commit 4e5a8934d7
3 changed files with 43 additions and 7 deletions

View File

@ -18,7 +18,9 @@ python -m pytest tests/ -q
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`. **Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/<uuid>` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. **Google connector write-back** — `google_connector.py` exposes three methods on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector` for in-place Drive file redaction: `get_drive_file_mime(user_email, file_id) → str`, `download_drive_file_by_id(user_email, file_id) → bytes`, `update_drive_file(user_email, file_id, content, mime_type)`. Backed by module-level helpers (`_get_drive_file_mime`, `_download_drive_file_by_id`, `_update_drive_file_content`). These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403. Do not use `DRIVE_SCOPES` for write operations.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/<uuid>` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. Additional single-file I/O methods: `_ssh_connect()` (shared connection builder), `read_file(remote_path) → bytes`, `write_file(remote_path, content)` — used by the redaction route; do not duplicate SSH setup outside these methods.
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code. **Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
@ -48,7 +50,7 @@ python -m pytest tests/ -q
## Tests ## Tests
182 tests in `tests/`. No integration tests for live M365/Google connections. 201 tests in `tests/`. No integration tests for live M365/Google connections.
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test. **`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
@ -195,6 +197,8 @@ Allows reviewing results from any past scan session without running a new scan.
- **Scheduled scans**`S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive. - **Scheduled scans**`S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`. - **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
- **Two separate abort events**`state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path. - **Two separate abort events**`state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — when the Google thread detects the abort signal it broadcasts `google_scan_done` (with `cancelled: True`). Do NOT change this back to `scan_cancelled`. The `scan_cancelled` handler in the frontend unconditionally closes the SSE connection; if a new M365/file scan started while the old Google thread was still winding down, that would drop all remaining events from the new scan. `google_scan_done` is safe because its handler checks `!S._m365ScanRunning && !S._fileScanRunning` before closing the SSE.
- **Google Drive uses a lazy generator, not `list()`**`iter_drive_files()` is iterated directly (no `list()` wrapping) in both the non-delta path and the delta-fallback path. Wrapping in `list()` would block the thread — and the abort check — for the entire file enumeration. The delta start token is recorded *before* the iteration loop so it captures state at the right moment regardless of when the loop is interrupted.
## Email sending — routes/email.py + m365_connector.py ## Email sending — routes/email.py + m365_connector.py

View File

@ -226,7 +226,17 @@ def _run_google_scan(options: dict):
def _check_abort(): def _check_abort():
if _scan_abort.is_set(): if _scan_abort.is_set():
broadcast("scan_cancelled", {"completed": total_scanned}) # Emit google_scan_done (not scan_cancelled) so that the frontend
# google_scan_done handler can decide whether to close the SSE based
# on whether other scan types (M365, file) are still running.
# scan_cancelled would unconditionally close the SSE connection,
# dropping events from a concurrently running new scan.
broadcast("google_scan_done", {
"flagged_count": total_flagged,
"total_scanned": total_scanned,
"elapsed_seconds": round(_time.monotonic() - t_start, 1),
"cancelled": True,
})
return True return True
return False return False
@ -355,19 +365,23 @@ def _run_google_scan(options: dict):
except Exception as delta_err: except Exception as delta_err:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"}) broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err) logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)) # Record start token BEFORE iterating so the next delta starts from here
try: try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception: except Exception:
pass pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
else: else:
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"}) broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)) # Record start token BEFORE iterating so the next delta starts from here
if delta_enabled: if delta_enabled:
try: try:
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email) _new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
except Exception: except Exception:
pass pass
# Use a lazy generator (no list()) so _check_abort() fires between items
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
for meta, data in drive_items: for meta, data in drive_items:
if _check_abort(): if _check_abort():

View File

@ -579,6 +579,22 @@ function startScan(resume) {
S._userStartedScan = true; S._userStartedScan = true;
_ensureSSE(); _ensureSSE();
// Revert to idle if every scan type that was supposed to start got rejected.
// Called after each 409 so we don't leave the UI stuck in "running" state
// while the previous scan's thread finishes winding down.
function _onScanConflict(label) {
log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err');
if (label === 'm365') S._m365ScanRunning = false;
if (label === 'file') S._fileScanRunning = false;
if (label === 'google') S._googleScanRunning = false;
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
document.getElementById('scanBtn').disabled = false;
document.getElementById('stopBtn').style.display = 'none';
if (S.es) { S.es.close(); S.es = null; }
S._userStartedScan = false;
}
}
setTimeout(() => { setTimeout(() => {
// Fire M365 scan if any M365 sources are selected // Fire M365 scan if any M365 sources are selected
if (sources.length > 0) { if (sources.length > 0) {
@ -587,7 +603,7 @@ function startScan(resume) {
body: JSON.stringify({sources, user_ids, options, resume: !!resume, body: JSON.stringify({sources, user_ids, options, resume: !!resume,
profile_id: S._activeProfileId || null}) profile_id: S._activeProfileId || null})
}).then(r => { }).then(r => {
if (r.status === 409) { log('Scan already running', 'err'); } if (r.status === 409) { _onScanConflict('m365'); }
}).catch(e => { log('Scan start failed: ' + e, 'err'); }); }).catch(e => { log('Scan start failed: ' + e, 'err'); });
} }
@ -608,6 +624,8 @@ function startScan(resume) {
scan_emails: options.scan_emails || false, scan_emails: options.scan_emails || false,
scan_phones: options.scan_phones || false, scan_phones: options.scan_phones || false,
})) }))
}).then(r => {
if (r.status === 409) { _onScanConflict('file'); }
}).catch(e => { log('File scan error: ' + e, 'err'); }); }).catch(e => { log('File scan error: ' + e, 'err'); });
}); });
@ -630,7 +648,7 @@ function startScan(resume) {
options: options options: options
}) })
}).then(r => { }).then(r => {
if (r.status === 409) { log('Google scan already running', 'err'); } if (r.status === 409) { _onScanConflict('google'); }
}).catch(e => { log('Google scan error: ' + e, 'err'); }); }).catch(e => { log('Google scan error: ' + e, 'err'); });
} }