Fix Google scan not stopping cleanly before a new scan starts
This commit is contained in:
parent
66986a16f9
commit
4e5a8934d7
@ -18,7 +18,9 @@ python -m pytest tests/ -q
|
|||||||
|
|
||||||
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`.
|
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`.
|
||||||
|
|
||||||
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/<uuid>` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic.
|
**Google connector write-back** — `google_connector.py` exposes three methods on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector` for in-place Drive file redaction: `get_drive_file_mime(user_email, file_id) → str`, `download_drive_file_by_id(user_email, file_id) → bytes`, `update_drive_file(user_email, file_id, content, mime_type)`. Backed by module-level helpers (`_get_drive_file_mime`, `_download_drive_file_by_id`, `_update_drive_file_content`). These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403. Do not use `DRIVE_SCOPES` for write operations.
|
||||||
|
|
||||||
|
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/<uuid>` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. Additional single-file I/O methods: `_ssh_connect()` (shared connection builder), `read_file(remote_path) → bytes`, `write_file(remote_path, content)` — used by the redaction route; do not duplicate SSH setup outside these methods.
|
||||||
|
|
||||||
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
|
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
|
||||||
|
|
||||||
@ -48,7 +50,7 @@ python -m pytest tests/ -q
|
|||||||
|
|
||||||
## Tests
|
## Tests
|
||||||
|
|
||||||
182 tests in `tests/`. No integration tests for live M365/Google connections.
|
201 tests in `tests/`. No integration tests for live M365/Google connections.
|
||||||
|
|
||||||
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
|
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
|
||||||
|
|
||||||
@ -195,6 +197,8 @@ Allows reviewing results from any past scan session without running a new scan.
|
|||||||
- **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
|
- **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
|
||||||
- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
|
- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
|
||||||
- **Two separate abort events** — `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path.
|
- **Two separate abort events** — `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path.
|
||||||
|
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — when the Google thread detects the abort signal it broadcasts `google_scan_done` (with `cancelled: True`). Do NOT change this back to `scan_cancelled`. The `scan_cancelled` handler in the frontend unconditionally closes the SSE connection; if a new M365/file scan started while the old Google thread was still winding down, that would drop all remaining events from the new scan. `google_scan_done` is safe because its handler checks `!S._m365ScanRunning && !S._fileScanRunning` before closing the SSE.
|
||||||
|
- **Google Drive uses a lazy generator, not `list()`** — `iter_drive_files()` is iterated directly (no `list()` wrapping) in both the non-delta path and the delta-fallback path. Wrapping in `list()` would block the thread — and the abort check — for the entire file enumeration. The delta start token is recorded *before* the iteration loop so it captures state at the right moment regardless of when the loop is interrupted.
|
||||||
|
|
||||||
## Email sending — routes/email.py + m365_connector.py
|
## Email sending — routes/email.py + m365_connector.py
|
||||||
|
|
||||||
|
|||||||
@ -226,7 +226,17 @@ def _run_google_scan(options: dict):
|
|||||||
|
|
||||||
def _check_abort():
|
def _check_abort():
|
||||||
if _scan_abort.is_set():
|
if _scan_abort.is_set():
|
||||||
broadcast("scan_cancelled", {"completed": total_scanned})
|
# Emit google_scan_done (not scan_cancelled) so that the frontend
|
||||||
|
# google_scan_done handler can decide whether to close the SSE based
|
||||||
|
# on whether other scan types (M365, file) are still running.
|
||||||
|
# scan_cancelled would unconditionally close the SSE connection,
|
||||||
|
# dropping events from a concurrently running new scan.
|
||||||
|
broadcast("google_scan_done", {
|
||||||
|
"flagged_count": total_flagged,
|
||||||
|
"total_scanned": total_scanned,
|
||||||
|
"elapsed_seconds": round(_time.monotonic() - t_start, 1),
|
||||||
|
"cancelled": True,
|
||||||
|
})
|
||||||
return True
|
return True
|
||||||
return False
|
return False
|
||||||
|
|
||||||
@ -355,19 +365,23 @@ def _run_google_scan(options: dict):
|
|||||||
except Exception as delta_err:
|
except Exception as delta_err:
|
||||||
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
|
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive (delta token invalid — full scan)"})
|
||||||
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
|
logger.warning("[gdrive delta] %s: %s — falling back to full scan", user_email, delta_err)
|
||||||
drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb))
|
# Record start token BEFORE iterating so the next delta starts from here
|
||||||
try:
|
try:
|
||||||
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
|
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
|
||||||
except Exception:
|
except Exception:
|
||||||
pass
|
pass
|
||||||
|
# Use a lazy generator (no list()) so _check_abort() fires between items
|
||||||
|
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
|
||||||
else:
|
else:
|
||||||
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
|
broadcast("scan_phase", {"phase": f"{user_email} — Google Drive"})
|
||||||
drive_items = list(conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb))
|
# Record start token BEFORE iterating so the next delta starts from here
|
||||||
if delta_enabled:
|
if delta_enabled:
|
||||||
try:
|
try:
|
||||||
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
|
_new_drive_tokens[delta_key] = conn.get_drive_start_token(user_email)
|
||||||
except Exception:
|
except Exception:
|
||||||
pass
|
pass
|
||||||
|
# Use a lazy generator (no list()) so _check_abort() fires between items
|
||||||
|
drive_items = conn.iter_drive_files(user_email, max_files=max_files, max_file_mb=max_file_mb)
|
||||||
|
|
||||||
for meta, data in drive_items:
|
for meta, data in drive_items:
|
||||||
if _check_abort():
|
if _check_abort():
|
||||||
|
|||||||
@ -579,6 +579,22 @@ function startScan(resume) {
|
|||||||
S._userStartedScan = true;
|
S._userStartedScan = true;
|
||||||
_ensureSSE();
|
_ensureSSE();
|
||||||
|
|
||||||
|
// Revert to idle if every scan type that was supposed to start got rejected.
|
||||||
|
// Called after each 409 so we don't leave the UI stuck in "running" state
|
||||||
|
// while the previous scan's thread finishes winding down.
|
||||||
|
function _onScanConflict(label) {
|
||||||
|
log(label + ' ' + t('scan_already_running_err', 'already running — previous scan still stopping. Please wait and try again.'), 'err');
|
||||||
|
if (label === 'm365') S._m365ScanRunning = false;
|
||||||
|
if (label === 'file') S._fileScanRunning = false;
|
||||||
|
if (label === 'google') S._googleScanRunning = false;
|
||||||
|
if (!S._m365ScanRunning && !S._googleScanRunning && !S._fileScanRunning) {
|
||||||
|
document.getElementById('scanBtn').disabled = false;
|
||||||
|
document.getElementById('stopBtn').style.display = 'none';
|
||||||
|
if (S.es) { S.es.close(); S.es = null; }
|
||||||
|
S._userStartedScan = false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
setTimeout(() => {
|
setTimeout(() => {
|
||||||
// Fire M365 scan if any M365 sources are selected
|
// Fire M365 scan if any M365 sources are selected
|
||||||
if (sources.length > 0) {
|
if (sources.length > 0) {
|
||||||
@ -587,7 +603,7 @@ function startScan(resume) {
|
|||||||
body: JSON.stringify({sources, user_ids, options, resume: !!resume,
|
body: JSON.stringify({sources, user_ids, options, resume: !!resume,
|
||||||
profile_id: S._activeProfileId || null})
|
profile_id: S._activeProfileId || null})
|
||||||
}).then(r => {
|
}).then(r => {
|
||||||
if (r.status === 409) { log('Scan already running', 'err'); }
|
if (r.status === 409) { _onScanConflict('m365'); }
|
||||||
}).catch(e => { log('Scan start failed: ' + e, 'err'); });
|
}).catch(e => { log('Scan start failed: ' + e, 'err'); });
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -608,6 +624,8 @@ function startScan(resume) {
|
|||||||
scan_emails: options.scan_emails || false,
|
scan_emails: options.scan_emails || false,
|
||||||
scan_phones: options.scan_phones || false,
|
scan_phones: options.scan_phones || false,
|
||||||
}))
|
}))
|
||||||
|
}).then(r => {
|
||||||
|
if (r.status === 409) { _onScanConflict('file'); }
|
||||||
}).catch(e => { log('File scan error: ' + e, 'err'); });
|
}).catch(e => { log('File scan error: ' + e, 'err'); });
|
||||||
});
|
});
|
||||||
|
|
||||||
@ -630,7 +648,7 @@ function startScan(resume) {
|
|||||||
options: options
|
options: options
|
||||||
})
|
})
|
||||||
}).then(r => {
|
}).then(r => {
|
||||||
if (r.status === 409) { log('Google scan already running', 'err'); }
|
if (r.status === 409) { _onScanConflict('google'); }
|
||||||
}).catch(e => { log('Google scan error: ' + e, 'err'); });
|
}).catch(e => { log('Google scan error: ' + e, 'err'); });
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user