- Scan history browser (history.js, GET /api/db/sessions, get_sessions(), get_session_items(ref_scan_id)): review any past session without rescanning
- User-scoped viewer tokens (#34): scope by individual employee across M365 and GWS; autocomplete from Accounts list; dual-email support
- Fix: GWS scan never marked finished (end_scan → finish_scan) and emitted wrong SSE event (scan_done → google_scan_done), excluding GWS items from all exports
- Fix: file scan begin_scan called with wrong keyword args (TypeError swallowed), so local/SMB items were never written to DB
- Fix: Graph sendMail reported failure on success; _post() now returns {} on empty 202 response instead of raising JSONDecodeError
- Fix: Graph error hidden behind generic "No SMTP host" message when both Graph and SMTP were unavailable
- Fix: Gmail vs Google Workspace SMTP error messages distinguished by username domain; Workspace errors point to admin console, not personal security settings
- Docs: update README, MANUAL-EN, MANUAL-DA, CLAUDE.md, TODO.md, CHANGELOG.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TODO — Pending features and sustainability
Quick overview of what's still to be done.
Recently completed
Memory exhaustion during large M365 scans ✅
Six root causes fixed in scan_engine.py and document_scanner.py:
- Email body HTML stripped at collection time (body key deleted from each message dict before it enters work_items; plain text stored as _precomputed_body instead)
- work_items list converted to a deque before processing so each item is released immediately after popleft()
- del content added in file-processing branch as soon as raw bytes are no longer needed (before NER/PII counting)
- del body_text added after email body is fully consumed
- PDF OCR page images (PIL.Image) nulled out one by one after OCR instead of holding all pages in RAM
- Memory guard using psutil skips file downloads when < 300 MB RAM is available
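The first two fixes can be illustrated with a minimal sketch. It is not the code in scan_engine.py: the helper names precompute_body and process_all, and the to_text and handle callables, are hypothetical and stand in for whatever the scanner actually uses.

```python
from collections import deque


def precompute_body(message, to_text):
    """Replace the raw HTML body with plain text at collection time.

    `to_text` is a hypothetical HTML-to-text callable supplied by the caller.
    """
    html = message.pop("body", None)  # the raw HTML no longer travels with the item
    message["_precomputed_body"] = to_text(html) if html else ""
    return message


def process_all(work_items, handle):
    """Drain work_items so each item can be freed right after it is handled."""
    queue = deque(work_items)
    work_items.clear()  # the deque is now the only container holding the items
    processed = 0
    while queue:
        item = queue.popleft()  # the queue drops its reference here
        handle(item)
        del item  # drop the local reference before fetching the next item
        processed += 1
    return processed
```

With a plain list and a for loop, every item stays referenced until the loop ends; draining a deque keeps at most one item alive at a time once the caller's list has been cleared.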
Still open: The collection phase itself is still a "gather all, then process" loop. For very large tenants (>500k emails) the pre-extracted plain text in work_items could still be significant. The complete fix is to process each user's emails/files inline as they are fetched (generator/streaming pattern) rather than accumulating them into work_items first — estimated 1–2 days of refactor.
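A rough sketch of the streaming shape described above, assuming a per-user fetch function that yields items lazily. The names iter_user_items, scan_streaming, fetch_items, and scan_item are illustrative and do not exist in the codebase.

```python
def iter_user_items(users, fetch_items):
    """Yield (user, item) pairs one at a time instead of building a work_items list.

    `fetch_items(user)` is a hypothetical callable returning an iterable of items.
    """
    for user in users:
        for item in fetch_items(user):
            yield user, item


def scan_streaming(users, fetch_items, scan_item):
    """Process each item as soon as it is fetched; return the number flagged."""
    flagged = 0
    for user, item in iter_user_items(users, fetch_items):
        if scan_item(user, item):  # hypothetical: returns True when the item is flagged
            flagged += 1
    return flagged
```

Because nothing is accumulated between fetching and processing, peak memory depends on the size of a single item rather than on the size of the tenant.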
Pending
#15 — Scan profiles ✅
Named, reusable scan configurations. Full spec in SUGGESTIONS.md §15.
Size: Large · Priority: High
#23 — Google Workspace role classification + cross-platform identity mapping ✅
Full spec in SUGGESTIONS.md §23.
Size: Large · Priority: Medium
#27 — Migrate i18n format from .lang to JSON ✅
Full spec in SUGGESTIONS.md §27.
Size: Medium · Priority: Low
#29 — Rename skus/ → classification/ ✅
Full spec in SUGGESTIONS.md §29.
Size: Small · Priority: Low
#33 — Read-only viewer mode with PIN/token URL ✅
A shareable URL (token-protected) or numeric PIN that gives a DPO, school principal, or compliance coordinator read-only access to the results grid — with disposition tagging but without scan controls, credentials, or delete access. Full spec in SUGGESTIONS.md §33.
Size: Medium · Priority: Medium
OneDrive 404 errors — investigate and handle appropriately ✅
404 on drive/root/delta during delta scans was being broadcast as a red scan_error. Root cause: _get() hit raise_for_status() for 404s, which fell through to the generic except Exception handler in _scan_user_onedrive. The full-scan path silently swallowed the same 404 via except Exception: return in _iter_drive_folder_for.
Fixed by adding M365DriveNotFound(M365Error) exception, raising it from _get() on 404, and catching it explicitly in _scan_user_onedrive with a lower-severity scan_phase broadcast ("OneDrive (user): not provisioned — skipped") instead of a red error card.
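The shape of that fix, reduced to a self-contained sketch. Only the exception names and the scan_phase / scan_error event names come from the notes above; the fetch and broadcast callables, the function signatures, and the exact message wording are assumptions made so the example runs on its own.

```python
class M365Error(Exception):
    """Base class for Graph-related failures."""


class M365DriveNotFound(M365Error):
    """Raised when the drive endpoint answers 404 (OneDrive not provisioned)."""


def _get(fetch, url):
    # `fetch` is a hypothetical callable returning (status_code, json_payload).
    status, payload = fetch(url)
    if status == 404:
        raise M365DriveNotFound(url)
    if status >= 400:
        raise M365Error(f"HTTP {status} for {url}")
    return payload


def scan_user_onedrive(fetch, user, broadcast):
    """Return the user's delta items, downgrading a 404 to an informational event."""
    try:
        delta = _get(fetch, f"/users/{user}/drive/root/delta")
    except M365DriveNotFound:
        # Specific handler first: a missing drive is expected, not an error.
        broadcast("scan_phase", f"OneDrive ({user}): not provisioned - skipped")
        return []
    except Exception as exc:
        broadcast("scan_error", str(exc))
        return []
    return delta.get("value", [])
```

Catching the subclass before the generic handler is what keeps unprovisioned drives out of the red error cards while other failures are still reported.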
#34 — User-scoped viewer tokens ✅
Viewer token scope extended to {"user": ["m365@…", "gws@…"], "display_name": "Alice Smith"}, filtering flagged_items by account_id IN (list). Lets a single employee see only their own flagged files across both M365 and Google Workspace.
Implemented:
- Scope format: user is a list of email strings (one per platform); display_name stored for UI display. Legacy single-string format coerced to list automatically.
- Token creation UI: scope-type selector (All / Role / User) reveals either the role select or a searchable name autocomplete. Autocomplete filters S._allUsers by display name or email; rows show name + both emails for dual-platform users. Selected user's full name fills the input; both emails stored in the scope.
- GET /api/db/flagged: filters WHERE account_id IN (scope.user set), covering items from both platforms.
- Viewer header: #viewerIdentityBadge shows scope.display_name (full name); #filterRole hidden.
- POST /api/viewer/tokens: validates all entries in scope.user contain @; rejects combined role + user scope.
- Token list: shows display name badge; falls back to emails joined with ", ".
Size: Small · Priority: Medium
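A minimal sketch of the scope handling listed under #34: coercing the legacy format, validating, and filtering. The function names and the two-column flagged_items table are assumptions for illustration; only the scope keys, the account_id column, and the validation rules come from the notes above.

```python
import sqlite3


def validate_scope(scope):
    """Return a normalized copy of scope, or raise ValueError if it is invalid."""
    scope = dict(scope)
    user = scope.get("user")
    if isinstance(user, str):
        scope["user"] = [user]  # legacy single-string format coerced to a list
    if scope.get("role") and scope.get("user"):
        raise ValueError("scope cannot combine role and user")
    for email in scope.get("user") or []:
        if "@" not in email:
            raise ValueError(f"not an email address: {email!r}")
    return scope


def flagged_for_scope(conn, scope):
    """Select flagged items, restricted to the scope's accounts when present."""
    users = scope.get("user") or []
    if not users:
        sql = "SELECT id, account_id FROM flagged_items ORDER BY id"
        return conn.execute(sql).fetchall()
    # One placeholder per email keeps the IN clause fully parameterized.
    placeholders = ", ".join("?" for _ in users)
    sql = (
        "SELECT id, account_id FROM flagged_items "
        f"WHERE account_id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, users).fetchall()
```

Only the placeholder string is interpolated into the SQL; the email values themselves are always passed as bound parameters.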
Scan history browser ✅
Review results from any past scan session without running a new scan.
Implemented:
- gdpr_db.py: get_sessions(limit=50, window_seconds=300) groups scans rows into 300 s windows (same logic as get_session_items) and returns a newest-first list with ref_scan_id (highest scan_id in group), timestamps, sources set, flagged count, total scanned, and a delta flag.
- gdpr_db.py: get_session_items(ref_scan_id=N), when ref_scan_id is given, anchors the 300 s window to that scan's started_at instead of the latest scan.
- GET /api/db/sessions (new endpoint in routes/database.py): returns the sessions list; viewer-mode sessions share the same GET /api/db/flagged?ref=N endpoint with scope enforcement intact.
- static/js/history.js (new module): loadHistorySession(refScanId), openHistoryPicker(), closeHistoryPicker(), exitHistoryMode(), invalidateHistoryCache() all exposed on window.*. Session cache (_sessions) invalidated by all *_done SSE handlers so the picker stays fresh after a new scan.
- History banner (#historyBanner): shows session date/time, sources, item count; "Sessions" button opens picker dropdown; "Latest scan" button appears only when not already viewing the latest.
- Auto-load on page load: results.js calls window.loadHistorySession?.(null) when the SSE watchdog detects !status.running; null resolves to the latest completed session.
- Live→history transition: clicking a session in the picker sets S._historyRefScanId and shows the banner. History→live transition: startScan() calls window.exitHistoryMode?.().
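The window grouping can be pictured with the sketch below. It is an interpretation, not the code in gdpr_db.py: it assumes each scan row is a dict with scan_id, started_at (epoch seconds), and source, and that a session is anchored at its newest scan and absorbs every scan that started within window_seconds before that anchor.

```python
def group_sessions(scans, limit=50, window_seconds=300):
    """Group scan rows into sessions, newest first (assumed anchoring rule)."""
    sessions = []
    for scan in sorted(scans, key=lambda s: s["started_at"], reverse=True):
        current = sessions[-1] if sessions else None
        if current and current["anchor"] - scan["started_at"] <= window_seconds:
            # Same session: started within the window before the anchor scan.
            current["scan_ids"].append(scan["scan_id"])
            current["sources"].add(scan["source"])
            current["ref_scan_id"] = max(current["ref_scan_id"], scan["scan_id"])
        else:
            if len(sessions) == limit:
                break  # older scans would only open sessions beyond the limit
            sessions.append({
                "anchor": scan["started_at"],
                "ref_scan_id": scan["scan_id"],
                "scan_ids": [scan["scan_id"]],
                "sources": {scan["source"]},
            })
    return sessions
```

For example, scans started at t=1000 and t=1100 fall into one session, while a scan at t=5000 forms its own newer session; the older session's ref_scan_id is the larger of its two scan IDs.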
Gmail SMTP error message when App Password already in use ✅
The 535 auth error from Gmail fires for wrong app password, revoked app password, spaces in the 16-char code, and wrong username — all indistinguishable at the SMTP level. The old message unconditionally told users to "create an App Password", which is unhelpful when they already have one. Both the smtp_test and send_report error handlers now emit a Gmail-specific message that lists the three common causes and links to the App Password page for regeneration.
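One way such a handler could choose its message is sketched below. The function name auth_error_hint, the domain set, and the message wording are all illustrative assumptions; the real smtp_test and send_report handlers may decide differently. Only smtplib.SMTPAuthenticationError and its smtp_code attribute are standard-library facts.

```python
import smtplib

# Assumption: personal Gmail accounts are recognised by the username's domain.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def auth_error_hint(exc, username):
    """Return a user-facing hint for an SMTP authentication failure."""
    if exc.smtp_code != 535:
        return f"SMTP authentication failed (code {exc.smtp_code})."
    domain = username.rpartition("@")[2].lower()
    if domain in GMAIL_DOMAINS:
        # A 535 cannot say which cause applies, so the hint lists the usual ones.
        return (
            "Gmail rejected the login. Check that the App Password is correct, "
            "has not been revoked, and contains no spaces, and that the username "
            "is right. If in doubt, regenerate the App Password."
        )
    return "The SMTP server rejected the username or password."
```

The handler would call it from an except smtplib.SMTPAuthenticationError block and show the returned text instead of a fixed "create an App Password" message.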
#32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do
The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.