StyxX65 d542357855 docs: add #34 user-scoped viewer tokens, remove SUGGESTIONS.md

- CLAUDE.md: document planned user-scoped token scope (account_id filter)
- TODO.md: add #34 spec, drop stale SUGGESTIONS.md reference
- SUGGESTIONS.md: deleted — fully superseded by TODO.md + CLAUDE.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-12 14:28:32 +02:00

4.1 KiB

Raw Blame History

TODO — Pending features and sustainability

Quick overview of what's still to be done.

Recently completed

Memory exhaustion during large M365 scans ✅

Six root causes fixed in scan_engine.py and document_scanner.py:

Email body HTML stripped at collection time (body key deleted from each message dict before it enters work_items; plain text stored as _precomputed_body instead)
work_items list converted to a deque before processing so each item is released immediately after popleft()
del content added in file-processing branch as soon as raw bytes are no longer needed (before NER/PII counting)
del body_text added after email body is fully consumed
PDF OCR page images (PIL.Image) nulled out one by one after OCR instead of holding all pages in RAM
Memory guard using psutil skips file downloads when < 300 MB RAM is available

Still open: The collection phase itself is still a "gather all, then process" loop. For very large tenants (>500k emails) the pre-extracted plain text in work_items could still be significant. The complete fix is to process each user's emails/files inline as they are fetched (generator/streaming pattern) rather than accumulating them into work_items first — estimated 1–2 days of refactor.

Pending

#15 — Scan profiles ✅

Named, reusable scan configurations. Full spec in SUGGESTIONS.md §15.
Size: Large · Priority: High

#23 — Google Workspace role classification + cross-platform identity mapping ✅

Full spec in SUGGESTIONS.md §23.
Size: Large · Priority: Medium

#27 — Migrate i18n format from `.lang` to JSON ✅

Full spec in SUGGESTIONS.md §27.
Size: Medium · Priority: Low

#29 — Rename `skus/` → `classification/` ✅

Full spec in SUGGESTIONS.md §29.
Size: Small · Priority: Low

#33 — Read-only viewer mode with PIN/token URL ✅

A shareable URL (token-protected) or numeric PIN that gives a DPO, school principal, or compliance coordinator read-only access to the results grid — with disposition tagging but without scan controls, credentials, or delete access. Full spec in SUGGESTIONS.md §33.
Size: Medium · Priority: Medium

OneDrive 404 errors — investigate and handle appropriately ✅

404 on drive/root/delta during delta scans was being broadcast as a red scan_error. Root cause: _get() hit raise_for_status() for 404s, which fell through to the generic except Exception handler in _scan_user_onedrive. The full-scan path silently swallowed the same 404 via except Exception: return in _iter_drive_folder_for.

Fixed by adding M365DriveNotFound(M365Error) exception, raising it from _get() on 404, and catching it explicitly in _scan_user_onedrive with a lower-severity scan_phase broadcast ("OneDrive (user): not provisioned — skipped") instead of a red error card.

#34 — User-scoped viewer tokens

Extend viewer token scope from {"role": "student"|"staff"} to also support {"user": "alice@school.dk"}, filtering flagged_items by account_id. Lets a single employee see only their own flagged files.

Infrastructure already in place: account_id is an indexed column on flagged_items, populated for M365 (UPN) and Google (email). File-scan items have account_id = "" and won't appear in user-scoped views — document this in the token-creation UI.

Changes needed:

Token creation UI — add a "specific user" option (email input) alongside the role dropdown
GET /api/db/flagged — filter by account_id when session["viewer_scope"].get("user") is set (same pattern as existing role filter)
Viewer header — show locked identity (similar to locked #filterRole for role-scoped tokens)

Size: Small · Priority: Medium

#32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do

The workflow is sequential (configure → scan → review), not parallel — there is no realistic scenario where a modal and the results grid need to be open simultaneously. The Sources panel is already visible in the sidebar. Option A (the least-work path) still loads the full 3800-line JS stack twice. Closed.

4.1 KiB Raw Blame History Unescape Escape