GDPRScanner/MAINTAINER.md
2026-04-11 04:38:11 +02:00

# Maintainer Guide
*Written for future Henrik — assuming Python proficiency, returning after time away.*
---
## The short version
When something breaks, the structure tells you where to look.
When you want to add something, `SUGGESTIONS.md` has the context.
When you're unsure if a change broke anything, run `pytest tests/`.
---
## Project structure
```
gdpr_scanner.py       Entry point. Flask app, route definitions, blueprint
                      registration, CLI argument handling. Thin coordinator:
                      it imports from the modules below and re-exports them.
sse.py                Server-Sent Events. broadcast(), the SSE queues, and
                      the replay buffer. Touch this if live progress breaks.
checkpoint.py         Scan checkpoint and delta token persistence. Touch this
                      if resume/incremental scanning breaks.
app_config.py         Everything configuration: i18n loading, Article 9
                      keywords, admin PIN, scan profiles, SMTP config, file
                      source definitions, Fernet encryption. Touch this if
                      settings, language, or profiles break.
cpr_detector.py       CPR detection engine. _scan_bytes() dispatches to the
                      right scanner by file type. Touch this if detection
                      accuracy changes or file type support is needed.
scan_engine.py        M365 and file-system scan orchestration. run_scan() and
                      run_file_scan(). The most complex file, ~1000 lines.
                      Touch this for scan behaviour, collection logic, or
                      new M365 sources.
gdpr_db.py            SQLite persistence layer. ScanDB class. Touch this for
                      DB schema changes, new tables, or query logic.
document_scanner.py   CPR regex, NER, OCR, face detection, PDF/DOCX/XLSX
                      scanning. Pre-existing module; treat as a dependency.
                      Avoid modifying unless you really need to.
m365_connector.py     Microsoft Graph API client. Auth, token refresh, all
                      the iter_* fetchers. Touch this for M365 API changes.
google_connector.py   Google Workspace connector. Service account auth, Gmail
                      and Drive iterators. Touch this for Google API changes.

routes/               Flask blueprints, one file per functional area.
  auth.py             M365 sign-in / sign-out / device code flow
  scan.py             /api/scan/start, /api/scan/stop, /api/scan/status
  export.py           Excel and Article 30 Word export
  database.py         DB query endpoints (stats, trend, overdue, subject lookup)
  users.py            User listing, role classification, SKU debug
  sources.py          File source management (local and SMB)
  profiles.py         Scan profile CRUD
  email.py            Email report sending via SMTP / Graph API
  scheduler.py        APScheduler integration
  google_auth.py      Google service account connect / disconnect
  google_scan.py      Google Workspace scan start / cancel / users
  app_routes.py       Misc: about, language selector, settings, delta status

tests/                      pytest test suite: 112 tests, all should pass.
  test_document_scanner.py  CPR detection accuracy and false positive checks
  test_app_config.py        i18n, keywords, config, profiles, encryption
  test_checkpoint.py        Checkpoint and delta token persistence
  test_db.py                Database round-trips, CPR hashing, dispositions
```
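
For orientation, here is a minimal sketch of the queue-plus-replay-buffer pattern that the `sse.py` entry describes. It is not the code in `sse.py`: `subscribe()`, `_subscribers`, `_replay`, and the buffer size of 200 are assumptions made for illustration; only `broadcast()` and the event names come from this guide.

```python
import queue
import threading
from collections import deque

# Hypothetical names; the real sse.py may organise this differently.
_lock = threading.Lock()
_subscribers: list = []      # one queue.Queue per connected browser tab
_replay = deque(maxlen=200)  # most recent events, kept for late joiners


def subscribe() -> queue.Queue:
    """Register a new client queue, pre-filled with the replay buffer."""
    q = queue.Queue()
    with _lock:
        for message in _replay:
            q.put(message)
        _subscribers.append(q)
    return q


def broadcast(event: str, data: dict) -> None:
    """Record an event and push it to every connected client."""
    message = {"event": event, "data": data}
    with _lock:
        _replay.append(message)
        for q in _subscribers:
            q.put(message)
```

In this sketch a client that connects mid-scan first receives the buffered events and then the live ones, which is the behaviour to check when the progress bar stays empty after a page reload.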
---
## When something breaks
**Scan finds nothing / wrong count**
`cpr_detector.py` → `_scan_bytes()` and `_scan_text_direct()`
`scan_engine.py` → `run_scan()` for M365, `run_file_scan()` for files
**Progress bar / live log not updating**
`sse.py` → `broadcast()`
`gdpr_scanner.py` → `scan_stream()`; check `sse._current_scan_id`
`static/app.js` → `_attachScanListeners()` and `scan_progress` handler
**Cards not appearing after scan**
`static/app.js` → `scan_file_flagged` handler → calls `applyFilters()`
`static/app.js` → `scan_done` handler → shows `filterBar`
**Export (Excel / Art.30) fails**
`routes/export.py` → checks `state.flagged_items`, falls back to DB
→ If DB is empty, a scan has not been run or results were cleared
**Authentication / sign-in issues**
`routes/auth.py` for M365
`routes/google_auth.py` for Google Workspace
`gdpr_scanner.py` → `_connector = _state.connector = ...` must stay dual-assigned
**Settings stats show 0 (Scanned / Flagged / Scans)**
`routes/database.py` → `db_stats()`, which queries `flagged_items` and `scans` directly
→ Stats populate from existing DB on app start — no re-scan needed
→ If still 0 after a completed scan: check that `~/.gdprscanner/scanner.db` (legacy location: `~/.gdpr_scanner.db`) exists and is not empty
**File scan results not persisting to DB**
`scan_engine.py` → `run_file_scan()`, which must call `_db.begin_scan()`, not `start_scan()`
→ Check terminal output for `[db] begin_scan failed` to confirm
**Settings / profiles / language not loading**
`app_config.py`
→ Config files live in `~/.gdprscanner/`; see the migration shim in `gdpr_scanner.py` for the legacy `~/` paths
**Scheduled scans not running or not showing in UI**
`scan_scheduler.py` / `scheduler.py`
`routes/scheduler.py`
→ Schedule config: `~/.gdprscanner/schedule.json` (legacy location: `~/.gdpr_scanner_schedule.json`)
---
## Running the tests
```bash
cd GDPRScanner_v1.6.x
pytest tests/
```
Run this before every release and after any change to:
- `document_scanner.py` — CPR detection
- `cpr_detector.py` — file type dispatch
- `gdpr_db.py` — database layer
A failing CPR detection test is a compliance issue, not just a software bug.
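
To illustrate the kind of check these tests guard, here is a self-contained sketch in pytest style. The pattern and the `find_cpr_candidates()` helper are hypothetical and deliberately simplistic; they are not the regex or API of `document_scanner.py`, which also does far more (NER, OCR, and so on). The sketch only assumes the CPR layout of six date digits (DDMMYY) followed by a four-digit sequence.

```python
import re

# Hypothetical pattern for illustration; not the regex used in document_scanner.py.
# DDMMYY, an optional hyphen, then a four-digit sequence number.
CPR_RE = re.compile(r"\b(\d{2})(\d{2})(\d{2})-?(\d{4})\b")


def find_cpr_candidates(text: str) -> list:
    """Return CPR-shaped strings whose day and month fields are plausible."""
    hits = []
    for m in CPR_RE.finditer(text):
        day, month = int(m.group(1)), int(m.group(2))
        if 1 <= day <= 31 and 1 <= month <= 12:
            hits.append(m.group(0))
    return hits


def test_detects_cpr_with_and_without_hyphen():
    assert find_cpr_candidates("CPR: 010190-1234") == ["010190-1234"]
    assert find_cpr_candidates("CPR: 0101901234") == ["0101901234"]


def test_rejects_impossible_month_and_longer_numbers():
    assert find_cpr_candidates("ref 011390-1234") == []       # month 13
    assert find_cpr_candidates("order 12345678901234") == []  # 14 digits
```

The second test is the false-positive side: a detector that flags every ten-digit run would pass the first test and still be unusable in practice.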
---
## Key data files (all in `~/.gdprscanner/`)
All data files live in **`~/.gdprscanner/`** (created automatically on first run).
Existing `~/.gdpr_scanner_*` files are migrated automatically.
| File | Contents |
|---|---|
| `scanner.db` | SQLite — all scan results, CPR index, dispositions, history |
| `config.json` | Azure client ID / tenant ID |
| `settings.json` | Last-used scan options |
| `schedule.json` | Scheduled scan configuration |
| `token.json` | Cached MSAL token (delegated mode) |
| `delta.json` | Microsoft Graph delta tokens |
| `checkpoint.json` | Mid-scan checkpoint (deleted on completion) |
| `smtp.json` | SMTP config (password Fernet-encrypted) |
| `machine_id` | Fernet key for SMTP password — never move without this |
| `role_overrides.json` | Manual staff/student role overrides |
| `google_sa.json` | Google service account key (chmod 600) |
| `google.json` | Google admin email and source toggle state |
| `src_toggles.json` | Source panel toggle state (Email, OneDrive, Gmail, etc.) |
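
The lifecycle of `checkpoint.json` (written mid-scan, deleted on completion) can be pictured with the sketch below. It is an assumption about the general pattern, not the code in `checkpoint.py`: the function names, the atomic-rename approach, and the shape of the state dictionary are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # replaces the target in one step on the same filesystem


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None if there is nothing usable to resume from."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def clear_checkpoint(path: Path) -> None:
    """Remove the checkpoint once the scan has completed."""
    path.unlink(missing_ok=True)
```

Under this pattern, a `checkpoint.json` that is still present while no scan is running means the last scan was interrupted rather than finished.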
---
## The files you will rarely touch
- `document_scanner.py` — treat as a dependency
- `build_gdpr.py` — only when adding new `.py` files to the project (bundle the new file in the `datas` list)
- `install_windows.ps1` / `install_macos.sh` — only when adding new pip dependencies
---
## Adding a new pip dependency
1. Add to `requirements.txt` with a version pin and a comment
2. Add to `install_windows.ps1` (the packages array)
3. Add to `install_macos.sh` (the packages array)
4. If building the app: no change needed — PyInstaller follows imports automatically
---
## The documents that have the history
| Document | What it contains |
|---|---|
| `SUGGESTIONS.md` | Every feature idea, why it was or wasn't implemented, current status |
| `CHANGELOG.md` | What changed in each version, including root causes of bugs fixed |
| `CONTRIBUTING.md` | How to contribute, code style, translation guide |
| `DEPENDENCIES.md` | What each dependency is for and why it was chosen |
When you're unsure why something was done a certain way, read `SUGGESTIONS.md` first.
When you're debugging a regression, read `CHANGELOG.md` for the version where it appeared.
---
## The one thing to know about the module split
`gdpr_scanner.py` imports from all five sub-modules and re-exports them.
The Flask blueprints in `routes/` use `__getattr__` to lazily resolve names
from `gdpr_scanner` — so they work unchanged even though the code moved.
If you add a new function to `app_config.py` or `cpr_detector.py` and need
it accessible from a route blueprint, add it to the matching import block
near the top of `gdpr_scanner.py` (for example the `from app_config import (...)` block).
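
The lazy lookup can be sketched as follows. This is one common way to build it with `__getattr__`, shown under stated assumptions: the `LazyModule` class, the `gdpr_scanner_demo` stand-in module, and `load_profiles` are hypothetical names for the demo, and the blueprints in `routes/` may wire it up differently.

```python
import importlib
import sys
import types


class LazyModule:
    """Resolve attributes from a module on first use instead of at import time."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name

    def __getattr__(self, name: str):
        # Only called when normal lookup fails, so _module_name itself
        # never reaches this method.
        module = importlib.import_module(self._module_name)
        return getattr(module, name)


# The proxy can be created before the target module exists or has finished loading.
core = LazyModule("gdpr_scanner_demo")

# Stand-in for gdpr_scanner.py re-exporting a helper (hypothetical name).
demo = types.ModuleType("gdpr_scanner_demo")
demo.load_profiles = lambda: ["default"]
sys.modules["gdpr_scanner_demo"] = demo

print(core.load_profiles())  # prints ['default']
```

A name that the target module does not expose raises `AttributeError` at the point of use, not at import time. That is why a function missing from the import block in `gdpr_scanner.py` only shows up as an error when the route that needs it is actually called.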
---
*This project was built by Henrik Højmark with AI assistance (Claude by Anthropic)
as a pair-programming tool. All design decisions were made by the author.*