# GDPRScanner — Build Effort Estimate
Estimated man-hours to build this project from scratch, based on static analysis of v1.6.13.

---
## Codebase Stats
| Metric | Count |
|---|---|
| Source files (excl. dist / build / venv) | ~70 |
| Lines of code (Python + JS + HTML + CSS) | ~25,400 |
| Test lines | ~1,280 (128 tests) |
| Language files | ~2,300 lines (DA / EN / DE) |
| Current version | v1.6.13 |
---
## Estimate by Component
| Component | Key Files | LOC | Hours |
|---|---|---|---|
| **CPR detector** — regex, modulo-11 validation, context filtering, false-positive suppression | `cpr_detector.py` | 446 | 40–60 |
| **Document scanner** — PDF text + OCR, Word, Excel, PowerPoint, images; memory-safe page-by-page processing | `document_scanner.py` | 2,659 | 160–240 |
| **Microsoft 365 connector** — Exchange mail, OneDrive, SharePoint, Teams, delta sync, Microsoft Graph API, MSAL auth | `m365_connector.py`, `scan_engine.py`, `m365_launcher.py` | 2,748 | 240–320 |
| **Google Workspace connector** — Gmail, Google Drive, service account + OAuth 2.0 flows | `google_connector.py`, `routes/google_scan.py`, `routes/google_auth.py` | 1,300 | 120–160 |
| **File / SMB scanner** — local filesystem and network share scanning | `file_scanner.py` | 600 | 40–80 |
| **Database layer** — SQLite schema, migrations, scan sessions, dispositions, delta tracking | `gdpr_db.py` | 954 | 80–120 |
| **Export system** — formatted Excel reports, GDPR Article 30 Word documents | `routes/export.py` | 1,222 | 120–160 |
| **Flask app + SSE + orchestration** — server-sent events, scan threading, checkpointing, resume | `gdpr_scanner.py`, `sse.py`, `checkpoint.py` | 2,400 | 120–160 |
| **Frontend SPA** — 11 ES modules, real-time progress, results viewer, profiles, sources panel, viewer mode | `static/js/*.js`, `templates/index.html`, `static/style.css` | 7,800 | 200–280 |
| **App config + persistence + encryption** — profiles, settings, SMTP, Fernet key, viewer tokens + PIN | `app_config.py` | 794 | 40–80 |
| **Desktop app builder** — PyInstaller packaging for macOS and Windows, embedded webview | `build_gdpr.py` | 1,095 | 80–120 |
| **Scheduler** — cron-like scheduled scans, background thread management | `scan_scheduler.py`, `routes/scheduler.py`, `static/js/scheduler.js` | 1,084 | 40–80 |
| **Auth + viewer mode + roles** — M365 / Google OAuth, viewer tokens, PIN brute-force protection, SKU role classification | `routes/auth.py`, `routes/viewer.py`, `static/js/auth.js`, `static/js/viewer.js` | 750 | 80–120 |
| **Multi-language support** — Danish, English, German UI strings | `lang/da.json`, `lang/en.json`, `lang/de.json` | 2,300 | 40–60 |
| **Test suite** — 128 unit tests | `tests/` | 1,282 | 40–80 |
| **Documentation + CI/CD + install scripts** — GitHub Actions, macOS / Windows installers, user manuals | `docs/`, `.github/`, `*.sh`, `*.ps1` | — | 40–60 |
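
The modulo-11 validation listed for the CPR detector in the table above can be illustrated with a minimal sketch. It is not taken from `cpr_detector.py`: the names `CPR_PATTERN`, `passes_modulo_11`, and `find_cpr_candidates` are hypothetical, and the sample number in the comment is synthetic. The sketch applies the standard weights 4, 3, 2, 7, 6, 5, 4, 3, 2, 1 to the ten digits and accepts a candidate when the weighted sum is divisible by 11. Date plausibility and context filtering are left out, and CPR numbers issued since 2007 are not guaranteed to satisfy the checksum, so a real detector would need more than this check.

```python
import re

# Standard CPR modulo-11 weights, one per digit of DDMMYY-SSSS.
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)

# Hypothetical pattern: six digits, an optional hyphen, four digits.
CPR_PATTERN = re.compile(r"\b(\d{6})-?(\d{4})\b", flags=re.ASCII)


def passes_modulo_11(digits: str) -> bool:
    """Return True if a 10-digit ASCII string satisfies the modulo-11 rule."""
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, CPR_WEIGHTS))
    return total % 11 == 0


def find_cpr_candidates(text: str) -> list[str]:
    """Return the 10-digit candidates in text that pass the checksum."""
    hits = []
    for match in CPR_PATTERN.finditer(text):
        digits = match.group(1) + match.group(2)
        if passes_modulo_11(digits):
            hits.append(digits)
    return hits


# Synthetic example: 010190-1238 has a weighted sum of 88, which is 8 * 11.
print(find_cpr_candidates("ref 010190-1238, not 010190-1234"))
```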
---
## Total Estimate
| Scenario | Hours | Calendar time (1 dev, 40 hrs/wk) | Calendar time (2-person team) |
|---|---|---|---|
| **Low** | ~1,500 | ~9 months | ~5 months |
| **Mid** | ~2,000 | ~12 months | ~6 months |
| **High** | ~2,500 | ~15 months | ~8 months |

The mid estimate (~2,000 hours) is the most realistic for a single senior developer building iteratively toward a v1.6 release.

---
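As a cross-check on the Total Estimate table, the scenario totals can be compared with the per-component figures. The sketch below reads each Hours cell in the component table as a low-to-high range (for example 40 to 60), sums the ranges, and converts the scenario totals to calendar months for one developer at 40 hours per week. The `52 / 12` weeks-per-month factor and the helper names are assumptions, not something the document states.

```python
# (low, high) hour ranges copied from the "Estimate by Component" table.
COMPONENT_HOURS = {
    "CPR detector": (40, 60),
    "Document scanner": (160, 240),
    "Microsoft 365 connector": (240, 320),
    "Google Workspace connector": (120, 160),
    "File / SMB scanner": (40, 80),
    "Database layer": (80, 120),
    "Export system": (120, 160),
    "Flask app + SSE + orchestration": (120, 160),
    "Frontend SPA": (200, 280),
    "App config + persistence + encryption": (40, 80),
    "Desktop app builder": (80, 120),
    "Scheduler": (40, 80),
    "Auth + viewer mode + roles": (80, 120),
    "Multi-language support": (40, 60),
    "Test suite": (40, 80),
    "Documentation + CI/CD + install scripts": (40, 60),
}

HOURS_PER_WEEK = 40
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length


def sum_ranges(ranges):
    """Return (sum of lows, sum of highs) over all components."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def calendar_months(hours, hours_per_week=HOURS_PER_WEEK):
    """Convert effort in hours to calendar months for one developer."""
    return hours / hours_per_week / WEEKS_PER_MONTH


low, high = sum_ranges(COMPONENT_HOURS)
print("component sum:", low, "to", high)
for label, hours in (("Low", 1500), ("Mid", 2000), ("High", 2500)):
    print(label, round(calendar_months(hours), 1), "months")
```

Under these assumptions the component ranges sum to 1,480 and 2,180 hours, and the three scenarios convert to about 8.7, 11.5, and 14.4 months, which the table rounds up to 9, 12, and 15. The Low scenario therefore matches the component floor, while the Mid and High totals sit above the component midpoint (1,830) and ceiling; the document does not itemise that difference.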
## Complexity Drivers
These factors push the estimate beyond what raw line counts suggest:
- **Microsoft Graph API** — Exchange, SharePoint, and Teams scanning involve underdocumented API behaviour, throttling, delta-token management, and permission edge cases. Research and debugging overhead is substantial.
- **CPR validation domain knowledge** — Danish modulo-11 rules, context-aware false-positive filtering, and handling of anonymised or test numbers all require specialised understanding.
- **Memory management at scale** — The `deque`-drain pattern, page-by-page OCR image freeing, and pre-scan memory guards (`psutil`) are non-obvious and emerged through iteration on large tenants.
- **Cross-platform desktop packaging** — Producing a signed `.app` for macOS and an `.exe` for Windows via PyInstaller, with an embedded webview, is a significant and ongoing maintenance burden.
- **SSE + Flask threading** — Scan locking, SSE fan-out, and safe state sharing across threads are difficult to get right; subtle race conditions are easy to introduce.
- **Version iteration** — v1.6.13 represents at least 13 significant release cycles. The first working prototype likely consumed roughly half the total hours; the accumulated refinement accounts for the rest.
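
The SSE and threading driver listed above can be made concrete with a small sketch. It is an illustration under assumptions, not the code in `sse.py` or `gdpr_scanner.py`: `EventBroadcaster`, `try_start_scan`, and `scan_lock` are hypothetical names. Each connected client gets its own bounded queue, the subscriber list is only touched while holding a lock, and a non-blocking lock acquire rejects a second scan instead of letting two run at once. Dropping events for a slow client, rather than blocking the scan thread, is one possible design choice among several.

```python
import queue
import threading


class EventBroadcaster:
    """Fan out events to every subscriber through per-client queues."""

    def __init__(self, max_pending: int = 100):
        self._lock = threading.Lock()
        self._subscribers: list[queue.Queue] = []
        self._max_pending = max_pending

    def subscribe(self) -> queue.Queue:
        """Register a new client and return the queue it should read from."""
        q = queue.Queue(maxsize=self._max_pending)
        with self._lock:
            self._subscribers.append(q)
        return q

    def unsubscribe(self, q: queue.Queue) -> None:
        """Remove a client's queue, for example when its connection closes."""
        with self._lock:
            if q in self._subscribers:
                self._subscribers.remove(q)

    def publish(self, event: str) -> int:
        """Offer event to each subscriber; return how many accepted it."""
        with self._lock:
            targets = list(self._subscribers)  # snapshot taken under the lock
        delivered = 0
        for q in targets:
            try:
                q.put_nowait(event)
                delivered += 1
            except queue.Full:
                pass  # slow client: drop the event instead of blocking
        return delivered


scan_lock = threading.Lock()


def try_start_scan(run_scan) -> bool:
    """Run run_scan() unless another scan already holds the lock."""
    if not scan_lock.acquire(blocking=False):
        return False
    try:
        run_scan()
        return True
    finally:
        scan_lock.release()
```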
---
*Generated 2026-04-11 based on static analysis of GDPRScanner v1.6.13.*