# GDPRScanner Scans Microsoft 365, Google Workspace, local/network file systems, and SFTP servers for Danish CPR numbers and personal data (PII). Produces GDPR compliance reports and supports Article 30 record-keeping obligations. --- > **Work in progress — not ready for production use.** > This project is under active development and has not been formally tested or audited for production deployment. It is shared publicly for transparency and collaboration. Use at your own risk. --- **Developed by Henrik Højmark** This project was built with substantial assistance from AI (Claude by Anthropic), used as a pair-programming tool throughout development. All design decisions, requirements, testing, and validation were made by the author. The AI generated code under direction — the same way a developer might use a senior colleague or an IDE with intelligent completion. The result is the author's work. --- `gdpr_scanner.py` scans Microsoft 365 cloud sources — Exchange email (including all subfolders), OneDrive, SharePoint, and Teams — for Danish CPR numbers and PII. It connects to the Microsoft Graph API and does not require local file access. ### What it does (M365) - **Scans Exchange mailboxes** — email body and attachments, across **all folders and subfolders** recursively (Inbox, custom folders, nested folders). System folders (Deleted Items, Junk, Drafts, Sent, etc.) are automatically skipped using Exchange `wellKnownName` identifiers (language-independent — works correctly for Danish, German, and other locales) - **OneDrive, SharePoint, Teams** — scans files in all connected sources - **Subfolder prioritisation** — custom subfolders are scanned before Inbox to prevent a large Inbox from exhausting the per-user email cap - **EML attachment preview** — email attachments with CPR hits are listed in the preview panel with per-attachment CPR counts - **Folder path in results** — each email result shows its full folder path (e.g. `Inbox / Ansøgninger pædagog SFO`) in the card and in Excel export - **Delete items** — flagged results can be deleted directly from the UI, individually or in bulk - **CPR false-positive reduction** — strict CPR validation - **Excel export** — multi-tab `.xlsx` report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated **GPS locations** sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, SMB/network shares, and SFTP. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan - **Progressive streaming** — results stream card-by-card via Server-Sent Events as the scan runs - **Token auto-refresh** — expired tokens are detected and silently refreshed mid-scan without interrupting the UI - **Incremental / resumable scans** — interrupted scans save a checkpoint; the next run resumes from where it stopped rather than starting over - **Delta scan** — uses Graph `/delta` endpoints to fetch only changed items since the last scan, cutting API quota usage and scan time on large tenants - **Headless / scheduled mode** — `--headless` flag runs a non-interactive scan and writes an Excel report to disk; combine with cron or Windows Task Scheduler for fully automated compliance scans. **Settings → Scheduler** supports multiple named scan jobs, each with its own frequency (daily/weekly/monthly), time, profile, auto-email, and retention settings. Enable/disable each job with an inline toggle. In application mode, scheduled jobs reconnect automatically without requiring the browser to be open - **EXIF metadata extraction** — GPS coordinates, author, description, device extracted from all scanned images. GPS badge on cards when location data is present. Collapsible EXIF panel in local file previews. No extra dependencies — uses `Pillow` which is already required. - **`--purge`** — permanently deletes all data files created by the scanner (database, credentials, cache); use before decommissioning - **`--export-db`** / **`--import-db`** — export the database to a ZIP archive or restore from one; supports `--import-mode merge` (default) and `--import-mode replace` - **`--reset-db`** — wipe and recreate the database; also clears the checkpoint and delta tokens - **Email report** — send the Excel report by email directly from the UI or via `--email-to` in headless mode. Prefers **Microsoft Graph API** when connected to M365 (no SMTP AUTH needed — requires `Mail.Send` permission). Falls back to `smtplib` SMTP with STARTTLS/SSL support. A **Test** button verifies end-to-end delivery. - **Account name on cards** — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable - **Retention policy enforcement** — flag items older than a configurable retention period with a Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via `--retention-years` - **Data subject lookup** — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext - **CPR cross-referencing** — clicking any flagged card with CPR hits shows a "Related documents" section listing other items from the same scan session that share at least one CPR number, ordered by number of shared CPRs. Clicking any entry opens it in the preview panel. Works in live mode and history mode. Powered by a SQL self-join on the `cpr_index` table — no new data collection required - **Disposition tagging** — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel; **bulk disposition tagging** lets you select multiple cards with checkboxes and apply a disposition to all of them at once. A stats bar above the grid shows total · unreviewed · retain · delete counts and the percentage reviewed - **Interface PIN** — optional session-level PIN that gates the main scanner interface (`/`). Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; unauthenticated visitors are redirected to `/login`. The `/view` viewer route and all viewer API endpoints are exempt — reviewers are unaffected. Salted SHA-256 hash; brute-force protection (5 attempts / 5 min per IP) - **Read-only viewer mode** — share scan results with a DPO or manager via a secure token URL (`/view?token=…`) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings. Tokens can be **role-scoped** (Ansatte / Elever) so a recipient only sees items for their group, or **user-scoped** so an individual employee only sees their own flagged files (supports dual M365 + Google Workspace identity) - **Article 30 report** — one-click export of a structured Word document (`.docx`) satisfying the GDPR Article 30 register of processing activities obligation - **SQLite results database** — scan results, CPR index, PII breakdown, disposition decisions, and scan history are persisted to `~/.gdprscanner/scanner.db` alongside the JSON cache, enabling cross-scan queries and trend tracking - **Built-in user manual** — click the **?** button in the top bar to open the manual in a dedicated window. Available in Danish and English. Printable via the browser's print function. Served from `MANUAL-DA.md` / `MANUAL-EN.md` at `/manual?lang=da|en` — always in sync with the installed version, no internet required. In the packaged desktop app the manual opens as a native pywebview window; in the browser it opens as a popup. --- ## Microsoft 365 See [M365_SETUP.md](docs/setup/M365_SETUP.md) for step-by-step instructions — app registration, permissions, authentication modes, and headless configuration. --- ### M365 Web UI ``` python gdpr_scanner.py [--port PORT] ``` > The scanner expects `templates/` and `static/` in the same directory as `gdpr_scanner.py`. Flask serves `templates/index.html` as the UI. The JavaScript is split across 12 ES modules in `static/js/` (`state.js` + 11 feature modules loaded as `