GDPRScanner/README.md
StyxX65 c83d9c8ed5 Docs: update CHANGELOG and
README for macOS CI build + Windows artifact fix

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:34:20 +02:00

40 KiB
Raw Blame History

GDPRScanner

Scans Microsoft 365, Google Workspace, and local/network file systems for Danish CPR numbers and personal data (PII). Produces GDPR compliance reports and supports Article 30 record-keeping obligations.


Developed by Henrik Højmark

This project was built with substantial assistance from AI (Claude by Anthropic), used as a pair-programming tool throughout development. All design decisions, requirements, testing, and validation were made by the author. The AI generated code under direction — the same way a developer might use a senior colleague or an IDE with intelligent completion. The result is the author's work.


gdpr_scanner.py scans Microsoft 365 cloud sources — Exchange email (including all subfolders), OneDrive, SharePoint, and Teams — for Danish CPR numbers and PII. It connects to the Microsoft Graph API and does not require local file access.

What it does (M365)

  • Scans Exchange mailboxes — email body and attachments, across all folders and subfolders recursively (Inbox, custom folders, nested folders). System folders (Deleted Items, Junk, Drafts, Sent, etc.) are automatically skipped using Exchange wellKnownName identifiers (language-independent — works correctly for Danish, German, and other locales)
  • OneDrive, SharePoint, Teams — scans files in all connected sources
  • Subfolder prioritisation — custom subfolders are scanned before Inbox to prevent a large Inbox from exhausting the per-user email cap
  • EML attachment preview — email attachments with CPR hits are listed in the preview panel with per-attachment CPR counts
  • Folder path in results — each email result shows its full folder path (e.g. Inbox / Ansøgninger pædagog SFO) in the card and in Excel export
  • Delete items — flagged results can be deleted directly from the UI, individually or in bulk
  • CPR false-positive reduction — strict CPR validation
  • Excel export — multi-tab .xlsx report with per-source breakdown, auto-filters, and URL hyperlinks. Columns include: Name, CPR Hits, Face count, GPS (✔ if GPS in EXIF), Special category, EXIF author, Folder, Account, Role, Disposition, Date Modified, Size (KB), URL. A dedicated GPS locations sheet lists all items with GPS coordinates including a Google Maps link. Separate tabs for Outlook (Exchange), OneDrive, SharePoint, Teams, Gmail, Google Drive, local folders, and SMB/network shares. Summary sheet shows counts by source and GPS item total. When M365, Google Workspace, and file scans run concurrently, all results are captured in the export — not just the last completed scan
  • Progressive streaming — results stream card-by-card via Server-Sent Events as the scan runs
  • Token auto-refresh — expired tokens are detected and silently refreshed mid-scan without interrupting the UI
  • Incremental / resumable scans — interrupted scans save a checkpoint; the next run resumes from where it stopped rather than starting over
  • Delta scan — uses Graph /delta endpoints to fetch only changed items since the last scan, cutting API quota usage and scan time on large tenants
  • Headless / scheduled mode--headless flag runs a non-interactive scan and writes an Excel report to disk; combine with cron or Windows Task Scheduler for fully automated compliance scans. Settings → Scheduler supports multiple named scan jobs, each with its own frequency (daily/weekly/monthly), time, profile, auto-email, and retention settings. Enable/disable each job with an inline toggle. In application mode, scheduled jobs reconnect automatically without requiring the browser to be open
  • EXIF metadata extraction — GPS coordinates, author, description, device extracted from all scanned images. GPS badge on cards when location data is present. Collapsible EXIF panel in local file previews. No extra dependencies — uses Pillow which is already required.
  • --purge — permanently deletes all data files created by the scanner (database, credentials, cache); use before decommissioning
  • --export-db / --import-db — export the database to a ZIP archive or restore from one; supports --import-mode merge (default) and --import-mode replace
  • --reset-db — wipe and recreate the database; also clears the checkpoint and delta tokens
  • Email report — send the Excel report by email directly from the UI or via --email-to in headless mode. Prefers Microsoft Graph API when connected to M365 (no SMTP AUTH needed — requires Mail.Send permission). Falls back to smtplib SMTP with STARTTLS/SSL support. A Test button verifies end-to-end delivery.
  • Account name on cards — when scanning multiple users, each card displays the owner's display name so results from different mailboxes are instantly distinguishable
  • Retention policy enforcement — flag items older than a configurable retention period with a Overdue badge; supports both rolling and fiscal-year-aligned cutoffs (e.g. Bogføringsloven Dec 31); headless auto-delete via --retention-years
  • Data subject lookup — find all flagged items containing a specific CPR number across all scans; CPR is SHA-256 hashed before querying — never stored in plaintext
  • Disposition tagging — compliance officers can tag each flagged item with a legal basis (retain / delete-scheduled / deleted) directly from the preview panel
  • Read-only viewer mode — share scan results with a DPO or manager via a secure token URL (/view?token=…) or a numeric PIN; viewers see the full results grid and disposition panel but cannot scan, delete, or change settings
  • Article 30 report — one-click export of a structured Word document (.docx) satisfying the GDPR Article 30 register of processing activities obligation
  • SQLite results database — scan results, CPR index, PII breakdown, disposition decisions, and scan history are persisted to ~/.gdprscanner/scanner.db alongside the JSON cache, enabling cross-scan queries and trend tracking
  • Built-in user manual — click the ? button in the top bar to open the manual in a dedicated window. Available in Danish and English. Printable via the browser's print function. Served from MANUAL-DA.md / MANUAL-EN.md at /manual?lang=da|en — always in sync with the installed version, no internet required. In the packaged desktop app the manual opens as a native pywebview window; in the browser it opens as a popup.

Microsoft 365

See M365_SETUP.md for step-by-step instructions — app registration, permissions, authentication modes, and headless configuration.


M365 Web UI

python gdpr_scanner.py [--port PORT]

The scanner expects templates/ and static/ in the same directory as gdpr_scanner.py. Flask serves templates/index.html as the UI. The JavaScript is split across 12 ES modules in static/js/ (state.js + 11 feature modules loaded as <script type="module">). All API routes live in routes/ as Flask Blueprints registered at startup.

Default port: 5100. If that port is already in use the server auto-increments (5101, 5102, …) and logs which port was chosen. Override with --port N. Only one instance may run at a time — a second launch exits immediately with an error rather than corrupting the shared database.

Sources panel

The sidebar sources panel lists all configured scan sources. Click Sources to open the unified Source Management modal. The panel is collapsible (▾/▸ toggle, state persisted) and resizable — drag the handle at the bottom edge to shrink it; the maximum height is automatically capped to show all available sources with no empty space.

Microsoft 365 tab — Azure credentials (Client ID, Tenant ID, Client Secret), auth mode (Application / Delegated), and per-source visibility toggles (Email, OneDrive, SharePoint, Teams). Sources toggled off are hidden from the sidebar panel and excluded from scans.

Google Workspace tab — Two authentication modes: Workspace (service account with domain-wide delegation — scans all users) and Personal account (OAuth 2.0 device-code flow — scans the signed-in account only). Once connected, per-source toggles control whether Gmail and/or Google Drive appear in the sidebar panel and are included in scans. See GOOGLE_SETUP.md for setup instructions.

File sources tab — Add local folder paths or SMB/CIFS network shares with a name, path, and optional SMB credentials. Each saved source appears as a checkbox in the sidebar panel (local, SMB/network). Use the Edit button on each row to update credentials or rename a source without deleting it.

Skipped automatically: .recycle, .sync, .btsync, .trash, .git, node_modules, System Volume Information, and other system/sync folders. Hidden directories (. prefix) are skipped too.

PDF scanning in file scans: PDFs are scanned in a dedicated subprocess spawned via multiprocessing.get_context("spawn") with a 60-second hard timeout. If a PDF's OCR (Tesseract/Poppler) stalls, the subprocess is terminated and the file is skipped with an error card — the scan thread is never blocked. The spawn context is required on macOS + Flask to avoid duplicating the server socket.

Preview panel — opens to the right of the results grid when a card is clicked. The panel is resizable: drag the left edge to adjust its width (min 280 px, max 70% of window). Width is remembered for the session. Click × to close.

Local file preview — clicking a result card renders the file content inline:

Type Preview
PDF First 5 pages as text via pdfplumber, CPR numbers highlighted
XLSX / XLSM / CSV First 50 rows as a table (up to 3 sheets for Excel)
DOCX / DOC First 80 paragraphs as text, CPR numbers highlighted
Images Inline image + collapsible EXIF metadata panel (GPS, author, device, datetime)
TXT / EML / MD / log Full text with CPR highlights

Sources from all tabs can be selected independently in the sidebar before scanning. The selection is saved as part of scan profiles.

User accounts panel

In Delegated mode, accounts are added via the device code flow. In Application mode, the scanner fetches all users in the tenant. Users are listed with checkboxes — all unchecked by default. Use All / None to select or deselect everyone, filter by name with the search field, or add a user manually by email with the + button.

Role classification — users are automatically classified as Student or Staff based on their Microsoft 365 licence. Role badges appear on every account row, on result cards, and in the Article 30 report (separate Staff and Student inventory tables).

Role detection works in two passes:

  1. skuPartNumber fragment match (preferred) — strings like STANDARDWOFFPACK_FACULTY are stable across all Microsoft licensing generations (EA, A1/A3/A5, new commerce/CSP). Runs first whenever part numbers are available.
  2. SKU ID lookup from classification/m365_skus.json — fallback for when part numbers are unavailable or for licences with no recognisable fragment (e.g. Power Automate Free assigned to faculty).

Filter buttonsAll / Ansat / Elev filter the accounts list before selecting who to scan.

SKU debug — the magnifying-glass button next to the role filters opens a modal listing every unique SKU ID in the tenant, colour-coded student / staff / unknown. Unknown IDs can be copied directly into classification/m365_skus.json and take effect on the next restart.

Manual role override — if auto-classification is wrong for a specific user, click the role badge (role badge) on their row to cycle through student → staff → other → (clear). Overrides are stored in ~/.gdpr_scanner_role_overrides.json and persist across restarts. A pencil indicator appears on overridden rows. Click through until the pencil disappears to revert to auto-detection.

classification/m365_skus.json — the SKU ID and fragment file lives in the classification/ folder alongside lang/ and keywords/. Edit it to add new or tenant-specific SKU IDs without any code change; the file is reloaded on every restart.

Date filter

A date-from picker limits the scan to items modified after the selected date. Quick presets: 1 yr / 2 yr / 5 yr / 10 yr / Any. Selecting "Any" sets the date to today (no cutoff).

Options

Option Default Description
Scan email body On Scan the plain-text body of each email
Scan attachments On Scan PDF/Word/Excel attachments inside emails
Max attachment size 20 MB Skip attachments larger than this threshold
Max emails per user 2000 Cap per mailbox to avoid very long scans
Δ Delta scan Off Fetch only changed items since the last scan (see Delta scan below)
Δ Delta scan Off Fetch only changed items since the last scan — hover the ? for details (see Delta scan below)
** Scan photos for faces** Off Detect faces in image files and flag as Art. 9 biometric data — hover the ? for details (see Photo scanning below)
Retention policy Off Flag items older than N years — hover the ? for details (see Retention policy)

Results grid

Each flagged item appears as a card showing:

  • File / subject name
  • CPR hit count badge
  • Source badge (Email / OneDrive / SharePoint / Teams)
  • Source account with role badge (Student / Staff)
  • Modified / received date
  • Folder path — shown for emails (e.g. Inbox / Ansøgninger pædagog SFO)
  • Account name — owner's display name shown on every card when scanning multiple users
  • Overdue badge — amber badge on items exceeding the configured retention cutoff
  • Art.9 badge — purple pill listing detected Article 9 special categories (health, criminal, biometric, etc.)
  • ** N faces** badge — teal pill on image files where face detection found identifiable persons (biometric data)
  • Ext. / **** badge — external email recipient or externally shared file (Art. 4446 transfer risk)
  • delete button — appears on hover (grid view) or always visible (list view)

Filter bar — always visible above both the results grid and the preview panel. Narrow results by source, disposition, transfer risk, and risk level:

Filter Options
Source All / Email / OneDrive / SharePoint / Teams
Disposition All / Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted
Transfer risk All / External recipient / External share / Shared
Risk level All risk levels / Art. 9 special category / Photos / biometric

Delete items

Individual items can be deleted directly from their card (hover to reveal , confirm). Emails are moved to Deleted Items; files go to the recycle bin.

The Delete button in the filter bar opens the Bulk Delete modal, which lets you filter by:

Criterion Description
Source type Email / OneDrive / SharePoint / Teams / All
Min CPR hits Only delete items with at least N CPR numbers found
Older than date Only delete items older than a given date

The Filter overdue quick button pre-populates the date filter with the exact retention cutoff from the database, making it one click to select all overdue items for deletion.

A live preview shows how many items match before you confirm. Errors are reported per-item in the log panel.

Requires write permissions — see Azure permissions above.

Excel export

The ⬇ Excel button exports all current results to a .xlsx file (m365_scan_YYYYMMDD_HHMMSS.xlsx) with five sheets:

Sheet Contents
Summary Scan timestamp, total count, per-source breakdown
Email Flagged emails — Name/Subject, CPR Hits, Folder, Source Account, Date Modified, Size, URL
OneDrive Flagged OneDrive files
SharePoint Flagged SharePoint files
Teams Flagged Teams files

In macOS app builds, the export opens a native Save dialog instead of a browser download.

The Art.30 button generates a GDPR Article 30 Register of Processing Activities as a structured Word document (.docx). See Article 30 report below.

Email report

Configure email delivery in Settings → Email report. Click Save to store your SMTP settings, Test to send a real test email to the configured recipients, and Send now to dispatch the latest scan report. When connected to Microsoft 365, the scanner sends via the Graph API (Mail.Send permission required — add it in Azure AD → App registrations → API permissions). SMTP is used as a fallback when Graph is unavailable.

Field Description
SMTP host e.g. smtp.office365.com, smtp.gmail.com
Port 587 for STARTTLS (default), 465 for SMTPS/SSL
Username SMTP login — usually your sender email address
Password Saved to ~/.gdpr_scanner_smtp.json (permissions 600). Encrypted at rest using Fernet — key in ~/.gdpr_scanner_machine_id (chmod 0o600, never share)
Graph API When connected to M365, email is sent via /me/sendMail (delegated) or /users/{sender}/sendMail (app mode) — no SMTP password needed. Requires Mail.Send Graph permission with admin consent.
From address Sender address (defaults to username if blank)
STARTTLS Enable STARTTLS on port 587 (recommended)
SSL Use SMTPS on port 465 instead
Recipients Comma or semicolon separated list of addresses

Click Save to persist the settings. The password is stored separately from scan settings and never returned to the browser — subsequent loads show "(password saved)". Click Send now to email the report immediately with the current results.

No extra dependencies — uses Python's built-in smtplib. Works with Office 365, Gmail, and any standard SMTP server.

About

Click About in the sidebar footer to see app version, Python version, MSAL version, Requests version, and openpyxl version.


Google Workspace

See GOOGLE_SETUP.md for step-by-step instructions — service account creation, domain-wide delegation, OAuth scopes, and OU-based role classification.


Incremental / resumable scans

If a scan is stopped (via ■ Stop or by closing the app) before it finishes, a checkpoint is saved to ~/.gdpr_scanner_checkpoint.json. The next time you click ▶ Scan with the same configuration, a banner appears above the progress bar:

⏸  Previous scan interrupted — 847 scanned, 12 found  [Resume]  [Start fresh]
  • Resume — skips the 847 already-scanned items, re-emits the 12 previously found cards immediately, and continues from where it left off
  • Start fresh — discards the checkpoint and starts a new full scan

The checkpoint is keyed by a hash of the scan configuration (sources + users + date cutoff). Changing any of those settings automatically starts fresh. The checkpoint is deleted automatically when a scan completes successfully.


Delta scan

Delta scan uses the Microsoft Graph /delta API to fetch only items that have changed since the last scan, dramatically reducing Graph API quota usage and scan time on large tenants.

How it works

  1. Run one full scan first (Delta checkbox off) — this establishes baseline delta tokens
  2. Tick Δ Delta scan and run again — only items added, modified, or deleted since the previous scan are fetched and CPR-scanned
  3. Delta tokens are saved automatically to ~/.gdpr_scanner_delta.json after each successful scan
  4. To force a full rescan, click Clear tokens under the checkbox (or delete the file)

Delta tokens are stored per-source:

Token key Covers
onedrive:{user_id} One user's OneDrive drive
sharepoint:{drive_id} One SharePoint document library
teams:{drive_id} One Teams channel file store
email:{user_id}:{folder_id} One mail folder for one user

If a token expires (Graph returns HTTP 410 Gone), that source falls back to a full collection automatically and a fresh token is saved. Other sources are unaffected.

Deleted items returned by delta (items with a deleted or @removed marker) are skipped during CPR scanning.

After each delta scan, the log panel shows:

Scan complete — 3 flagged of 41  (Δ delta — 6 source(s) indexed)

Delta in headless mode

Pass "delta": true inside the options block of your --settings JSON to enable delta for scheduled scans:

{
  "options": { "delta": true, "older_than_days": 365 }
}

Headless mode (scheduled / automated scans)

Note: The scheduler engine lives in scan_scheduler.py.

Run the scanner without a browser UI for cron jobs and Windows Task Scheduler:

python gdpr_scanner.py --headless --output ~/Reports/ --settings settings.json

See M365_SETUP.md for the full settings file format, CLI flags, and SMTP configuration.


SQLite results database

Scan results are persisted to ~/.gdprscanner/scanner.db (SQLite) automatically after every scan, alongside the existing JSON session cache. The database enables cross-scan queries, trend tracking, and compliance workflows that are impractical with JSON alone.

Tables:

Table Contents
scans One row per completed scan run — sources, user count, options, delta flag
flagged_items One row per flagged file or email — full card data
cpr_index (SHA-256(cpr), item_id, scan_id) — CPR numbers stored as hashes only, never plaintext
pii_hits Per-type PII counts per item (phone, IBAN, name, address, etc.)
dispositions Compliance officer decisions per item
scan_history Aggregated stats per scan for trend tracking

API endpoints: GET /api/db/stats, GET /api/db/trend, GET /api/db/scans, POST /api/db/subject, GET /api/db/overdue, POST /api/db/disposition, GET /api/db/disposition/<id>

If gdpr_db.py is not present, the scanner falls back to JSON-only mode silently.


Data subject lookup

The Data subject lookup button in the sidebar opens a modal where you can search for all flagged items containing a specific CPR number across all scans.

  • Enter a CPR number in DDMMYY-XXXX format and press Enter or click Search
  • Results show file/email name, source type, date, and CPR hit count
  • Delete all for this person button triggers bulk deletion of all matching items and refreshes the grid
  • The CPR number is SHA-256 hashed before querying — it is never stored in plaintext in the database or logs

This directly supports the GDPR right of access (Article 15) and right to erasure (Article 17).


Disposition tagging

Every flagged item can be tagged with a compliance decision from the preview panel. Open any card, and the Disposition dropdown appears below the metadata strip.

Value Meaning
Unreviewed Default — not yet assessed
Retain — legal obligation Must keep (e.g. Bogføringsloven)
Retain — legitimate interest Justified retention, documented
Retain — contract Part of an active contract
Delete — scheduled Mark for deletion at next cleanup run
Deleted Already actioned

Dispositions are saved to the dispositions table in the SQLite database and included in the Article 30 report.


Retention policy enforcement

Enable Retention policy in the options panel to flag items that exceed your retention threshold.

Settings:

Setting Description
Retention years How many years to retain (default: 5)
Fiscal year end Rolling (from today) / 31 Dec (Bogføringsloven) / 30 Jun / 31 Mar

Two cutoff modes:

  • Rolling — exactly N years before today. Correct for GDPR general data minimisation.
  • Fiscal year — N years before the last completed fiscal year end. Correct for Bogføringsloven, which requires records for 5 years from the end of the financial year. A document from January 2020 with a Dec 31 FY must be kept until 31 December 2025, not just until January 2025.

A live hint below the settings shows the exact cutoff date before you scan.

After scanning, items older than the cutoff receive an amber Overdue badge on their card. In the bulk-delete modal, Filter overdue pre-fills the date filter with the exact cutoff for one-click selection.

Headless mode:

python gdpr_scanner.py --headless --output ~/Reports/   --retention-years 5 --fiscal-year-end 12-31

Non-interactive (cron): deletes automatically. Interactive (TTY): prompts for confirmation.


Scan profiles

Named, reusable scan configurations — save the current sidebar state as a profile, then load it in one click or run it headlessly by name.

  • Save — prompts for a name and saves all current settings (sources, options, user selection, retention) as a profile
  • Profile dropdown — switch between saved profiles; applying a profile populates the entire sidebar instantly
  • Profiles button — opens the profile management modal to rename, edit description, duplicate, or delete profiles
  • Profiles persist across restarts in ~/.gdprscanner/settings.json

Headless profile usage:

python gdpr_scanner.py --headless --profile "Nightly email scan"
python gdpr_scanner.py --list-profiles
python gdpr_scanner.py --save-profile "Weekly full scan" --sources email onedrive
python gdpr_scanner.py --delete-profile "Old scan"

Photo / biometric scanning

Enable ** Scan photos for faces** in the Options panel to detect photographs of identifiable persons in OneDrive, SharePoint, and Teams files.

  • Formats: .jpg, .jpeg, .png, .bmp, .tiff, .webp, .heic, .heif
  • Face detection: OpenCV Haar cascade (minNeighbors=8, min_size=80px — conservative; requires " Scan photos for faces" opt-in)
  • EXIF extraction — always-on for images regardless of the face detection toggle:
    • GPS coordinates — extracted and converted to decimal degrees; GPS badge on cards; Google Maps link in preview
    • PII fields — Author, Artist, Copyright, Description, UserComment, Keywords checked for content
    • Device — camera make/model
    • Images with GPS or PII-bearing EXIF are flagged even without CPR hits
    • special_category gains gps_location and/or exif_pii entries
  • GDPR classification: Images with detected faces are automatically tagged as Art. 9 biometric data — the same heightened protection as health or criminal records
  • ** N faces badge** — teal pill on cards; filterable via " Photos / biometric" in the Risk level dropdown
  • Article 30 report — dedicated section listing all photo items with a 4-bullet retention guidance block (purpose limitation, pupil consent under Databeskyttelsesloven §6, website removal, archiving)
  • Excel export — Face count column added
  • Performance: Slower than CPR scanning — opt-in only. Recommended for targeted scans of known image folders rather than full-tenant scans

Datatilsynet guidance: Danish schools have received enforcement actions specifically for unlawful retention of pupil photographs. Pupils under 15 require parental consent (Databeskyttelsesloven §6).


Article 9 special categories

The scanner detects keywords from nine GDPR Article 9 special categories in proximity to CPR numbers:

Category Examples
Health diagnose, sygemelding, behandling, medicin, psykiatri
Mental health depression, angst, stress, selvskade
Criminal records straffeoplysning, dom, straffeattest, sigtelse
Trade union fagforening, tillidsrepræsentant, overenskomst
Religion kirke, moské, religiøs, konfirmation
Ethnicity nationalitet, herkomst, etnicitet
Political opinions politisk, parti, valgkreds
Biometric fingeraftryk, ansigtsgenkendelse, biometrisk
Sexual orientation seksuel orientering

Keywords are loaded from keywords/da.json (Danish). English (en.json) and German (de.json) files can be added without code changes. Detection uses compiled per-category regex patterns for efficient matching.


Database export / import

Export and Import buttons in the sidebar ** Database** section back up or restore the entire compliance record.

# CLI equivalents
python gdpr_scanner.py --export-db ~/compliance/gdpr_export_2026.zip
python gdpr_scanner.py --import-db ~/compliance/gdpr_export_2026.zip
python gdpr_scanner.py --import-db ~/compliance/gdpr_export_2026.zip --import-mode replace --yes

Export ZIP contents:

File Contents
export_meta.json Export date, schema version, row counts
scans.json Scan run summaries
flagged_items.json Flagged items — thumbnails stripped
cpr_index.json CPR hashes (SHA-256 only)
pii_hits.json Per-type PII counts
dispositions.json Compliance decisions with legal basis
scan_history.json Aggregated trend data
deletion_log.json Full deletion audit trail

Import modes: merge (default — adds dispositions and deletion log only, safe on live DB) or replace (full restore, requires --yes).


Article 30 report

The Art.30 button in the filter bar generates a GDPR Article 30 Register of Processing Activities as a Word document (.docx).

Document sections:

Section Contents
Summary Scan date, items scanned, flagged count, CPR hits, estimated data subjects, overdue count, Art. 9 item count, photo/biometric count; per-source breakdown
Data categories Every detected PII type with hit counts and GDPR classification (Art. 9 vs Art. 4)
Data inventory Full item list sorted overdue-first; separate Staff and Student tables; name, source, account, date, CPR hits, disposition
Retention analysis Separate table of overdue items (if any)
Art. 9 special categories Item list with detected category breakdown (if any)
Photographs / biometric data Photo item list with face counts and 4-bullet retention guidance (if photo scanning was enabled)
Compliance trend Last 10 scans with flagged/overdue counts (if scan history exists)
Deletion audit log Every deletion with timestamp, actor, reason, and legal basis
Methodology Scanning approach and GDPR articles referenced (Art. 5, 9, 15, 17, 30)

The document is dated and can be stored as evidence of ongoing compliance activity for supervisory authorities.

Requires python-docx — included in requirements.txt.


Building the desktop app

build_gdpr.py packages gdpr_scanner.py + m365_connector.py + lang/ into a standalone native app using PyInstaller + pywebview.

python build_gdpr.py              # build for the current platform
python build_gdpr.py --icons-only # regenerate icon_gdpr.icns / icon_gdpr.ico
Platform Output Native window
macOS dist/GDPRScanner.app WKWebView
Windows dist/GDPRScanner/GDPRScanner.exe WebView2 (Edge)
Linux dist/GDPRScanner/GDPRScanner GTK WebKit

Cross-compilation is not supported — build on the target platform, or use the pre-built binaries from the GitHub Releases page.

GitHub Actions builds all three platforms automatically on every push to main and on v* tags. Pre-built zips are attached to each release:

File Platform
GDPRScanner_windows_x64.zip Windows 10/11 x64
GDPRScanner_linux_x86_64.zip Ubuntu 22.04+ / Debian
GDPRScanner_macos_x86_64.zip macOS 12+ Intel / Apple Silicon (Rosetta)

macOS Gatekeeper: the app is unsigned. On first launch right-click → Open to bypass the security warning.


Internationalisation

Language files live in lang/ alongside the scripts. As of v1.6.3 they are JSON files:

File Language
lang/en.json English
lang/da.json Danish
lang/de.json German

Auto-detection: On macOS and Linux the system locale is read from defaults read -g AppleLocale / $LANG. The detected language is used automatically.

Manual override: Create ~/.document_scanner_lang (or ~/.m365_scanner_lang for M365) containing just the language code, e.g. da. This persists across restarts.

In-app switcher: A language selector appears in the sidebar footer. Selecting a language saves the override and applies the new translations in place — the page does not reload and scan results are preserved.

Adding a language: Copy lang/en.json, translate all values, save as e.g. lang/fr.json. The app picks it up automatically on next start.

Exchange folder names are returned by Microsoft Graph in the account's own language (e.g. "Indbakke" for Danish users) and are displayed as-is. System folders are skipped using Exchange wellKnownName identifiers which are always in English regardless of locale, so skip logic is language-independent.


Open Source

GDPR Scanner is open source software, licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

This means you are free to use, study, modify, and distribute the software. If you run a modified version as a network service (e.g. a hosted GDPR compliance tool), you must publish the source of your modifications under the same licence.

A commercial licence is available for organisations that need to deploy the software as a managed service without the AGPL source disclosure requirement. Contact the maintainers for details.

Disclaimer: This tool is intended to assist with GDPR compliance activities. It does not constitute legal advice. You are responsible for ensuring your use complies with applicable law.

Contributing

Contributions are welcome — bug fixes, new language files, performance improvements, and items from SUGGESTIONS.md.

Please read CONTRIBUTING.md before submitting a pull request. For security vulnerabilities, follow the process in SECURITY.md — do not file public issues.

# Quick start for contributors
git clone https://github.com/your-org/gdpr-scanner.git
cd gdpr-scanner
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python gdpr_scanner.py    # GDPRScanner on port 5100 (auto-increments if in use)

Test suite

GDPRScanner ships with a pytest test suite covering the CPR detection engine, configuration layer, checkpoint persistence, and the SQLite database.

pip install pytest
pytest tests/

112 tests across 4 modules — all expected to pass.

Module Tests Covers
tests/test_document_scanner.py 36 is_valid_cpr, extract_matches, scan_docx, scan_xlsx, _scan_bytes — CPR detection, false-positive suppression, binary crash safety
tests/test_app_config.py 34 i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption
tests/test_checkpoint.py 18 Checkpoint key stability, save/load/clear, wrong-key isolation, delta token round-trip
tests/test_db.py 24 Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle

Each new module (cpr_detector.py, app_config.py, checkpoint.py, gdpr_db.py) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server.

The test suite should be run before every release and after any change to document_scanner.py, cpr_detector.py, or gdpr_db.py. CPR detection is the legal core of the tool — a false negative means a real GDPR violation goes undetected.

Roadmap

See SUGGESTIONS.md for the full feature roadmap with implementation status.


Project files

File Description
gdpr_scanner.py Flask entry point — scan orchestration, SSE route (/api/scan/stream), root route
scan_engine.py M365 and local/SMB scan logic — run_scan(), run_file_scan()
app_config.py All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption
sse.py SSE broadcast queue and _current_scan_id
checkpoint.py Mid-scan checkpoint save/load, _checkpoint_key()
cpr_detector.py CPR pattern matching and validation
document_scanner.py Core scanning, redaction, OCR, NER, and PII detection engine
gdpr_db.py SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history
m365_connector.py Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods
google_connector.py Google Workspace API client — Gmail, Drive, Admin SDK
file_scanner.py Unified local + SMB/CIFS file iterator — FileScanner.iter_files() yields (path, bytes, metadata). SMB reads use a 1-slot sliding-window ThreadPoolExecutor (PREFETCH_WINDOW=1) with a 60-second per-file timeout.
scan_scheduler.py In-process APScheduler wrapper — multi-job scheduled scan engine
templates/index.html Single-page HTML shell — Jinja2 template. Two variables: app_version, lang_json.
static/style.css All application CSS — custom properties, layout, components, light/dark themes
static/js/state.js Shared mutable state module (export const S) — imported by all 11 feature modules
static/js/*.js 11 ES modules: ui, log, users, auth, profiles, scan, results, sources, scheduler, connector, viewer
static/app.js Archived JS monolith — no longer loaded
routes/__init__.py Blueprint package marker
routes/state.py Shared mutable state (connector, flagged_items, LANG, scan locks) — imported by all blueprints
routes/auth.py /api/auth/* — M365 connect, status, sign-out, config
routes/google_auth.py /api/google/* — Google Workspace connect, status, sign-out
routes/google_scan.py /api/google/scan/* — Google scan execution
routes/scan.py /api/scan/* — start/stop, checkpoint, settings, src toggles
routes/users.py /api/users/* — listing, role overrides, license debug
routes/sources.py /api/file_sources/* and /api/file_scan/start
routes/profiles.py /api/profiles/* and /api/delta/*
routes/scheduler.py /api/scheduler/* — job CRUD, status, history, run-now
routes/email.py /api/smtp/* and /api/send_report
routes/database.py /api/db/*, /api/admin/*, /api/preview, /api/thumb
routes/export.py /api/export_excel, /api/export_article30, /api/delete_bulk
routes/viewer.py /view, /api/viewer/tokens, /api/viewer/pin — read-only viewer mode: token + PIN auth, share-link management
routes/app_routes.py /api/about, /api/langs, /api/lang, /manual
docs/manuals/MANUAL-EN.md End-user manual in English (15 sections) — served at /manual?lang=en
docs/manuals/MANUAL-DA.md End-user manual in Danish (15 sections) — served at /manual?lang=da
docs/setup/M365_SETUP.md Step-by-step Microsoft 365 setup guide
docs/setup/GOOGLE_SETUP.md Step-by-step Google Workspace setup guide
build_gdpr.py PyInstaller build script — generates m365_launcher.py, packages desktop app
lang/en.json English translations (source of truth)
lang/da.json Danish translations (primary language)
lang/de.json German translations
keywords/da.json Danish Article 9 special-category keyword list (454 keywords, 9 categories)
classification/m365_skus.json Microsoft Education SKU IDs and part-number fragments for student/staff role classification — edit to add new SKUs without code changes
classification/google_ou_roles.json Google OU path → role mapping
requirements.txt Python dependency list — use with pip install -r requirements.txt
run_tests.sh Activates venv and runs the full test suite; forwards any extra args to pytest
install_macos.sh Bash installer — Homebrew, Python 3.12, Tesseract, Poppler, ./venv, spaCy model
install_windows.ps1 PowerShell installer — Chocolatey, Python 3.12, Tesseract, Poppler, .\\venv, spaCy model
VERSION Current version number — single source of truth
CHANGELOG.md Release history and versioning policy
LICENSE GNU Affero General Public License v3.0
CONTRIBUTING.md Development setup, code style guide, and pull request process
SECURITY.md How to report security vulnerabilities responsibly
.gitignore Excludes credentials, databases, venv, and build artifacts from version control