Open Source Landscape — GDPR / PII Document Scanners

An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.


Summary

No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is PII Tools (closed source, SaaS).


Existing open source tools

Microsoft Presidio

A well-maintained PII detection library (not an application) from Microsoft. It supports custom recognisers, so a CPR pattern could be added, and it covers text, images, and structured data via NLP + regex pipelines. It has no M365/GWS connectors, no UI, no reports, and no scheduling: you would have to build the entire scanning application around it. ~9k GitHub stars.
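To make "a CPR pattern could be added" concrete, here is a minimal sketch of the kind of regex a custom recogniser would wrap. It deliberately uses only Python's standard `re` module rather than Presidio's own API, and the names `CPR_CANDIDATE` and `find_cpr_candidates` are hypothetical, not taken from Presidio or GDPRScanner.

```python
import re

# Hypothetical pattern: six digits (DDMMYY), an optional hyphen or space,
# then four digits. The lookarounds reject longer digit runs such as
# order numbers or 8-digit phone numbers.
CPR_CANDIDATE = re.compile(r"(?<![0-9])([0-9]{6})[- ]?([0-9]{4})(?![0-9])")


def find_cpr_candidates(text: str) -> list[tuple[int, str]]:
    """Return (offset, ten_digit_string) for every CPR-shaped token in text."""
    return [
        (match.start(), match.group(1) + match.group(2))
        for match in CPR_CANDIDATE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "CPR: 211062-5629, tlf. 12345678, ordre 12345678901"
    print(find_cpr_candidates(sample))  # [(5, '2110625629')]
```

A pattern like this only finds candidates by shape. It says nothing about whether the digits form a plausible CPR number, which is why a validation step is normally layered on top.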

Octopii

Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.

pdscan / piicatcher

CLI tools that scan databases and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
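As an illustration of what a column-name heuristic looks like, the sketch below flags columns whose names suggest PII. The hint patterns and the `flag_columns` helper are assumptions made for this example. They are not the actual rules used by pdscan or piicatcher, and the NLP sampling step those tools also perform is not shown.

```python
import re

# Hypothetical name hints, including a few Danish terms (tlf, telefon, cpr).
PII_NAME_HINTS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile|tlf|telefon", re.IGNORECASE),
    "national_id": re.compile(r"cpr|ssn|national[-_]?id", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
}


def flag_columns(columns: list[str]) -> dict[str, list[str]]:
    """Map each suspicious column name to the PII labels its name hints at."""
    flagged = {}
    for column in columns:
        labels = [
            label
            for label, pattern in PII_NAME_HINTS.items()
            if pattern.search(column)
        ]
        if labels:
            flagged[column] = labels
    return flagged


if __name__ == "__main__":
    print(flag_columns(["id", "Email_Address", "cpr_nr", "created_at", "FirstName"]))
```

Name-based matching of this kind is cheap but only works where there is a schema to inspect, which is one reason such tools do not carry over to unstructured files and mailboxes.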

"GDPR scanners" on GitHub

Projects such as baudev/gdpr-checker-backend, dev4privacy/gdpr-analyzer, mammuth/gdpr-scanner, and City-of-Helsinki/GDPR-compliance-scanner are all website and cookie compliance scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.

CPR libraries

Several small libraries exist for validating or generating Danish CPR numbers (mathiasvr/danish-ssn, anhoej/cprr, ekstroem/DKcpr). None of them is a document or cloud-storage scanner.


Commercial products that do cover it

Product M365 GWS CPR Article 30 Open source
PII Tools
BigID
Varonis partial
Spirion

PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, and a web UI. It is closed source, with SaaS pricing aimed at enterprises.


Capability comparison

Capability GDPRScanner Presidio Octopii Commercial
M365 (Exchange / OneDrive / SharePoint / Teams)
Google Workspace (Gmail / Drive)
Local / SMB / SFTP partial
Danish CPR with modulus-11 validation plugin only
Email address + phone number detection
GDPR Article 30 report generation partial
Disposition tagging + bulk deletion partial
Scheduled scans
Checkpoint / resume unknown
Read-only viewer / share links partial
Web UI for non-technical staff
Danish-language UI
Open source

What makes GDPRScanner unique

The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
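The two CPR checks named above, modulus-11 validation and a date sanity check, can be sketched as follows. This is an illustrative implementation under stated assumptions, not GDPRScanner's actual code: the function names and the three result labels are hypothetical, and the date check simply accepts any calendar-valid DDMMYY in the 1800s, 1900s, or 2000s instead of decoding the century from the sequence number.

```python
from datetime import date
from typing import Optional

# Standard CPR modulus-11 weights, one per digit of DDMMYYSSSS.
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def normalise(cpr: str) -> Optional[str]:
    """Strip separators; return ten ASCII digits or None if malformed."""
    digits = cpr.replace("-", "").replace(" ", "")
    if len(digits) == 10 and digits.isascii() and digits.isdigit():
        return digits
    return None


def has_plausible_date(digits: str) -> bool:
    """True if DDMMYY is a real calendar date in at least one century."""
    day, month, yy = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    for century in (1800, 1900, 2000):
        try:
            date(century + yy, month, day)
            return True
        except ValueError:
            continue
    return False


def passes_modulus_11(digits: str) -> bool:
    """True if the weighted digit sum is divisible by 11."""
    return sum(int(d) * w for d, w in zip(digits, WEIGHTS)) % 11 == 0


def check_cpr(cpr: str) -> str:
    """Classify as 'invalid', 'date_only' (date ok, checksum fails) or 'valid'."""
    digits = normalise(cpr)
    if digits is None or not has_plausible_date(digits):
        return "invalid"
    return "valid" if passes_modulus_11(digits) else "date_only"


if __name__ == "__main__":
    for candidate in ("211062-5629", "010170-1234", "3111700000"):
        print(candidate, check_cpr(candidate))
```

The `date_only` label is kept separate from `invalid` because, to the best of our knowledge, CPR numbers issued since 2007 are not guaranteed to satisfy the modulus-11 check. A scanner may therefore prefer to treat a checksum failure as lower confidence and not as a hard rejection.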