68 lines
4.0 KiB
Markdown
68 lines
4.0 KiB
Markdown
# Open Source Landscape — GDPR / PII Document Scanners
|
|
|
|
An overview of existing open source tools in the same space as GDPRScanner, and where the gaps are.
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
No open source project covers the same combination of M365 + Google Workspace connectors, Danish CPR detection, and GDPR Article 30 reporting in a single web UI. The closest commercial equivalent is [PII Tools](https://pii-tools.com) (closed source, SaaS).
|
|
|
|
---
|
|
|
|
## Existing open source tools
|
|
|
|
### [Microsoft Presidio](https://github.com/microsoft/presidio)
|
|
A well-maintained PII detection *library* (not an application) from Microsoft. Supports custom recognisers — a CPR pattern could be added. Covers text, images, and structured data via NLP + regex pipelines. No M365/GWS connectors, no UI, no reports, no scheduling. You would have to build the entire scanning application around it. ~9k GitHub stars.
|
|
|
|
### [Octopii](https://github.com/redhuntlabs/Octopii)
|
|
Local filesystem / S3 / Apache open-directory scanner using OCR + NLP + regex. Detects passports, government IDs, emails, and addresses in image and document files. No cloud connectors, no CPR awareness, no web UI.
|
|
|
|
### [pdscan](https://github.com/ankane/pdscan) / [piicatcher](https://github.com/tokern/piicatcher)
|
|
CLI tools that scan *databases* and data warehouses for PII columns using column-name heuristics and NLP sampling. No file storage scanning, no email, no cloud connectors.
|
|
|
|
### "GDPR scanners" on GitHub
|
|
Projects such as [baudev/gdpr-checker-backend](https://github.com/baudev/gdpr-checker-backend), [dev4privacy/gdpr-analyzer](https://github.com/dev4privacy/gdpr-analyzer), [mammuth/gdpr-scanner](https://github.com/mammuth/gdpr-scanner), and [City-of-Helsinki/GDPR-compliance-scanner](https://github.com/City-of-Helsinki/GDPR-compliance-scanner) are all **website and cookie compliance** scanners. They check whether a domain sets tracking cookies without consent — a completely different problem.
|
|
|
|
### CPR libraries
|
|
Several small libraries exist for validating or generating Danish CPR numbers ([mathiasvr/danish-ssn](https://github.com/mathiasvr/danish-ssn), [anhoej/cprr](https://github.com/anhoej/cprr), [ekstroem/DKcpr](https://github.com/ekstroem/DKcpr)). None of them are document or cloud-storage scanners.
|
|
|
|
---
|
|
|
|
## Commercial products that do cover it
|
|
|
|
| Product | M365 | GWS | CPR | Article 30 | Open source |
|
|
|---|---|---|---|---|---|
|
|
| [PII Tools](https://pii-tools.com) | ✅ | ✅ | ❌ | ❌ | ❌ |
|
|
| BigID | ✅ | ✅ | ❌ | ❌ | ❌ |
|
|
| Varonis | ✅ | partial | ❌ | ❌ | ❌ |
|
|
| Spirion | ✅ | ❌ | ❌ | ❌ | ❌ |
|
|
|
|
PII Tools is the most direct commercial equivalent: Graph API + GWS service account connectors, document scanning, web UI. Closed source, SaaS pricing targeted at enterprise.
|
|
|
|
---
|
|
|
|
## Capability comparison
|
|
|
|
| Capability | GDPRScanner | Presidio | Octopii | Commercial |
|
|
|---|---|---|---|---|
|
|
| M365 (Exchange / OneDrive / SharePoint / Teams) | ✅ | ❌ | ❌ | ✅ |
|
|
| Google Workspace (Gmail / Drive) | ✅ | ❌ | ❌ | ✅ |
|
|
| Local / SMB / SFTP | ✅ | ❌ | partial | ✅ |
|
|
| Danish CPR with modulus-11 validation | ✅ | plugin only | ❌ | ❌ |
|
|
| Email address + phone number detection | ✅ | ✅ | ✅ | ✅ |
|
|
| GDPR Article 30 report generation | ✅ | ❌ | ❌ | partial |
|
|
| Disposition tagging + bulk deletion | ✅ | ❌ | ❌ | partial |
|
|
| Scheduled scans | ✅ | ❌ | ❌ | ✅ |
|
|
| Checkpoint / resume | ✅ | ❌ | ❌ | unknown |
|
|
| Read-only viewer / share links | ✅ | ❌ | ❌ | partial |
|
|
| Web UI for non-technical staff | ✅ | ❌ | ❌ | ✅ |
|
|
| Danish-language UI | ✅ | ❌ | ❌ | ❌ |
|
|
| Open source | ✅ | ✅ | ✅ | ❌ |
|
|
|
|
---
|
|
|
|
## What makes GDPRScanner unique
|
|
|
|
The combination of Danish CPR specificity (modulus-11 validation, date sanity checks), M365 + Google Workspace connectors in a single tool, and GDPR Article 30 output is the gap no open source project fills. The Danish public-sector target audience (schools, municipalities) also drives requirements — role classification (student/staff), Danish-language UI, municipal data retention rules — that no general-purpose PII tool addresses.
|