Python Dependencies
All Python modules used in the GDPR Scanner project, with a short explanation of each.
Third-party packages (install via pip install -r requirements.txt)
Web server
| Module | Purpose |
|---|---|
| flask | Web server and API routing for the GDPRScanner UI |

Microsoft 365 authentication and API
| Module | Purpose |
|---|---|
| msal | Microsoft Authentication Library; handles the OAuth2 device code flow (delegated) and the client credentials flow (application) for Microsoft Graph API access |
| requests | HTTP client used for all Microsoft Graph API calls |

Google Workspace scanning
| Module | Purpose |
|---|---|
| google-auth | Service account authentication and domain-wide delegation for Google APIs |
| google-auth-httplib2 | HTTP transport adapter for google-auth |
| google-api-python-client | Gmail API, Google Drive API, and Admin Directory API client |

SMB / file system scanning
| Module | Purpose |
|---|---|
| smbprotocol | SMB2/3 network share scanning without requiring a mounted drive; used for Windows file server sources |
| keyring | OS keychain credential storage for SMB passwords |
| python-dotenv | .env file fallback for headless SMB credentials when no keychain is available |

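
The keychain-first, .env-fallback lookup described above could look roughly like the sketch below. The service name `gdprscanner-smb`, the variable name `SMB_PASSWORD`, and the function `load_smb_password` are hypothetical; the project's real names are not given in this document.

```python
import os


def load_smb_password(username, service="gdprscanner-smb", env_var="SMB_PASSWORD"):
    """Return the SMB password from the OS keychain, else from the environment.

    The service and variable names are illustrative assumptions.
    """
    try:
        import keyring  # third-party; may be missing or lack a usable backend

        password = keyring.get_password(service, username)
        if password:
            return password
    except Exception:
        pass  # no keychain available, which is typical on headless servers

    try:
        from dotenv import load_dotenv  # third-party python-dotenv

        load_dotenv()  # loads .env entries into os.environ without overriding
    except ImportError:
        pass

    return os.environ.get(env_var)
```

The third-party imports are deferred so the function still works, using plain environment variables, on a machine where neither package is installed.
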
PDF handling
| Module | Purpose |
|---|---|
| pdfplumber | Text extraction from PDFs with a selectable text layer; fast and accurate for native PDFs |
| pymupdf (fitz) | Physically removes the text layer from PDFs; the preferred GDPR-compliant redaction method |
| pdf2image (optional) | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| pytesseract (optional) | Python wrapper for the Tesseract OCR engine; extracts text from rasterised PDF pages and images |
| pypdf (optional) | PDF metadata reading and low-level page manipulation; used in the document_scanner.py redaction path |
| reportlab (optional) | Fallback PDF redaction via overlay rendering; used when PyMuPDF is unavailable |

Optional packages are not in requirements.txt. Install them manually if you need OCR or the standalone document_scanner.py CLI.

Document formats
| Module | Purpose |
|---|---|
| python-docx | Read and write .docx Word documents; also used to generate the Article 30 Register of Processing Activities report |
| openpyxl | Read and write .xlsx Excel files; used for the scan result export workbook |

Image processing and face detection
| Module | Purpose |
|---|---|
| opencv-python (cv2) | Face detection in images via Haar cascade classifiers; also used for face blurring during anonymisation |
| numpy | Array operations required internally by OpenCV |
| Pillow (PIL) | Image manipulation: thumbnail generation, format conversion, EXIF metadata extraction |

NLP / Named Entity Recognition
| Module | Purpose |
|---|---|
| spacy | NLP engine for Danish Named Entity Recognition; detects person names, addresses, and organisations in text. Requires the da_core_news_lg model (~500 MB) |

Encryption
| Module | Purpose |
|---|---|
| cryptography | Fernet symmetric encryption; encrypts SMTP passwords at rest in ~/.gdprscanner/smtp.json. The Fernet key is derived from ~/.gdprscanner/machine_id |

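
A Fernet key must be 32 bytes, URL-safe base64-encoded. This document does not say how the key is derived from the machine ID, so the sketch below assumes a plain SHA-256 digest purely for illustration; the function names are hypothetical.

```python
import base64
import hashlib


def derive_fernet_key(machine_id: str) -> bytes:
    """Turn a machine ID into a valid Fernet key (assumed derivation)."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()  # 32 bytes
    return base64.urlsafe_b64encode(digest)  # the format Fernet expects


def encrypt_password(password: str, machine_id: str) -> bytes:
    from cryptography.fernet import Fernet  # third-party

    return Fernet(derive_fernet_key(machine_id)).encrypt(password.encode("utf-8"))


def decrypt_password(token: bytes, machine_id: str) -> str:
    from cryptography.fernet import Fernet  # third-party

    return Fernet(derive_fernet_key(machine_id)).decrypt(token).decode("utf-8")
```

Because the key depends only on the machine ID, a copied smtp.json cannot be decrypted on another machine unless the machine_id file is copied with it.
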
Scheduling
| Module | Purpose |
|---|---|
| APScheduler | In-process background scheduler; drives the scheduled scan feature (schedule.json). Uses BackgroundScheduler with CronTrigger |

System monitoring
| Module | Purpose |
|---|---|
| psutil | Available-memory probe in scan_engine.py; skips file downloads when free RAM drops below 300 MB to prevent OOM crashes on large tenants |

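
The memory guard can be sketched as follows. The function names are hypothetical, and treating 300 MB as 300 × 1024 × 1024 bytes is an assumption; scan_engine.py may structure the check differently.

```python
MIN_FREE_BYTES = 300 * 1024 * 1024  # assumed byte value of the 300 MB threshold


def available_memory_bytes() -> int:
    """Ask the OS how much memory can be handed out without swapping."""
    import psutil  # third-party

    return psutil.virtual_memory().available


def should_skip_download(available_bytes: int, threshold: int = MIN_FREE_BYTES) -> bool:
    """Return True when free memory is too low to download another file safely."""
    return available_bytes < threshold
```

A scan loop would call `should_skip_download(available_memory_bytes())` before each download and record the file as skipped rather than risk an out-of-memory crash.
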
Desktop app packaging
| Module | Purpose |
|---|---|
| pywebview | Renders the Flask web UI inside a native OS window, so the macOS .app and Windows .exe builds run without a browser |
| pystray | System tray icon integration for the desktop app builds |
| pyinstaller | Packages the Python application and all dependencies into a standalone executable |
| pyinstaller-hooks-contrib | Community-maintained hooks that help PyInstaller correctly bundle complex packages like spaCy and OpenCV |

Standard library modules (no installation needed)
Data storage
| Module | Purpose |
|---|---|
| sqlite3 | SQLite database; stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in ~/.gdprscanner/scanner.db |
| json | Config files, checkpoint files, language files, API request/response serialisation |
| zipfile | Database export/import archive creation and reading; also used in the PyInstaller build process |
| csv | CSV file scanning support |

Security and hashing
| Module | Purpose |
|---|---|
| hashlib | SHA-256 hashing of CPR numbers before storage; raw CPR values are never written to the database |
| secrets | Cryptographically secure random values; used for viewer token generation and auth state parameters |
| uuid | UUID generation for viewer tokens and scan session identifiers |

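
As a minimal sketch of hashing a CPR number before storage: the helper name `hash_cpr` and the step that strips the hyphen are assumptions, not the project's confirmed behaviour.

```python
import hashlib


def hash_cpr(cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number.

    Stripping the hyphen first (an assumption) makes "010190-1234" and
    "0101901234" map to the same index entry.
    """
    normalised = cpr.strip().replace("-", "")
    return hashlib.sha256(normalised.encode("ascii")).hexdigest()
```

Only the 64-character digest would be stored. Because CPR numbers come from a small value space, a keyed hash (for example `hmac` with a secret key) is harder to brute-force than a bare digest; whether the project uses a key or salt is not stated here.
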
File system and paths
| Module | Purpose |
|---|---|
| pathlib | Cross-platform file and directory path handling throughout the codebase |
| tempfile | Temporary files for PDF and image processing; avoids leaving artefacts on disk |
| shutil | File copy and directory tree operations used in the build scripts |

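
The "no artefacts on disk" pattern usually relies on a context manager that deletes its working directory on exit. A small sketch, with a hypothetical function name and a file-size check standing in for real PDF or image processing:

```python
import tempfile
from pathlib import Path


def process_in_temp_dir(data: bytes, suffix: str = ".pdf"):
    """Write data to a temporary file, process it, and clean up afterwards.

    Returns the processed size and the (now deleted) working directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        path = work_dir / f"input{suffix}"
        path.write_bytes(data)
        size = path.stat().st_size  # stand-in for real processing
    # Leaving the with-block removes the directory and everything in it.
    return size, work_dir
```
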
Networking and email
| Module | Purpose |
|---|---|
| smtplib | SMTP email delivery for the scheduled report feature; supports STARTTLS and SMTPS/SSL |
| email | Email message construction (MIME) for the SMTP report feature |
| socket | UDP probe to determine the machine's LAN IP address; used to build routable share links for viewer tokens |

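
A UDP probe of this kind typically connects a datagram socket to an outside address and reads back the local address the OS selected. The sketch below shows that common technique; the probe target 8.8.8.8, the loopback fallback, and the function name are assumptions.

```python
import socket


def get_lan_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the local IPv4 address used for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only makes the OS
        # pick a route, and with it a local source address.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available, fall back to loopback
    finally:
        sock.close()
```

A share link could then be built as `f"http://{get_lan_ip()}:{port}/..."`, where the port and path depend on the application.
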
Text and pattern matching
| Module | Purpose |
|---|---|
| re | Regular expression engine; used for matching CPR numbers, phone numbers, IBANs, email addresses, and Danish bank account numbers |

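
For illustration, a simplified CPR matcher might look like this. A CPR number has the form DDMMYY-SSSS; the pattern below is an assumption for demonstration, not the scanner's actual expression, and it only checks that the day and month are plausible.

```python
import re

# DDMMYY, an optional hyphen, then a four-digit sequence number.
CPR_PATTERN = re.compile(
    r"\b(0[1-9]|[12]\d|3[01])"  # day 01-31
    r"(0[1-9]|1[0-2])"          # month 01-12
    r"\d{2}"                    # two-digit year
    r"-?\d{4}\b"                # optional hyphen and sequence number
)


def find_cpr_candidates(text: str) -> list:
    """Return every substring that looks like a CPR number."""
    return [match.group(0) for match in CPR_PATTERN.finditer(text)]
```

```python
find_cpr_candidates("CPR: 010190-1234 and 0101901234, not 320190-1234")
# returns ['010190-1234', '0101901234']
```

A pattern like this only yields candidates; stricter date validation would be needed to reduce false positives.
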
Concurrency
| Module | Purpose |
|---|---|
| threading | Background scan thread so the Flask web UI stays responsive during long scans |
| queue | Server-Sent Events message queue; passes scan results from the background thread to the browser |
| concurrent.futures | ProcessPoolExecutor for parallel OCR processing of multi-page PDFs |
| gc | Explicit garbage collection after large scan batches to release memory promptly |

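
The thread-plus-queue hand-off behind Server-Sent Events can be sketched with the standard library alone. All names here are hypothetical, and the "scan" just labels each item.

```python
import json
import queue
import threading

_DONE = object()  # sentinel that tells the stream the scan has finished


def run_scan(items, results):
    """Background worker: push one result per item, then the sentinel."""
    for item in items:
        results.put({"item": item, "status": "scanned"})
    results.put(_DONE)


def sse_stream(results):
    """Yield queued results formatted as Server-Sent Events."""
    while True:
        message = results.get()  # blocks until the worker produces something
        if message is _DONE:
            break
        yield f"data: {json.dumps(message)}\n\n"


def start_scan(items):
    """Start the worker thread and return the event generator."""
    results = queue.Queue()
    threading.Thread(target=run_scan, args=(items, results), daemon=True).start()
    return sse_stream(results)
```

In a Flask route, the generator would be returned as `Response(start_scan(items), mimetype="text/event-stream")`, so the request thread only relays messages while the scan runs elsewhere.
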
I/O and streams
| Module | Purpose |
|---|---|
| io | In-memory byte streams for generating Excel and Word documents without writing to disk |
| struct | Binary data unpacking used in some PDF processing paths |

Date and time
| Module | Purpose |
|---|---|
| time | Unix timestamps for scan records, audit log entries, and token expiry tracking |
| datetime | Human-readable date/time formatting for reports, filenames, and retention cutoff calculations |

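
A retention cutoff is the current time minus the retention period; anything last modified before it is overdue. The sketch below assumes the period is expressed in days, and the filename pattern and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, retention_days: int) -> datetime:
    """Return the oldest timestamp still inside the retention period."""
    return now - timedelta(days=retention_days)


def is_past_retention(modified: datetime, now: datetime, retention_days: int) -> bool:
    """Return True when an item was last modified before the cutoff."""
    return modified < retention_cutoff(now, retention_days)


def report_filename(now: datetime) -> str:
    """Build a timestamped report filename (illustrative pattern)."""
    return f"gdpr_scan_{now:%Y%m%d_%H%M%S}.xlsx"
```

Using timezone-aware datetimes throughout avoids comparison errors between naive and aware values.
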
System and process
| Module | Purpose |
|---|---|
| platform | Detects the operating system for macOS/Windows-specific code paths |
| subprocess | Launches Tesseract and Poppler as external processes for OCR and PDF rendering |
| argparse | CLI argument parsing for --headless, --reset-db, --export-db, --import-db, etc. |
| sys | Python runtime access: sys.exit(), sys.path, sys.version |
| os | Environment variables and low-level file operations |
| logging | Application-level logging; routes warnings and errors to stderr and rotating file handlers |

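
A parser for the flags named above might be declared like this. Only the flag names come from this document; whether each flag is a switch or takes a path, along with the program name and help texts, is assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for the documented flags (argument types assumed)."""
    parser = argparse.ArgumentParser(prog="gdpr_scanner")
    parser.add_argument("--headless", action="store_true",
                        help="run without opening the web UI")
    parser.add_argument("--reset-db", action="store_true",
                        help="delete stored scan data before starting")
    parser.add_argument("--export-db", metavar="PATH",
                        help="write a database export archive to PATH")
    parser.add_argument("--import-db", metavar="PATH",
                        help="load a database export archive from PATH")
    return parser
```

argparse converts the hyphens to underscores, so `--export-db backup.zip` is read back as `args.export_db`.
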
Encoding and serialisation
| Module | Purpose |
|---|---|
| base64 | Encodes thumbnail images as base64 strings for embedding in JSON API responses |

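
JSON cannot carry raw bytes, which is why thumbnails are base64-encoded first. A short sketch, with hypothetical field names and function name:

```python
import base64
import json


def thumbnail_payload(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a JSON document (field names assumed)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"mime_type": mime_type, "thumbnail": encoded})
```

A browser can display the result directly by building a data URL such as `data:image/png;base64,<thumbnail>`.
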
External system dependencies (not Python packages)
These tools are not installed by pip. The installers (install_windows.ps1, install_macos.sh) set them up automatically; install them yourself only if you are setting up the project manually.
| Tool | Purpose |
|---|---|
| Tesseract OCR | The OCR engine called by pytesseract; required for scanning image-based PDFs |
| Tesseract language packs | dan (Danish) and eng (English) language data files for Tesseract |
| Poppler | PDF rendering tools (pdftoppm, pdfinfo) required by pdf2image |