GDPRScanner/DEPENDENCIES.md
2026-04-11 04:38:11 +02:00

Python Dependencies

All Python modules used in the GDPR Scanner project, with a short explanation of each.

Third-party packages (install via pip install -r requirements.txt)

Web server

| Module | Purpose |
|---|---|
| flask | Web server and API routing for the GDPRScanner UI |

Microsoft 365 authentication and API

| Module | Purpose |
|---|---|
| msal | Microsoft Authentication Library; handles the OAuth2 device code flow (delegated) and the client credentials flow (application) for Microsoft Graph API access |
| requests | HTTP client used for all Microsoft Graph API calls |

PDF handling

| Module | Purpose |
|---|---|
| pdfplumber | Text extraction from PDFs with a selectable text layer; fast and accurate for native PDFs |
| pdf2image | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
| pytesseract | Python wrapper for the Tesseract OCR engine; extracts text from rasterised PDF pages and images |
| pypdf | PDF metadata reading and low-level page manipulation |
| reportlab | Fallback PDF redaction via overlay rendering; used when PyMuPDF is unavailable |
| pymupdf (fitz) | Physically removes the text layer from PDFs; the preferred GDPR-compliant redaction method |
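
The preference order between PyMuPDF and reportlab could be resolved at startup roughly as follows. This is a minimal sketch, not the project's actual code: the helper name `pick_redaction_backend` and the returned labels are hypothetical, and it assumes PyMuPDF is importable as `fitz`.

```python
from importlib.util import find_spec


def _installed(module_name: str) -> bool:
    """True if the module can be found without importing it."""
    return find_spec(module_name) is not None


def pick_redaction_backend(is_available=_installed) -> str:
    """Return a label for the best available redaction backend.

    Hypothetical helper: prefers PyMuPDF (module ``fitz``), which removes
    the text itself, and falls back to a reportlab overlay.
    """
    if is_available("fitz"):
        return "pymupdf"
    if is_available("reportlab"):
        return "reportlab"
    raise RuntimeError("no PDF redaction backend installed (pymupdf or reportlab)")
```

Passing the availability check in as a parameter keeps the selection logic testable on a machine where neither package is installed.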

Document formats

| Module | Purpose |
|---|---|
| python-docx | Read and write .docx Word documents; also used to generate the Article 30 Register of Processing Activities report |
| openpyxl | Read and write .xlsx Excel files; used for the scan result export workbook |
| img2pdf | Converts images to PDF for archiving redacted output |

Image processing and face detection

| Module | Purpose |
|---|---|
| opencv-python (cv2) | Face detection in images via Haar cascade classifiers; also used for face blurring during anonymisation |
| numpy | Array operations required internally by OpenCV |
| Pillow (PIL) | Image manipulation: thumbnail generation, format conversion, image resizing |

NLP / Named Entity Recognition

| Module | Purpose |
|---|---|
| spacy | NLP engine for Danish Named Entity Recognition; detects person names, addresses, and organisations in text. Requires the da_core_news_lg model (~500 MB) |

Archive scanning

| Module | Purpose |
|---|---|
| py7zr | 7-Zip archive support; allows the scanner to inspect .7z compressed files |

Desktop app packaging

| Module | Purpose |
|---|---|
| pywebview | Renders the Flask web UI inside a native OS window, so the macOS .app and Windows .exe builds do not require a browser |
| pystray | System tray icon integration for the desktop app builds |
| pyinstaller | Packages the Python application and all dependencies into a standalone executable |
| pyinstaller-hooks-contrib | Community-maintained hooks that help PyInstaller correctly bundle complex packages like spaCy and OpenCV |

Standard library modules (no installation needed)

Data storage

| Module | Purpose |
|---|---|
| sqlite3 | SQLite database; stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in ~/.gdpr_scanner.db |
| json | Config files, checkpoint files, language files, API request/response serialisation |
| zipfile | Database export/import archive creation and reading; also used in the PyInstaller build process |
| csv | CSV file scanning support in the Document Scanner |
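
As an illustration of the export/import role of zipfile, the sketch below packs a database file into a compressed archive and restores it. The archive layout (a single file stored under its own name) and the function names `export_db` and `import_db` are assumptions made for this example; the project's real archive format is not described here.

```python
import zipfile
from pathlib import Path


def export_db(db_path: Path, archive_path: Path) -> None:
    """Write the database file into a compressed zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store only the file name so no absolute path ends up in the archive.
        zf.write(db_path, arcname=db_path.name)


def import_db(archive_path: Path, target_dir: Path) -> Path:
    """Extract the single database file and return the path it was written to."""
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
        if len(names) != 1:
            raise ValueError(f"expected exactly one file in archive, found {len(names)}")
        return Path(zf.extract(names[0], path=target_dir))
```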

Security and hashing

| Module | Purpose |
|---|---|
| hashlib | SHA-256 hashing of CPR numbers before storage; raw CPR values are never written to the database |
| secrets | Cryptographically secure random values (used in auth state parameters) |
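
Hashing a CPR number before storage can be sketched as below. The normalisation step (dropping the hyphen and whitespace so both spellings give the same digest) and the function name `hash_cpr` are assumptions of this example, not confirmed project behaviour.

```python
import hashlib


def hash_cpr(cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number.

    Assumption of this sketch: separators are stripped first, so
    "010190-1234" and "0101901234" produce the same digest.
    """
    digits = "".join(ch for ch in cpr if ch in "0123456789")
    if len(digits) != 10:
        raise ValueError("a CPR number has exactly 10 digits")
    return hashlib.sha256(digits.encode("ascii")).hexdigest()
```

Only the returned digest would be written to the database; the raw string never leaves the function's caller.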

File system and paths

| Module | Purpose |
|---|---|
| pathlib | Cross-platform file and directory path handling throughout the codebase |
| tempfile | Temporary files for PDF and image processing; avoids leaving artefacts on disk |
| shutil | File copy and directory tree operations used in the build scripts |

Networking and email

| Module | Purpose |
|---|---|
| smtplib | SMTP email delivery for the headless report feature; supports STARTTLS and SMTPS/SSL |
| email | Email message construction (MIME) for the SMTP report feature |
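
The two modules typically work together as in the sketch below: `email` builds the MIME message and `smtplib` delivers it over either STARTTLS or SMTPS. Function names, parameters, and the generic attachment type are hypothetical choices for illustration.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender, recipient, subject, body, attachment: bytes, filename):
    """Build a MIME message with a plain-text body and one binary attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    msg.add_attachment(
        attachment, maintype="application", subtype="octet-stream", filename=filename
    )
    return msg


def send_report(msg, host, port, user, password, use_ssl=False):
    """Deliver the message via SMTPS (use_ssl=True) or plain SMTP upgraded with STARTTLS."""
    if use_ssl:
        with smtplib.SMTP_SSL(host, port) as smtp:
            smtp.login(user, password)
            smtp.send_message(msg)
    else:
        with smtplib.SMTP(host, port) as smtp:
            smtp.starttls()  # upgrade the connection before sending credentials
            smtp.login(user, password)
            smtp.send_message(msg)
```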

Text and pattern matching

| Module | Purpose |
|---|---|
| re | Regular expression engine; CPR pattern matching, phone numbers, IBANs, email addresses, Danish bank account numbers |
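
A CPR number has the shape DDMMYY-SSSS, with the hyphen often omitted. The pattern below is one plausible way to find candidates; it is an assumption for illustration, not the scanner's actual expression, and it checks only that the day and month are plausible.

```python
import re
from datetime import date

# Ten digits with an optional hyphen after the sixth, not embedded in a longer number.
CPR_RE = re.compile(r"(?<!\d)(\d{2})(\d{2})(\d{2})-?(\d{4})(?!\d)")


def find_cpr_candidates(text: str) -> list[str]:
    """Return substrings shaped like a CPR number with a valid day and month."""
    hits = []
    for match in CPR_RE.finditer(text):
        day, month = int(match.group(1)), int(match.group(2))
        try:
            date(2000, month, day)  # 2000 is a leap year, so 29 February passes
        except ValueError:
            continue
        hits.append(match.group(0))
    return hits
```

Filtering on the date part removes many false positives, such as order numbers that merely happen to be ten digits long.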

Concurrency

| Module | Purpose |
|---|---|
| threading | Background scan thread so the Flask web UI stays responsive during long scans |
| queue | Server-Sent Events message queue; passes scan results from the background thread to the browser |
| concurrent.futures | ProcessPoolExecutor for parallel OCR processing of multi-page PDFs |
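
The thread-plus-queue hand-off described in the table can be sketched with the standard library alone. The scan function here is a stand-in, and the message fields and sentinel object are hypothetical.

```python
import json
import queue
import threading

_DONE = object()  # sentinel marking the end of the scan


def run_scan(items, out: queue.Queue) -> None:
    """Stand-in for the real scan: push one result per item, then the sentinel."""
    for item in items:
        out.put({"item": item, "status": "scanned"})
    out.put(_DONE)


def sse_events(out: queue.Queue):
    """Yield Server-Sent Events frames until the sentinel arrives."""
    while True:
        msg = out.get()  # blocks until the worker produces something
        if msg is _DONE:
            break
        yield f"data: {json.dumps(msg)}\n\n"


results = queue.Queue()
worker = threading.Thread(target=run_scan, args=(["a.pdf", "b.docx"], results), daemon=True)
worker.start()
frames = list(sse_events(results))
worker.join()
```

In a Flask view, a generator like `sse_events` could be returned in a `Response` with the `text/event-stream` MIME type, so the browser receives each frame as soon as the worker produces it.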

I/O and streams

| Module | Purpose |
|---|---|
| io | In-memory byte streams for generating Excel and Word documents without writing to disk |
| struct | Binary data unpacking (used in some PDF processing paths) |

Date and time

| Module | Purpose |
|---|---|
| time | Unix timestamps for scan records, audit log entries, and token expiry tracking |
| datetime | Human-readable date/time formatting for reports, filenames, and retention cutoff calculations |
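
A retention cutoff combines both modules' roles: records carry Unix timestamps, and the cutoff is computed with date arithmetic. The sketch below assumes a retention period expressed in days; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, retention_days: int) -> float:
    """Unix timestamp before which a record counts as past retention."""
    return (now - timedelta(days=retention_days)).timestamp()


def is_expired(record_ts: float, now: datetime, retention_days: int) -> bool:
    """True if the record's timestamp is older than the cutoff."""
    return record_ts < retention_cutoff(now, retention_days)
```

Using timezone-aware `datetime` values avoids the cutoff shifting with the local time zone of the machine running the scan.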

System and process

| Module | Purpose |
|---|---|
| platform | Detects the operating system for macOS/Windows-specific code paths |
| subprocess | Launches Tesseract and Poppler as external processes for OCR and PDF rendering |
| argparse | CLI argument parsing for --headless, --reset-db, --export-db, --import-db etc. |
| sys | Python runtime access: sys.exit(), sys.path, sys.version |
| os | Environment variables and low-level file operations |
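
A parser for the flags named in the table might look like the sketch below. The flag names come from the table; everything else is assumed for illustration, including the program name, the help texts, and the choice that --export-db and --import-db take a path argument.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for the flags listed above (argument shapes are assumptions)."""
    parser = argparse.ArgumentParser(prog="gdpr_scanner")  # hypothetical program name
    parser.add_argument("--headless", action="store_true", help="run without the web UI")
    parser.add_argument("--reset-db", action="store_true", help="reset the database")
    parser.add_argument("--export-db", metavar="PATH", help="export the database to PATH")
    parser.add_argument("--import-db", metavar="PATH", help="import the database from PATH")
    return parser


args = build_parser().parse_args(["--headless", "--export-db", "backup.zip"])
```

argparse converts the hyphens to underscores, so the values are read as `args.headless`, `args.reset_db`, `args.export_db`, and `args.import_db`.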

Encoding and serialisation

| Module | Purpose |
|---|---|
| base64 | Encodes thumbnail images as base64 strings for embedding in JSON API responses |
| struct | Binary format parsing used in some document processing paths |
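
Embedding a thumbnail in a JSON response requires turning the image bytes into text first. The sketch below shows the idea; the field names and the use of a data URI are assumptions, not the project's actual response format.

```python
import base64
import json


def thumbnail_payload(item_id: str, png_bytes: bytes) -> str:
    """Return a JSON string carrying the thumbnail as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return json.dumps({"id": item_id, "thumbnail": f"data:image/png;base64,{encoded}"})
```

A data URI can be assigned directly to an image element's `src` in the browser, which is why it is used here.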

External system dependencies (not Python packages)

These are not Python packages, so pip does not install them. The installers (install_windows.ps1, install_macos.sh) set them up automatically; without the installers they must be installed separately.

| Tool | Purpose |
|---|---|
| Tesseract OCR | The OCR engine called by pytesseract; required for scanning image-based PDFs |
| Tesseract language packs | dan (Danish) and eng (English) language data files for Tesseract |
| Poppler | PDF rendering tools (pdftoppm, pdfinfo) required by pdf2image |
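
Whether these tools are reachable can be checked from Python before a scan starts. This is a minimal sketch that assumes the executables are on PATH under their usual names; the helper `missing_tools` is hypothetical.

```python
import shutil

# Executable names: the Tesseract CLI plus the two Poppler tools named above.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "pdfinfo")


def missing_tools(which=shutil.which) -> list[str]:
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]
```

An empty list means all three executables were found. Installed language data can be listed separately with `tesseract --list-langs`.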