Update DEPENDENCIES.md

2026-04-12 14:53:07 +02:00
commit e64d7eb958
parent 9c38188bb4
1 changed file with 45 additions and 17 deletions
--- a/DEPENDENCIES.md
+++ b/DEPENDENCIES.md
@@ -7,7 +7,7 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 ### Web server
 | Module | Purpose |
 |---|---|
-| `flask` | Web server and API routing for both the GDPRScanner UI |
+| `flask` | Web server and API routing for the GDPRScanner UI |

 ### Microsoft 365 authentication and API
 | Module | Purpose |
@@ -15,39 +15,64 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 | `msal` | Microsoft Authentication Library — handles OAuth2 device code flow (delegated) and client credentials (application) for Microsoft Graph API access |
 | `requests` | HTTP client used for all Microsoft Graph API calls |

+### Google Workspace scanning
+| Module | Purpose |
+|---|---|
+| `google-auth` | Service account authentication and domain-wide delegation for Google APIs |
+| `google-auth-httplib2` | HTTP transport adapter for `google-auth` |
+| `google-api-python-client` | Gmail API, Google Drive API, and Admin Directory API client |
+
+### SMB / file system scanning
+| Module | Purpose |
+|---|---|
+| `smbprotocol` | SMB2/3 network share scanning without requiring a mounted drive — used for Windows file server sources |
+| `keyring` | OS keychain credential storage for SMB passwords |
+| `python-dotenv` | `.env` file fallback for headless SMB credentials when no keychain is available |
+
 ### PDF handling
 | Module | Purpose |
 |---|---|
 | `pdfplumber` | Text extraction from PDFs with a selectable text layer — fast and accurate for native PDFs |
-| `pdf2image` | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
-| `pytesseract` | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
-| `pypdf` | PDF metadata reading and low-level page manipulation |
-| `reportlab` | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
 | `pymupdf` (fitz) | Physically removes the text layer from PDFs — preferred GDPR-compliant redaction method |
+| `pdf2image` *(optional)* | Converts PDF pages to images (via Poppler) for OCR processing of scanned/image-based PDFs |
+| `pytesseract` *(optional)* | Python wrapper for the Tesseract OCR engine — extracts text from rasterised PDF pages and images |
+| `pypdf` *(optional)* | PDF metadata reading and low-level page manipulation — used in the `document_scanner.py` redaction path |
+| `reportlab` *(optional)* | Fallback PDF redaction via overlay rendering — used when PyMuPDF is unavailable |
+
+> Optional packages are not in `requirements.txt`. Install them manually if you need OCR or the standalone `document_scanner.py` CLI.

 ### Document formats
 | Module | Purpose |
 |---|---|
 | `python-docx` | Read and write `.docx` Word documents; also used to generate the Article 30 Register of Processing Activities report |
 | `openpyxl` | Read and write `.xlsx` Excel files — used for the scan result export workbook |
-| `img2pdf` | Converts images to PDF for archiving redacted output |

 ### Image processing and face detection
 | Module | Purpose |
 |---|---|
 | `opencv-python` (cv2) | Face detection in images via Haar cascade classifiers; also used for face blurring during anonymisation |
 | `numpy` | Array operations required internally by OpenCV |
-| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, image resizing |
+| `Pillow` (PIL) | Image manipulation — thumbnail generation, format conversion, EXIF metadata extraction |

 ### NLP / Named Entity Recognition
 | Module | Purpose |
 |---|---|
 | `spacy` | NLP engine for Danish Named Entity Recognition — detects person names, addresses, and organisations in text. Requires the `da_core_news_lg` model (~500 MB) |

-### Archive scanning
+### Encryption
 | Module | Purpose |
 |---|---|
-| `py7zr` | 7-Zip archive support — allows the scanner to inspect `.7z` compressed files |
+| `cryptography` | Fernet symmetric encryption — encrypts SMTP passwords at rest in `~/.gdprscanner/smtp.json`; the Fernet key is derived from `~/.gdprscanner/machine_id` |
+
+### Scheduling
+| Module | Purpose |
+|---|---|
+| `APScheduler` | In-process background scheduler — drives the scheduled scan feature (`schedule.json`). Uses `BackgroundScheduler` with `CronTrigger` |
+
+### System monitoring
+| Module | Purpose |
+|---|---|
+| `psutil` | Available-memory probe in `scan_engine.py` — skips file downloads when free RAM drops below 300 MB to prevent OOM crashes on large tenants |

 ### Desktop app packaging
 | Module | Purpose |
@@ -64,16 +89,17 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 ### Data storage
 | Module | Purpose |
 |---|---|
-| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdpr_scanner.db` |
+| `sqlite3` | SQLite database — stores scan results, CPR index (hashed), dispositions, deletion audit log, and scan history in `~/.gdprscanner/scanner.db` |
 | `json` | Config files, checkpoint files, language files, API request/response serialisation |
 | `zipfile` | Database export/import archive creation and reading; also used in the PyInstaller build process |
-| `csv` | CSV file scanning support in the Document Scanner |
+| `csv` | CSV file scanning support |

 ### Security and hashing
 | Module | Purpose |
 |---|---|
 | `hashlib` | SHA-256 hashing of CPR numbers before storage — raw CPR values are never written to the database |
-| `secrets` | Cryptographically secure random values (used in auth state parameters) |
+| `secrets` | Cryptographically secure random values — used for viewer token generation and auth state parameters |
+| `uuid` | UUID generation for viewer tokens and scan session identifiers |

 ### File system and paths
 | Module | Purpose |
@@ -85,8 +111,9 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 ### Networking and email
 | Module | Purpose |
 |---|---|
-| `smtplib` | SMTP email delivery for the headless report feature — supports STARTTLS and SMTPS/SSL |
+| `smtplib` | SMTP email delivery for the scheduled report feature — supports STARTTLS and SMTPS/SSL |
 | `email` | Email message construction (MIME) for the SMTP report feature |
+| `socket` | UDP probe to determine the machine's LAN IP address — used to build routable share links for viewer tokens |

 ### Text and pattern matching
 | Module | Purpose |
@@ -99,12 +126,13 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 | `threading` | Background scan thread so the Flask web UI stays responsive during long scans |
 | `queue` | Server-Sent Events message queue — passes scan results from the background thread to the browser |
 | `concurrent.futures` | `ProcessPoolExecutor` for parallel OCR processing of multi-page PDFs |
+| `gc` | Explicit garbage collection after large scan batches to release memory promptly |

 ### I/O and streams
 | Module | Purpose |
 |---|---|
 | `io` | In-memory byte streams for generating Excel and Word documents without writing to disk |
-| `struct` | Binary data unpacking (used in some PDF processing paths) |
+| `struct` | Binary data unpacking used in some PDF processing paths |

 ### Date and time
 | Module | Purpose |
@@ -117,15 +145,15 @@ All Python modules used in the GDPR Scanner project, with a short explanation of
 |---|---|
 | `platform` | Detects the operating system for macOS/Windows-specific code paths |
 | `subprocess` | Launches Tesseract and Poppler as external processes for OCR and PDF rendering |
-| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db` etc. |
-| `sys` | Python runtime access — sys.exit(), sys.path, sys.version |
+| `argparse` | CLI argument parsing for `--headless`, `--reset-db`, `--export-db`, `--import-db`, etc. |
+| `sys` | Python runtime access — `sys.exit()`, `sys.path`, `sys.version` |
 | `os` | Environment variables and low-level file operations |
+| `logging` | Application-level logging — routes warnings and errors to stderr and rotating file handlers |

 ### Encoding and serialisation
 | Module | Purpose |
 |---|---|
 | `base64` | Encodes thumbnail images as base64 strings for embedding in JSON API responses |
-| `struct` | Binary format parsing used in some document processing paths |

 ---
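
The `hashlib` row above says CPR numbers are SHA-256 hashed before storage, so raw values never reach the database. A minimal sketch of what that step could look like follows. The function name `hash_cpr` and the hyphen-stripping normalisation are assumptions made for illustration; the diff does not show how the project normalises or hashes its input.

```python
import hashlib


def hash_cpr(cpr: str) -> str:
    """Return the SHA-256 hex digest of a CPR number (hypothetical helper).

    Assumption: the optional hyphen is stripped first, so "DDMMYY-SSSS"
    and "DDMMYYSSSS" map to the same digest.
    """
    normalised = cpr.strip().replace("-", "")
    # A CPR number has exactly 10 ASCII digits once the hyphen is removed.
    if len(normalised) != 10 or not (normalised.isascii() and normalised.isdigit()):
        raise ValueError("expected a 10-digit CPR number")
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

With a helper of this kind, only the 64-character hex digest would be stored and compared; looking up a CPR number means hashing the query value and matching it against the stored digests.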