Bugfixes in media detection

StyxX65 2026-04-21 21:42:54 +02:00
parent d42518dc81
commit 360eb1caed
4 changed files with 12 additions and 17 deletions

@@ -19,6 +19,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 ### Fixed
+- **Audio and video files not appearing in local/SMB file scan** — `file_scanner.py` maintained its own hardcoded `DEFAULT_EXTENSIONS` set that was never updated when video and audio extensions were added to `cpr_detector.SUPPORTED_EXTS`. Fixed by importing `SUPPORTED_EXTS` from `cpr_detector` directly; `DEFAULT_EXTENSIONS` is now an alias for it. `cpr_detector.SUPPORTED_EXTS` is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
 - **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
 ---

@@ -20,6 +20,8 @@ python -m pytest tests/ -q
 **Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
+**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned across all sources. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS` so local/SMB scans stay in sync automatically. `scan_engine.py` uses it to gate M365/SharePoint/Teams file downloads. Do not maintain a separate extension list anywhere else.
 **`_scan_bytes` injection pattern** — `scan_engine.py` defines a no-op stub for `_scan_bytes` / `_scan_bytes_timeout` at module level (avoids circular import). `gdpr_scanner.py` overwrites them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. This is intentional — do not try to import them directly in those modules.
 **Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
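
The `_scan_bytes` injection pattern described in the hunk above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the modules are simulated with `types.ModuleType`, and `scan_download`, `real_scan_bytes`, and the body of the lazy `__getattr__` hook are hypothetical stand-ins, since the real implementations are not shown in this diff.

```python
import types

# Stand-in for scan_engine.py: it defines a no-op stub at module level so it
# never has to import the detector (which would create a circular import).
scan_engine = types.ModuleType("scan_engine")


def _noop_scan_bytes(content, filename):
    """Placeholder used until the real scanner is injected."""
    return []


scan_engine._scan_bytes = _noop_scan_bytes


def scan_download(content, filename):
    # Hypothetical caller. The function is looked up on the module at call
    # time, so a later overwrite of the attribute takes effect here.
    return scan_engine._scan_bytes(content, filename)


def real_scan_bytes(content, filename):
    # Hypothetical stand-in for the real cpr_detector implementation.
    return [f"{filename}: {len(content)} bytes scanned"]


before = scan_download(b"abc", "a.txt")      # still the stub
scan_engine._scan_bytes = real_scan_bytes    # startup wiring (gdpr_scanner's job)
after = scan_download(b"abc", "a.txt")       # now the injected function

# Lazy resolution through a module-level __getattr__ (PEP 562). How the real
# gdpr_scanner.__getattr__ resolves names is an assumption here.
gdpr_scanner = types.ModuleType("gdpr_scanner")


def _lazy_getattr(name):
    if name == "_scan_bytes":
        return getattr(scan_engine, name)
    raise AttributeError(name)


gdpr_scanner.__getattr__ = _lazy_getattr
```

The point of the sketch is the call-time lookup: a direct `from scan_engine import _scan_bytes` would capture the stub at import time and never see the injected function, which is presumably why the note warns against importing these names directly.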

@@ -658,12 +658,12 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
 | `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
 | `sse.py` | SSE broadcast queue and `_current_scan_id` |
 | `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
-| `cpr_detector.py` | CPR pattern matching and validation |
+| `cpr_detector.py` | CPR pattern matching and validation. Defines `SUPPORTED_EXTS` — the single source of truth for which file extensions are scanned across all sources (M365, Google Drive, local/SMB). Also contains `VIDEO_EXTS` and `AUDIO_EXTS` subsets and the metadata extractors `_extract_video_metadata` / `_extract_audio_metadata`. |
 | `document_scanner.py` | Core scanning, redaction, OCR, NER, and PII detection engine |
 | `gdpr_db.py` | SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history |
 | `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
 | `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
-| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. |
+| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. |
 | `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
 | `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
 | `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
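
The `file_scanner.py` row above mentions a 1-slot sliding-window `ThreadPoolExecutor` with a per-file timeout. One way such a prefetch loop could look is sketched below. It is an assumption-laden illustration, not the actual `FileScanner.iter_files()` code: `iter_prefetched`, `_collect`, `READ_TIMEOUT_S`, and the `(path, data, error)` result shape are hypothetical; only `PREFETCH_WINDOW = 1` and the 60-second figure come from the table.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

PREFETCH_WINDOW = 1   # value quoted in the table row
READ_TIMEOUT_S = 60   # per-file timeout quoted in the row; the name is hypothetical


def _collect(path, future, timeout):
    """Wait for one read and return (path, data, error) instead of raising."""
    try:
        return path, future.result(timeout=timeout), None
    except FutureTimeout:
        future.cancel()  # only has an effect if the read has not started yet
        return path, None, "timeout"
    except OSError as exc:
        return path, None, str(exc)


def iter_prefetched(paths, read_file, window=PREFETCH_WINDOW, timeout=READ_TIMEOUT_S):
    """Yield results in order while up to `window` reads run ahead of the caller."""
    pool = ThreadPoolExecutor(max_workers=max(1, window))
    pending = deque()
    try:
        for path in paths:
            pending.append((path, pool.submit(read_file, path)))
            if len(pending) > window:
                yield _collect(*pending.popleft(), timeout)
        while pending:  # drain what is still in flight
            yield _collect(*pending.popleft(), timeout)
    finally:
        # Do not block on a hung read; queued reads are dropped (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)


def fake_read(path):
    # Hypothetical reader standing in for an SMB read.
    if path.endswith(".missing"):
        raise FileNotFoundError(path)
    return path.encode()


results = list(iter_prefetched(["a.pdf", "b.missing", "c.mp3"], fake_read))
```

Note that a thread blocked inside a read cannot be killed from Python; in this sketch the timeout only lets the iterator move on to the next file, and the timeout is measured from the moment the consumer starts waiting rather than from the start of the read.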

@@ -24,6 +24,8 @@ import hashlib
 from pathlib import Path, PurePosixPath
 from typing import Iterator
+from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
 # ── Optional dependency flags ─────────────────────────────────────────────────
 try:
@@ -58,19 +60,8 @@ except ImportError:
 KEYCHAIN_SERVICE = "gdpr-scanner-nas"
-# File extensions passed through to _scan_bytes(). Matches SUPPORTED_EXTS in
-# gdpr_scanner.py; kept here too so FileScanner can filter without importing it.
-DEFAULT_EXTENSIONS = {
-    ".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
-    ".txt", ".eml", ".msg",
-    ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
-    ".heic", ".heif",
-}
-# Extensions for local/SMB file scans — PDFs now included; OCR runs in a spawned
-# subprocess with a 60-second hard timeout via _scan_bytes_timeout so hanging
-# Tesseract/Poppler processes can never block the scan thread indefinitely.
-FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
+# DEFAULT_EXTENSIONS is imported from cpr_detector.SUPPORTED_EXTS — single source of truth.
+# Adding a new file type to cpr_detector.py automatically extends local/SMB scans too.
 # Maximum file size to load into memory (bytes). Files larger than this are
 # skipped with a warning — same guard used by the M365 attachment scanner.
@@ -147,7 +138,7 @@ def store_smb_password(smb_host: str, smb_user: str,
 class FileScanner:
     """Unified local + SMB/CIFS file iterator."""
-    FILE_SCAN_EXTENSIONS = FILE_SCAN_EXTENSIONS  # excludes .pdf
+    FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
     """Unified iterator over local paths and SMB/CIFS network shares.
     Usage::
@@ -209,7 +200,7 @@ class FileScanner:
         Args:
             extensions: Set of lowercase extensions to include, e.g. {".pdf", ".docx"}.
-                Defaults to DEFAULT_EXTENSIONS.
+                Defaults to DEFAULT_EXTENSIONS (cpr_detector.SUPPORTED_EXTS).
             progress_cb: Optional callable(rel_path) called before each file is read,
                 so the caller can update a progress indicator.
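
The effect of the import alias in this file can be illustrated with a small, self-contained sketch. It assumes a stand-in `cpr_detector` module built with `types.ModuleType`; the extensions placed in it and the `is_scannable` helper are hypothetical, because the real `SUPPORTED_EXTS` contents and the filter inside `FileScanner` are not part of this diff.

```python
import sys
import types
from pathlib import PurePosixPath

# Hypothetical stand-in for cpr_detector.py, registered so the import below works.
cpr_detector = types.ModuleType("cpr_detector")
cpr_detector.SUPPORTED_EXTS = {".pdf", ".docx", ".mp3", ".mp4"}  # illustrative values
sys.modules["cpr_detector"] = cpr_detector

# The line added by this commit: an alias, not a copy.
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS


def is_scannable(path, extensions=None):
    """Hypothetical filter: compare the lowercased suffix against the extension set."""
    exts = DEFAULT_EXTENSIONS if extensions is None else extensions
    return PurePosixPath(path).suffix.lower() in exts


same_object = DEFAULT_EXTENSIONS is cpr_detector.SUPPORTED_EXTS

# Extending the set in the detector is visible through the alias immediately.
cpr_detector.SUPPORTED_EXTS.add(".wav")
wav_ok = is_scannable("share/recordings/call.WAV")
exe_ok = is_scannable("share/tools/setup.exe")
```

Because both names refer to the same set object, in-place additions propagate. Rebinding `cpr_detector.SUPPORTED_EXTS` to a new set at runtime would not update the alias, so the single-source-of-truth guarantee holds as long as the set is defined once at import time.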