Bugfixes in media detection
This commit is contained in:
parent
d42518dc81
commit
360eb1caed
@ -19,6 +19,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
|
|||||||
|
|
||||||
### Fixed
|
### Fixed
|
||||||
|
|
||||||
|
- **Audio and video files not appearing in local/SMB file scan** — `file_scanner.py` maintained its own hardcoded `DEFAULT_EXTENSIONS` set that was never updated when video and audio extensions were added to `cpr_detector.SUPPORTED_EXTS`. Fixed by importing `SUPPORTED_EXTS` from `cpr_detector` directly; `DEFAULT_EXTENSIONS` is now an alias for it. `cpr_detector.SUPPORTED_EXTS` is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
|
||||||
|
|
||||||
- **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
|
- **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|||||||
@ -20,6 +20,8 @@ python -m pytest tests/ -q
|
|||||||
|
|
||||||
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
|
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
|
||||||
|
|
||||||
|
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned across all sources. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS` so local/SMB scans stay in sync automatically. `scan_engine.py` uses it to gate M365/SharePoint/Teams file downloads. Do not maintain a separate extension list anywhere else.
|
||||||
|
|
||||||
**`_scan_bytes` injection pattern** — `scan_engine.py` defines a no-op stub for `_scan_bytes` / `_scan_bytes_timeout` at module level (avoids circular import). `gdpr_scanner.py` overwrites them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. This is intentional — do not try to import them directly in those modules.
|
**`_scan_bytes` injection pattern** — `scan_engine.py` defines a no-op stub for `_scan_bytes` / `_scan_bytes_timeout` at module level (avoids circular import). `gdpr_scanner.py` overwrites them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. This is intentional — do not try to import them directly in those modules.
|
||||||
|
|
||||||
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
|
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
|
||||||
|
|||||||
@ -658,12 +658,12 @@ See [SUGGESTIONS.md](SUGGESTIONS.md) for the full feature roadmap with implement
|
|||||||
| `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
|
| `app_config.py` | All persistence — profiles, settings, SMTP config, lang loading, Fernet encryption |
|
||||||
| `sse.py` | SSE broadcast queue and `_current_scan_id` |
|
| `sse.py` | SSE broadcast queue and `_current_scan_id` |
|
||||||
| `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
|
| `checkpoint.py` | Mid-scan checkpoint save/load, `_checkpoint_key()` |
|
||||||
| `cpr_detector.py` | CPR pattern matching and validation |
|
| `cpr_detector.py` | CPR pattern matching and validation. Defines `SUPPORTED_EXTS` — the single source of truth for which file extensions are scanned across all sources (M365, Google Drive, local/SMB). Also contains `VIDEO_EXTS` and `AUDIO_EXTS` subsets and the metadata extractors `_extract_video_metadata` / `_extract_audio_metadata`. |
|
||||||
| `document_scanner.py` | Core scanning, redaction, OCR, NER, and PII detection engine |
|
| `document_scanner.py` | Core scanning, redaction, OCR, NER, and PII detection engine |
|
||||||
| `gdpr_db.py` | SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history |
|
| `gdpr_db.py` | SQLite persistence layer — scan results, CPR index, PII hits, dispositions, scan history |
|
||||||
| `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
|
| `m365_connector.py` | Microsoft Graph API client — auth, token refresh, email/OneDrive/SharePoint/Teams fetchers, delete methods |
|
||||||
| `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
|
| `google_connector.py` | Google Workspace API client — Gmail, Drive, Admin SDK |
|
||||||
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. |
|
| `file_scanner.py` | Unified local + SMB/CIFS file iterator — `FileScanner.iter_files()` yields `(path, bytes, metadata)`. SMB reads use a 1-slot sliding-window `ThreadPoolExecutor` (`PREFETCH_WINDOW=1`) with a 60-second per-file timeout. `DEFAULT_EXTENSIONS` is imported from `cpr_detector.SUPPORTED_EXTS` (not a local hardcoded set) so the scannable extension list stays in sync automatically. |
|
||||||
| `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
|
| `scan_scheduler.py` | In-process APScheduler wrapper — multi-job scheduled scan engine |
|
||||||
| `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
|
| `templates/index.html` | Single-page HTML shell — Jinja2 template. Two variables: `app_version`, `lang_json`. |
|
||||||
| `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
|
| `static/style.css` | All application CSS — custom properties, layout, components, light/dark themes |
|
||||||
|
|||||||
@ -24,6 +24,8 @@ import hashlib
|
|||||||
from pathlib import Path, PurePosixPath
|
from pathlib import Path, PurePosixPath
|
||||||
from typing import Iterator
|
from typing import Iterator
|
||||||
|
|
||||||
|
from cpr_detector import SUPPORTED_EXTS as DEFAULT_EXTENSIONS
|
||||||
|
|
||||||
# ── Optional dependency flags ─────────────────────────────────────────────────
|
# ── Optional dependency flags ─────────────────────────────────────────────────
|
||||||
|
|
||||||
try:
|
try:
|
||||||
@ -58,19 +60,8 @@ except ImportError:
|
|||||||
|
|
||||||
KEYCHAIN_SERVICE = "gdpr-scanner-nas"
|
KEYCHAIN_SERVICE = "gdpr-scanner-nas"
|
||||||
|
|
||||||
# File extensions passed through to _scan_bytes(). Matches SUPPORTED_EXTS in
|
# DEFAULT_EXTENSIONS is imported from cpr_detector.SUPPORTED_EXTS — single source of truth.
|
||||||
# gdpr_scanner.py; kept here too so FileScanner can filter without importing it.
|
# Adding a new file type to cpr_detector.py automatically extends local/SMB scans too.
|
||||||
DEFAULT_EXTENSIONS = {
|
|
||||||
".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
|
|
||||||
".txt", ".eml", ".msg",
|
|
||||||
".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
|
|
||||||
".heic", ".heif",
|
|
||||||
}
|
|
||||||
|
|
||||||
# Extensions for local/SMB file scans — PDFs now included; OCR runs in a spawned
|
|
||||||
# subprocess with a 60-second hard timeout via _scan_bytes_timeout so hanging
|
|
||||||
# Tesseract/Poppler processes can never block the scan thread indefinitely.
|
|
||||||
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
|
|
||||||
|
|
||||||
# Maximum file size to load into memory (bytes). Files larger than this are
|
# Maximum file size to load into memory (bytes). Files larger than this are
|
||||||
# skipped with a warning — same guard used by the M365 attachment scanner.
|
# skipped with a warning — same guard used by the M365 attachment scanner.
|
||||||
@ -147,7 +138,7 @@ def store_smb_password(smb_host: str, smb_user: str,
|
|||||||
class FileScanner:
|
class FileScanner:
|
||||||
"""Unified local + SMB/CIFS file iterator."""
|
"""Unified local + SMB/CIFS file iterator."""
|
||||||
|
|
||||||
FILE_SCAN_EXTENSIONS = FILE_SCAN_EXTENSIONS # excludes .pdf
|
FILE_SCAN_EXTENSIONS = DEFAULT_EXTENSIONS
|
||||||
"""Unified iterator over local paths and SMB/CIFS network shares.
|
"""Unified iterator over local paths and SMB/CIFS network shares.
|
||||||
|
|
||||||
Usage::
|
Usage::
|
||||||
@ -209,7 +200,7 @@ class FileScanner:
|
|||||||
|
|
||||||
Args:
|
Args:
|
||||||
extensions: Set of lowercase extensions to include, e.g. {".pdf", ".docx"}.
|
extensions: Set of lowercase extensions to include, e.g. {".pdf", ".docx"}.
|
||||||
Defaults to DEFAULT_EXTENSIONS.
|
Defaults to DEFAULT_EXTENSIONS (cpr_detector.SUPPORTED_EXTS).
|
||||||
progress_cb: Optional callable(rel_path) called before each file is read,
|
progress_cb: Optional callable(rel_path) called before each file is read,
|
||||||
so the caller can update a progress indicator.
|
so the caller can update a progress indicator.
|
||||||
|
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user