diff --git a/CHANGELOG.md b/CHANGELOG.md
index ff72715..ab45fbd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,10 +7,15 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 
 ---
 
-## [Unreleased]
+## [1.6.15] — 2026-04-12
 
 ### Added
 
+- **Scan filter options for student environments** — two new profile options reduce noise when scanning student accounts:
+  - **Ignore GPS in images** (`skip_gps_images`) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
+  - **Min. CPR count per file** (`min_cpr_count`, default 1) — a file is only flagged if it contains at least this many *distinct* CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
+  - Both options are saved in profiles and editable in the Profile Manager editor.
+
 - **GitHub Actions CI/CD — macOS build** — `.github/workflows/build.yml` now also builds a macOS `.app` bundle (`macos-15`, Apple Silicon ARM64) on every push to `main` and on `v*` tags. Released as `GDPRScanner_macos_arm64.zip`. (Originally `macos-13` / Intel, changed when GitHub retired that runner.)
 
 ### Fixed
diff --git a/CLAUDE.md b/CLAUDE.md
index 82c2451..a53dc7d 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -61,6 +61,15 @@ Read-only access for DPOs and reviewers. Key invariants:
 - **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height is controlled entirely by `_fitSourcesPanel()` at runtime.
 - **Do not call `_fitSourcesPanel()` before the panel has rendered** — `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
 
+## Scan filter options — scan_engine.py
+
+Both options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan).
+
+- **`skip_gps_images` (bool, default `false`)** — When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii` which rechecks `pii_fields` and `author` when GPS is skipped.
+- **`min_cpr_count` (int, default `1`)** — Minimum number of **distinct** CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(cprs))` to preserve order. Files with faces or EXIF PII are still flagged regardless of CPR count — the threshold gates only CPR-based hits.
+- **File scan** reads both from `source` dict keys (passed directly from the `/api/file_scan/start` payload). **M365 scan** reads both from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate.
+- **UI:** sidebar controls `#optSkipGps` (toggle) and `#optMinCpr` (number); profile editor controls `#peOptSkipGps` and `#peOptMinCpr`. Both are saved/loaded by `profiles.js`.
+
 ## Memory management — scan_engine.py
 
 Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
diff --git a/README.md b/README.md
index 9a07116..c03bfb2 100644
--- a/README.md
+++ b/README.md
@@ -123,9 +123,10 @@ A date-from picker limits the scan to items modified after the selected date. Qu
 | Scan attachments | On | Scan PDF/Word/Excel attachments inside emails |
 | Max attachment size | **20 MB** | Skip attachments larger than this threshold |
 | Max emails per user | **2000** | Cap per mailbox to avoid very long scans |
-| **Δ Delta scan** | Off | Fetch only changed items since the last scan (see [Delta scan](#delta-scan) below) |
 | **Δ Delta scan** | Off | Fetch only changed items since the last scan — hover the **?** for details (see [Delta scan](#delta-scan) below) |
-| ** Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
+| **Scan photos for faces** | Off | Detect faces in image files and flag as Art. 9 biometric data — hover the **?** for details (see [Photo scanning](#photo--biometric-scanning) below) |
+| **Ignore GPS in images** | Off | Skip images whose only PII signal is an embedded GPS coordinate. Useful for student scans where smartphones embed location in every camera photo. GPS is still shown in the detail card if the image is flagged for another reason (faces, EXIF author). |
+| **Min. CPR count per file** | **1** | Only flag a file if it contains at least this many *distinct* CPR numbers. Set to 2 to suppress false positives in student scans (e.g. a student's own consent form with a single CPR) while still reporting class lists and grade sheets with multiple CPRs. |
 | **Retention policy** | Off | Flag items older than N years — hover the **?** for details (see [Retention policy](#retention-policy-enforcement)) |
 
 #### Results grid
diff --git a/VERSION b/VERSION
index 5577648..7e84a78 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.6.14
+1.6.15
diff --git a/docs/manuals/MANUAL-DA.md b/docs/manuals/MANUAL-DA.md
index b878bfa..2ed905c 100644
--- a/docs/manuals/MANUAL-DA.md
+++ b/docs/manuals/MANUAL-DA.md
@@ -496,6 +496,10 @@ Disse indstillinger findes i venstre panel under **Indstillinger**:
 
 **Søg efter ansigter i billeder** — langsommere scanning, der registrerer fotografier med genkendelige menneskelige ansigter. Markerer dem som artikel 9 biometriske data. Anbefales til skoler, der opbevarer elevfotos.
 
+**Ignorer GPS i billeder** — når aktiveret, flagges billeder ikke, hvis GPS-koordinater i billedets metadata er det eneste PII-signal. Nyttigt ved scanning af elevkonti: smartphones indlejrer automatisk GPS-koordinater i alle kamerabilleder, hvilket ellers ville generere mange lavprioriterede fund i en skolekontekst. Hvis et billede allerede er flagget af en anden årsag (ansigter, EXIF-forfatterfelter), vises GPS-koordinaterne stadig i detaljekortet.
+
+**Min. CPR-antal pr. fil** — en fil flagges kun, hvis den indeholder mindst dette antal *distinkte* CPR-numre. Standardværdien er 1 (nuværende adfærd). Sæt til 2 for at undgå falske positive ved elevscanninger: en elevs samtykkeerklæring eller indmeldelsesformular indeholder typisk kun elevens eget CPR-nummer, mens en klasseliste eller karakteroversigt med flere elevers CPR-numre stadig vil blive rapporteret.
+
 **Opbevaringspolitik** — når aktiveret, markeres elementer ældre end det angivne antal år som forældet. Regnskabsårets afslutning bestemmer, hvordan skæringsdatoen beregnes:
 
 | Indstilling | Beregning af skæringsdato |
diff --git a/docs/manuals/MANUAL-EN.md b/docs/manuals/MANUAL-EN.md
index 07c1f6a..2927b40 100644
--- a/docs/manuals/MANUAL-EN.md
+++ b/docs/manuals/MANUAL-EN.md
@@ -496,6 +496,10 @@ These options are in the left sidebar under **Indstillinger**:
 
 **Scan photos for faces** — slower scan that detects photographs containing recognisable human faces. Flags them as Article 9 biometric data. Recommended for schools storing student photos.
 
+**Ignore GPS in images** — when enabled, images whose only PII signal is an embedded GPS location are not flagged. Useful when scanning student accounts: smartphones embed GPS coordinates in every photo taken with the camera app, which would otherwise generate large numbers of flags that are low-priority for a school context. If an image is already flagged for another reason (faces, EXIF author field), the GPS coordinate is still shown in the detail card.
+
+**Min. CPR count per file** — only flag a file if it contains at least this many *distinct* CPR numbers. The default is 1 (current behaviour). Setting it to 2 avoids false positives in student scans: a student's own consent form or registration document typically contains only their own CPR number, while a class list or grade sheet containing multiple students' CPRs will still be reported.
+
 **Retention policy** — when enabled, marks items older than the specified number of years as overdue. The fiscal year end setting determines how the cutoff date is calculated:
 
 | Option | Cutoff date calculation |
diff --git a/lang/da.json b/lang/da.json
index 719c8dc..698c5fe 100644
--- a/lang/da.json
+++ b/lang/da.json
@@ -559,6 +559,10 @@
   "m365_db_import_run": "Importer",
   "m365_opt_scan_photos": "Søg efter ansigter i billeder",
   "m365_opt_scan_photos_hint": "Markerer billeder med registrerede ansigter som Art. 9 biometriske data. Langsommere — aktivér efter behov.",
+  "m365_opt_skip_gps": "Ignorer GPS i billeder",
+  "m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.",
+  "m365_opt_min_cpr": "Min. CPR-antal pr. fil",
+  "m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
   "m365_filter_photo_only": "📷 Billeder / biometrisk",
   "m365_badge_faces": "ansigter",
   "a30_photo_items": "Billeder med registrerede ansigter (Art. 9 biometrisk)",
diff --git a/lang/de.json b/lang/de.json
index 70ca97c..d4ba788 100644
--- a/lang/de.json
+++ b/lang/de.json
@@ -559,6 +559,10 @@
   "m365_db_import_run": "Importieren",
   "m365_opt_scan_photos": "Fotos nach Gesichtern durchsuchen",
   "m365_opt_scan_photos_hint": "Markiert Bilder mit erkannten Gesichtern als biometrische Daten gem. Art. 9. Langsamer — bei Bedarf aktivieren.",
+  "m365_opt_skip_gps": "GPS in Bildern ignorieren",
+  "m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.",
+  "m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei",
+  "m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
   "m365_filter_photo_only": "📷 Fotos / biometrisch",
   "m365_badge_faces": "Gesichter",
   "a30_photo_items": "Fotos mit erkannten Gesichtern (Art. 9 biometrisch)",
diff --git a/lang/en.json b/lang/en.json
index 4a970c3..600cbc6 100644
--- a/lang/en.json
+++ b/lang/en.json
@@ -559,6 +559,10 @@
   "m365_db_import_run": "Import",
   "m365_opt_scan_photos": "Scan photos for faces",
   "m365_opt_scan_photos_hint": "Flags images with detected faces as Art. 9 biometric data. Slower — opt in.",
+  "m365_opt_skip_gps": "Ignore GPS in images",
+  "m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.",
+  "m365_opt_min_cpr": "Min. CPR count per file",
+  "m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
   "m365_filter_photo_only": "📷 Photos / biometric",
   "m365_badge_faces": "faces",
   "a30_photo_items": "Photos with detected faces (Art. 9 biometric)",
diff --git a/scan_engine.py b/scan_engine.py
index d8be012..f61169d 100644
--- a/scan_engine.py
+++ b/scan_engine.py
@@ -164,8 +164,10 @@ def run_file_scan(source: dict):
     smb_domain  = source.get("smb_domain") or ""
     keychain_key= source.get("keychain_key") or None
     smb_password= source.get("smb_password") or None
-    scan_photos = bool(source.get("scan_photos", False))
-    max_mb      = int(source.get("max_file_mb", 50))
+    scan_photos     = bool(source.get("scan_photos", False))
+    skip_gps_images = bool(source.get("skip_gps_images", False))
+    min_cpr_count   = max(1, int(source.get("min_cpr_count", 1)))
+    max_mb          = int(source.get("max_file_mb", 50))
 
     if not FILE_SCANNER_OK:
         broadcast("scan_error", {"file": label, "error": "file_scanner.py not found"})
@@ -243,7 +245,14 @@ def run_file_scan(source: dict):
             _face_count = _detect_photo_faces(content, rel_path)
             _exif = _extract_exif(content, rel_path)
 
-            if not cprs and _face_count == 0 and not _exif.get("has_pii"):
+            # Apply filters: distinct CPR threshold and GPS suppression
+            _distinct_cprs = list(dict.fromkeys(cprs))  # preserve order, deduplicate
+            _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
+            _exif_has_pii = _exif.get("has_pii") and (
+                not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
+            )
+
+            if not (_cpr_qualifies and cprs) and _face_count == 0 and not _exif_has_pii:
                 continue
 
             # Build card metadata
@@ -256,9 +265,9 @@ def run_file_scan(source: dict):
             _sc = _check_special_category(_file_text, cprs)
             if _face_count > 0 and "biometric" not in _sc:
                 _sc = sorted(_sc + ["biometric"])
-            if _exif.get("gps") and "gps_location" not in _sc:
+            if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
                 _sc = sorted(_sc + ["gps_location"])
-            if _exif.get("has_pii") and "exif_pii" not in _sc:
+            if _exif_has_pii and "exif_pii" not in _sc:
                 _sc = sorted(_sc + ["exif_pii"])
 
             # Thumbnail for images
@@ -389,6 +398,8 @@ def run_scan(options: dict):
     max_emails     = int(scan_opts.get("max_emails", 2000))
     delta_enabled  = bool(scan_opts.get("delta", False))
     scan_photos    = bool(scan_opts.get("scan_photos", False))  # biometric photo scan (#9)
+    skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
+    min_cpr_count  = max(1, int(scan_opts.get("min_cpr_count", 1)))
 
     # Delta token state — loaded once, updated per-source, saved on completion
     delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@@ -1079,8 +1090,15 @@ def run_scan(options: dict):
             _face_count = _detect_photo_faces(content, name)
             _exif = _extract_exif(content, name)
 
-            # Flag item if CPRs found, faces detected, or EXIF PII found
-            if cprs or _face_count > 0 or _exif.get("has_pii"):
+            # Apply filters: distinct CPR threshold and GPS suppression
+            _distinct_cprs = list(dict.fromkeys(cprs))  # preserve order, deduplicate
+            _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
+            _exif_has_pii = _exif.get("has_pii") and (
+                not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
+            )
+
+            # Flag item if CPRs found (above threshold), faces detected, or EXIF PII found
+            if (_cpr_qualifies and cprs) or _face_count > 0 or _exif_has_pii:
                 # Make thumbnail
                 if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
                     thumb = _make_thumb(content, name)
@@ -1109,9 +1127,9 @@ def run_scan(options: dict):
                 # the category even when no CPR is present in the file.
                 if _face_count > 0 and "biometric" not in _sc:
                     _sc = sorted(_sc + ["biometric"])
-                if _exif.get("gps") and "gps_location" not in _sc:
+                if _exif.get("gps") and not skip_gps_images and "gps_location" not in _sc:
                     _sc = sorted(_sc + ["gps_location"])
-                if _exif.get("has_pii") and "exif_pii" not in _sc:
+                if _exif_has_pii and "exif_pii" not in _sc:
                     _sc = sorted(_sc + ["exif_pii"])
                 meta["_special_category"] = _sc
                 meta["_face_count"] = _face_count
diff --git a/static/js/profiles.js b/static/js/profiles.js
index 5656a7b..62f2346 100644
--- a/static/js/profiles.js
+++ b/static/js/profiles.js
@@ -122,6 +122,16 @@ function _applyProfile(profile) {
     if (el) el.checked = opts.scan_photos;
   }
 
+  if (opts.skip_gps_images !== undefined) {
+    const el = document.getElementById('optSkipGps');
+    if (el) el.checked = opts.skip_gps_images;
+  }
+
+  if (opts.min_cpr_count !== undefined) {
+    const el = document.getElementById('optMinCpr');
+    if (el) el.value = opts.min_cpr_count;
+  }
+
   // ── Date filter ───────────────────────────────────────────────────────────
   const days = opts.older_than_days;
   if (days !== undefined) {
@@ -395,6 +405,8 @@ function _openEditorForProfile(profile) {