Two bugs in the abort mechanism:

1. POST /api/scan/stop only set state._scan_abort (the M365/file abort event) and never touched state._google_scan_abort. It now sets both.
2. _check_abort() inside _run_google_scan imported gdpr_scanner._scan_abort (= state._scan_abort, the M365 event) instead of using the module-level _scan_abort alias (= state._google_scan_abort). As a result, the dedicated /api/google/scan/cancel endpoint, which correctly sets _google_scan_abort, was silently ignored by the scan loop. Fixed to use the module-level alias consistently.

Also aligned the end-of-scan checkpoint-clear check.
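A minimal sketch of the two-event abort pattern described in this message, assuming both abort flags are `threading.Event` objects. The names `scan_abort`, `google_scan_abort`, `stop_all_scans`, and `run_google_scan` are hypothetical stand-ins for illustration, not the project's actual code.

```python
import threading

# Hypothetical stand-ins for state._scan_abort and state._google_scan_abort.
scan_abort = threading.Event()         # signals M365/file scans
google_scan_abort = threading.Event()  # signals Google scans


def stop_all_scans() -> None:
    """Roughly what a global stop handler needs to do: signal both events."""
    scan_abort.set()
    google_scan_abort.set()


def run_google_scan(items, abort_event=google_scan_abort):
    """Process items until the Google-specific abort event is set."""
    processed = []
    for item in items:
        # Check the event that the Google cancel path sets, not the M365 one.
        if abort_event.is_set():
            break
        processed.append(item)
    return processed
```

If the loop checked `scan_abort` instead, setting `google_scan_abort` would have no effect on it, which is the failure mode the second bug describes.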

StyxX65 2026-05-28 10:20:22 +02:00
parent 7ffd8370f4
commit c820d6f6db
8 changed files with 28 additions and 3 deletions


@@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
 ### Added
+- **CPR-only mode** — a new `cpr_only` scan option (sidebar toggle `#optCprOnly`, profile editor `#peOptCprOnly`) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. The flag already detected is still shown on cards when `cpr_only=false` (default). Gated in all three engines: file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
 - **OCR language override** — a new `ocr_lang` scan option (sidebar select `#optOcrLang`, profile editor `#peOptOcrLang`) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng` (default), `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. The setting flows from the UI through the profile, into all three scan engines (M365 `_scan_bytes_timeout`, M365 attachments `_scan_bytes`, M365 files `_scan_bytes`, Google `_scan_bytes` for both Gmail and Drive). The `lang` parameter is threaded through `cpr_detector._scan_bytes` → `document_scanner.scan_pdf` / `scan_image` and the spawned PDF-OCR subprocess worker. The OCR cache key already included `lang`, so per-language results are cached independently.
 - **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
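The atomic rewrite mentioned in the redaction entry (temp file in the same directory, then a move over the original) can be sketched as follows. This is an illustration under stated assumptions, not the code in `routes/export.py`: the helper name `atomic_rewrite` is hypothetical, and it handles plain text only.

```python
import os
import tempfile


def atomic_rewrite(path: str, new_text: str) -> None:
    """Replace the contents of `path` without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the temp file in the same directory matters because a rename across filesystems falls back to copy-and-delete, which loses the all-or-nothing property.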


@@ -577,6 +577,8 @@
 "m365_badge_emails": "e-mail",
 "m365_badge_phones": "tlf.",
 "m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
+"m365_opt_cpr_only": "Kun CPR-tilstand",
+"m365_opt_cpr_only_hint": "Flagger kun filer med CPR-numre. Filer med kun e-mailadresser, telefonnumre, ansigter eller EXIF-metadata ignoreres.",
 "m365_opt_ocr_lang": "OCR-sprog",
 "m365_opt_ocr_lang_hint": "Tesseract-sprogpakke(r) der bruges ved scanning af scannede PDF'er og billeder. Sprogpakker skal være installeret på serveren (f.eks. tesseract-ocr-dan). Flere pakker: dan+eng.",
 "m365_filter_photo_only": "📷 Billeder / biometrisk",


@@ -577,6 +577,8 @@
 "m365_badge_emails": "E-Mail",
 "m365_badge_phones": "Tel.",
 "m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
+"m365_opt_cpr_only": "Nur-CPR-Modus",
+"m365_opt_cpr_only_hint": "Markiert nur Dateien mit CPR-Nummern. Dateien mit nur E-Mail-Adressen, Telefonnummern, Gesichtern oder EXIF-Metadaten werden ignoriert.",
 "m365_opt_ocr_lang": "OCR-Sprache",
 "m365_opt_ocr_lang_hint": "Tesseract-Sprachpaket(e) für das Scannen von gescannten PDFs und Bildern. Pakete müssen auf dem Server installiert sein (z.B. tesseract-ocr-dan). Mehrere Pakete: dan+eng.",
 "m365_filter_photo_only": "📷 Fotos / biometrisch",


@@ -577,6 +577,8 @@
 "m365_badge_emails": "email",
 "m365_badge_phones": "phone",
 "m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
+"m365_opt_cpr_only": "CPR-only mode",
+"m365_opt_cpr_only_hint": "Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, detected faces, or EXIF metadata are skipped.",
 "m365_opt_ocr_lang": "OCR language",
 "m365_opt_ocr_lang_hint": "Tesseract language pack(s) used when scanning scanned PDFs and images. Language packs must be installed on the server (e.g. tesseract-ocr-dan). Multiple packs: dan+eng.",
 "m365_filter_photo_only": "📷 Photos / biometric",


@@ -316,7 +316,7 @@ def run_file_scan(source: dict):
 not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
 )
-if not (_cpr_qualifies and cprs) and not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii:
+if not (_cpr_qualifies and cprs) and (cpr_only or (not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii)):
 continue
 # Build card metadata
@@ -477,6 +477,7 @@ def run_scan(options: dict):
 skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
 min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1)))
 ocr_lang = str(scan_opts.get("ocr_lang", "dan+eng")) or "dan+eng"
+cpr_only = bool(scan_opts.get("cpr_only", False))
 scan_emails = bool(scan_opts.get("scan_emails", False))
 scan_phones = bool(scan_opts.get("scan_phones", False))
@@ -1145,7 +1146,7 @@ def run_scan(options: dict):
 _distinct_emails = list(dict.fromkeys(e["formatted"] for e in all_emails))
 _distinct_phones = list(dict.fromkeys(p["formatted"] for p in all_phones))
-if all_cprs or _distinct_emails or _distinct_phones:
+if all_cprs or (not cpr_only and (_distinct_emails or _distinct_phones)):
 meta["_thumb"] = _placeholder_svg(".eml", subject)
 meta["_thumb_is_jpeg"] = False
 meta["_attachments"] = att_results
@@ -1211,7 +1212,7 @@ def run_scan(options: dict):
 )
 # Flag item if CPRs/emails/phones found, faces detected, or EXIF PII found
-if (_cpr_qualifies and cprs) or _distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii:
+if (_cpr_qualifies and cprs) or (not cpr_only and (_distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii)):
 # Make thumbnail
 if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
 thumb = _make_thumb(content, name)
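The flagging conditions changed in the hunks above all follow one rule: a qualifying CPR hit always flags the item, and any other finding flags it only when `cpr_only` is off. A small sketch of that predicate, assuming the same inputs the hunks use; the function name `should_flag` is hypothetical and does not appear in the commit.

```python
def should_flag(cpr_qualifies, cprs, emails, phones, face_count,
                exif_has_pii, cpr_only=False):
    """Return True if an item should be reported under the given options."""
    has_cpr = bool(cpr_qualifies and cprs)
    has_other = bool(emails or phones or face_count > 0 or exif_has_pii)
    # CPR hits always count; other findings count only outside CPR-only mode.
    return has_cpr or (not cpr_only and has_other)


# An email-only hit is reported by default but dropped in CPR-only mode.
print(should_flag(False, [], ["a@example.dk"], [], 0, False))
print(should_flag(False, [], ["a@example.dk"], [], 0, False, cpr_only=True))
```

The skip condition in `run_file_scan` is the logical negation of this predicate, so the file scan and the M365 checks stay consistent with each other.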


@@ -142,6 +142,11 @@ function _applyProfile(profile) {
 if (el) el.value = opts.ocr_lang;
 }
+if (opts.cpr_only !== undefined) {
+const el = document.getElementById('optCprOnly');
+if (el) el.checked = opts.cpr_only;
+}
 if (opts.scan_emails !== undefined) {
 const el = document.getElementById('optScanEmails');
 if (el) el.checked = opts.scan_emails;
@@ -432,6 +437,7 @@ function _openEditorForProfile(profile) {
 <div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
 <div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
 <div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div>
+<div class="pmgmt-opt-row"><span>${t('m365_opt_cpr_only','CPR-only mode')}</span><label class="toggle"><input type="checkbox" id="peOptCprOnly" ${opts.cpr_only ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
 <div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_ocr_lang','OCR-sprog')}</span><select id="peOptOcrLang" style="font-size:11px;padding:2px 4px;background:var(--surface);border:1px solid var(--border);color:var(--text);border-radius:4px"><option value="dan+eng" ${(opts.ocr_lang||'dan+eng')==='dan+eng'?'selected':''}>dan+eng</option><option value="dan" ${opts.ocr_lang==='dan'?'selected':''}>dan</option><option value="eng" ${opts.ocr_lang==='eng'?'selected':''}>eng</option><option value="dan+eng+deu" ${opts.ocr_lang==='dan+eng+deu'?'selected':''}>dan+eng+deu</option><option value="dan+eng+swe" ${opts.ocr_lang==='dan+eng+swe'?'selected':''}>dan+eng+swe</option><option value="dan+eng+fra" ${opts.ocr_lang==='dan+eng+fra'?'selected':''}>dan+eng+fra</option></select></div>
 <div class="pmgmt-opt-row"><span>${t('m365_opt_scan_emails','Søg efter e-mailadresser')}</span><label class="toggle"><input type="checkbox" id="peOptEmails" ${opts.scan_emails ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
 <div class="pmgmt-opt-row"><span>${t('m365_opt_scan_phones','Søg efter telefonnumre')}</span><label class="toggle"><input type="checkbox" id="peOptPhones" ${opts.scan_phones ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
@@ -652,6 +658,7 @@ async function _pmgmtSaveFullEdit() {
 skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false,
 min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1,
 ocr_lang: document.getElementById('peOptOcrLang')?.value || 'dan+eng',
+cpr_only: document.getElementById('peOptCprOnly')?.checked ?? false,
 scan_emails: document.getElementById('peOptEmails')?.checked ?? false,
 scan_phones: document.getElementById('peOptPhones')?.checked ?? false,
 },


@@ -128,6 +128,7 @@ function buildScanPayload() {
 skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false,
 min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1,
 ocr_lang: document.getElementById('optOcrLang')?.value || 'dan+eng',
+cpr_only: document.getElementById('optCprOnly') ? document.getElementById('optCprOnly').checked : false,
 scan_emails: document.getElementById('optScanEmails') ? document.getElementById('optScanEmails').checked : false,
 scan_phones: document.getElementById('optScanPhones') ? document.getElementById('optScanPhones').checked : false,
 retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false,


@@ -152,6 +152,14 @@ document.addEventListener('DOMContentLoaded', applyI18n);
 </select>
 </div>
+<!-- CPR-only mode -->
+<div class="toggle-row">
+<span class="toggle-label" style="flex:1">
+<span data-i18n="m365_opt_cpr_only">CPR-only mode</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_cpr_only_hint">Only flag files that contain CPR numbers. Files with only email addresses, phone numbers, faces, or EXIF metadata are ignored.</span></span>
+</span>
+<label class="toggle"><input type="checkbox" id="optCprOnly"><span class="toggle-slider"></span></label>
+</div>
 <!-- Scan for email addresses -->
 <div class="toggle-row">
 <span class="toggle-label" style="flex:1">