recap: Added email and phone number detection as opt-in scan options across all three engines, plus translation fixes. Both CHANGELOG and SUGGESTIONS are updated — everything is committed and ready to test.

StyxX65 2026-04-25 19:33:28 +02:00
parent 56a744d896
commit 2254e00481
14 changed files with 254 additions and 42 deletions

View File

@ -11,6 +11,8 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html
### Added
- **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).
- **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
- **`POST /api/file_sources/upload_key`** — new endpoint that validates and stores an SSH private key file, returning a `key_path` for use in the source definition.
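
As a rough illustration of the email and phone detection described in the first entry above, the sketch below counts distinct matches in a string. The two patterns are copied from this commit's `_EMAIL_RE` and `_PHONE_RE`; the helper name `count_contacts` is hypothetical and is not part of the codebase.

```python
import re

# Patterns copied from this commit's _EMAIL_RE / _PHONE_RE.
EMAIL_RE = re.compile(
    r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
PHONE_RE = re.compile(
    r'(?:'
    r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}'       # +45/0045 DDDD DDDD
    r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}'   # +45/0045 DD DD DD DD
    r'|\b[2-9]\d{7}\b'                                  # 8 consecutive digits
    r'|\b[2-9]\d{3}[\s\-]\d{4}\b'                       # DDDD DDDD
    r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b'                   # DD DD DD DD
    r')'
)


def count_contacts(text: str) -> dict:
    """Hypothetical helper: count distinct emails and phone numbers in text."""
    # Emails are lower-cased so that case variants collapse into one entry.
    emails = dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(text))
    # Phones lose spaces and hyphens so that spacing variants collapse too.
    phones = dict.fromkeys(
        re.sub(r'[\s\-]', '', m.group(0)) for m in PHONE_RE.finditer(text)
    )
    return {"email_count": len(emails), "phone_count": len(phones)}


sample = "Anna@Example.dk, anna@example.dk, 20 12 34 56, 2012-3456"
print(count_contacts(sample))  # {'email_count': 1, 'phone_count': 1}
```

In this sketch a number written once with and once without the `+45` prefix counts as two distinct values, because only whitespace and hyphens are stripped before deduplication.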

View File

@ -350,3 +350,14 @@ Write redacted copies of flagged files with CPR numbers replaced by `XXX XXXX-XX
### Email notification on scan completion (non-scheduled) ✅
Auto-email now fires on manual scans when **Email report after manual scan** is enabled in Settings → Email report. Toggle stored as `auto_email_manual` in `smtp.json`. Implemented in `routes/scan.py`: `_maybe_send_auto_email()` is called from the `_run()` thread after `run_scan()` returns. Same Graph-first → SMTP-fallback pattern as scheduled scans. Only fires when there are flagged items and at least one recipient is configured.
### Phase 2 PII: name-based roster lookup
Flag documents containing the full names of students or staff — even when no CPR is present. Implementation outline:
1. **Roster source** — pull names from the M365 directory (`/users?$select=displayName`), the GWS directory (`admin.list_users`), or a user-uploaded CSV. Store as a flat list of `(first, last)` pairs, minimum length threshold (~5 chars per part) to suppress common first-name noise.
2. **Multi-pattern search** — build an Aho-Corasick automaton from the roster at scan start (`pyahocorasick`, ~50 KB, optional dep). Run each extracted text through the automaton; a hit qualifies only when the match falls on a word boundary and both first + last name appear within a configurable window (e.g. 100 characters apart).
3. **Integration** — same `_find_emails_phones`-style helper in `cpr_detector.py`; roster loaded once per scan run and passed as a parameter. New `name_count` column in `flagged_items` (DB migration). New `name-badge` in the UI. Opt-in profile toggle like `scan_emails`.
4. **NER fallback** — optionally run `spaCy` `da_core_news_sm` (~200 MB) when no roster is available to detect PERSON entities. Much higher false-positive rate; only useful as a discovery tool.
**Why deferred:** requires a roster-management UI (upload CSV, choose directory source, refresh cadence), and false-positive rate depends heavily on roster quality. Name-only matches also carry lower legal weight than CPR hits. Implement after a school explicitly requests it.
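
The window rule from step 2 could look roughly like the sketch below. It is an assumption-heavy illustration, not the planned implementation: it swaps the Aho-Corasick automaton for plain `re` searches to stay dependency-free, and the function name `find_roster_names` and its parameters are hypothetical.

```python
import re


def find_roster_names(text, roster, window=100, min_len=5):
    """Return roster (first, last) pairs whose parts both occur in text.

    A pair qualifies only if each part matches on word boundaries and some
    first-name hit starts within `window` characters of a last-name hit.
    Parts shorter than `min_len` are skipped to suppress common-name noise.
    """
    lowered = text.lower()
    hits = []
    for first, last in roster:
        if len(first) < min_len or len(last) < min_len:
            continue
        first_pos = [
            m.start()
            for m in re.finditer(r'\b' + re.escape(first.lower()) + r'\b', lowered)
        ]
        last_pos = [
            m.start()
            for m in re.finditer(r'\b' + re.escape(last.lower()) + r'\b', lowered)
        ]
        if any(abs(f - l) <= window for f in first_pos for l in last_pos):
            hits.append((first, last))
    return hits


roster = [("Mette", "Jensen"), ("Ole", "Hansen"), ("Frederik", "Nielsen")]
text = "Mette talte med klassen. Senere kom fru Jensen forbi. Frederiksen var syg, og Nielsen ligeså."
print(find_roster_names(text, roster))  # [('Mette', 'Jensen')]
```

The per-name regex loop is linear in roster size for every document, which is the cost the automaton in step 2 is meant to avoid; the sketch only shows the qualification logic.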

View File

@ -22,6 +22,7 @@ from __future__ import annotations
import base64
import hashlib
import io
import re
import tempfile
import threading
from pathlib import Path
@ -505,55 +506,139 @@ def _detect_photo_faces(content: bytes, filename: str) -> int:
return 0
_EMAIL_RE = re.compile(
r'\b[a-zA-Z0-9][a-zA-Z0-9._%+\-]*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
)
_PHONE_RE = re.compile(
r'(?:'
r'(?:\+45|0045)[\s\-]?[2-9]\d{3}[\s\-]?\d{4}' # +45/0045 DDDD DDDD
r'|(?:\+45|0045)[\s\-]?[2-9]\d(?:[\s\-]\d{2}){3}' # +45/0045 DD DD DD DD
r'|\b[2-9]\d{7}\b' # 8 consecutive digits
r'|\b[2-9]\d{3}[\s\-]\d{4}\b' # DDDD DDDD
r'|\b[2-9]\d(?:[\s\-]\d{2}){3}\b' # DD DD DD DD
r')'
)
def _extract_text_from_bytes(content: bytes, filename: str) -> str:
"""Extract plain text from file bytes for email/phone pattern matching.
Returns empty string for binary media files (photos, video, audio) and
on any parse error; callers can rely on this function never raising.
"""
ext = Path(filename).suffix.lower()
try:
if ext in {".txt", ".csv", ".eml", ".msg"}:
return content.decode("utf-8", errors="replace")
if ext in {".docx", ".doc"}:
from docx import Document as _Doc
doc = _Doc(io.BytesIO(content))
parts = [p.text for p in doc.paragraphs]
for tbl in doc.tables:
for row in tbl.rows:
for cell in row.cells:
parts.append(cell.text)
return "\n".join(parts)
if ext in {".xlsx", ".xlsm"}:
import openpyxl as _xl
wb = _xl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
parts = [
str(cell.value)
for ws in wb.worksheets
for row in ws.iter_rows()
for cell in row
if cell.value is not None
]
wb.close()
return " ".join(parts)
if ext == ".pdf":
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as pdf:
parts = [p.extract_text() or "" for p in pdf.pages]
return "\n".join(parts)
except Exception:
pass
if ext not in PHOTO_EXTS | VIDEO_EXTS | AUDIO_EXTS:
try:
return content.decode("utf-8", errors="replace")
except Exception:
pass
return ""
def _find_emails_phones(text: str) -> dict:
"""Extract unique email addresses and Danish phone numbers from text.
Returns {"emails": [{"formatted": str}, ...], "phones": [{"formatted": str}, ...]}.
Phones are normalised to digit-only strings (preserving a leading '+').
"""
if not text:
return {"emails": [], "phones": []}
emails = list(dict.fromkeys(m.group(0).lower() for m in _EMAIL_RE.finditer(text)))
phones = list(dict.fromkeys(
('+' + re.sub(r'[\s\-]', '', m.group(0)[1:]) if m.group(0).lstrip().startswith('+')
else re.sub(r'[\s\-]', '', m.group(0)))
for m in _PHONE_RE.finditer(text)
))
return {
"emails": [{"formatted": e} for e in emails],
"phones": [{"formatted": p} for p in phones],
}
def _scan_bytes(content: bytes, filename: str, poppler_path=None) -> dict:
"""Scan raw bytes for CPRs. Returns scanner result dict."""
"""Scan raw bytes for CPRs, emails, and phone numbers. Returns result dict."""
if not SCANNER_OK:
return {"cprs": [], "dates": [], "error": "scanner not available"}
return {"cprs": [], "dates": [], "emails": [], "phones": [], "error": "scanner not available"}
ext = Path(filename).suffix.lower()
with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
tmp.write(content)
tmp_path = Path(tmp.name)
result: dict = {"cprs": [], "dates": []}
try:
if ext == ".pdf":
# Check if the PDF has a text layer before running full scan_pdf.
# Image-only PDFs (scanned documents) have no text and would trigger
# Tesseract OCR subprocesses that hang indefinitely on some files.
try:
import pdfplumber as _pp, io as _io
with _pp.open(_io.BytesIO(content)) as _pdf:
import pdfplumber as _pp
with _pp.open(io.BytesIO(content)) as _pdf:
has_text = any(ds.is_text_page(p) for p in _pdf.pages)
if not has_text:
return {"cprs": [], "dates": []} # image-only PDF — no CPRs possible
return {"cprs": [], "dates": [], "emails": [], "phones": []}
except Exception:
pass # if pdfplumber fails, fall through to full scan_pdf
return ds.scan_pdf(tmp_path, poppler_path=poppler_path)
result = ds.scan_pdf(tmp_path, poppler_path=poppler_path)
elif ext in {".docx", ".doc"}:
return ds.scan_docx(tmp_path)
result = ds.scan_docx(tmp_path)
elif ext in {".xlsx", ".xlsm"}:
return ds.scan_xlsx(tmp_path)
result = ds.scan_xlsx(tmp_path)
elif ext == ".csv":
return ds.scan_csv(tmp_path)
result = ds.scan_csv(tmp_path)
elif ext == ".txt":
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
result = {"cprs": cprs, "dates": dates}
elif ext in {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}:
return ds.scan_image(tmp_path)
result = ds.scan_image(tmp_path)
else:
# Try plain text
try:
text = content.decode("utf-8", errors="replace")
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
result = {"cprs": cprs, "dates": dates}
except Exception:
return {"cprs": [], "dates": []}
pass
except Exception as e:
return {"cprs": [], "dates": [], "error": str(e)}
result = {"cprs": [], "dates": [], "error": str(e)}
finally:
try:
tmp_path.unlink()
except Exception:
pass
ep = _find_emails_phones(_extract_text_from_bytes(content, filename))
result["emails"] = ep["emails"]
result["phones"] = ep["phones"]
return result
def _worker_scan_pdf(pdf_path_str: str, result_q) -> None:
"""Worker executed in a spawned subprocess — must be a module-level function."""
@ -607,19 +692,22 @@ def _scan_bytes_timeout(content: bytes, filename: str, timeout: int = 60) -> dic
def _scan_text_direct(text: str) -> dict:
"""Scan a plain text string for CPRs using extract_matches.
"""Scan a plain text string for CPRs, emails, and phone numbers.
Uses ds.extract_matches() directly rather than ds.scan_text() because
scan_text() calls extract_cpr_and_dates() which is not defined in
document_scanner.py (pre-existing bug).
"""
if not SCANNER_OK or not text:
return {"cprs": [], "dates": []}
if not text:
return {"cprs": [], "dates": [], "emails": [], "phones": []}
ep = _find_emails_phones(text)
if not SCANNER_OK:
return {"cprs": [], "dates": [], **ep}
try:
cprs, dates = ds.extract_matches(text, 1, "text")
return {"cprs": cprs, "dates": dates}
return {"cprs": cprs, "dates": dates, **ep}
except Exception:
return {"cprs": [], "dates": []}
return {"cprs": [], "dates": [], **ep}
def _html_esc(s: str) -> str:
"""HTML-escape a string for safe inline embedding."""

View File

@ -200,6 +200,8 @@ _MIGRATIONS: list[tuple[int, str]] = [
(4, "ALTER TABLE flagged_items ADD COLUMN face_count INTEGER NOT NULL DEFAULT 0"),
(5, "ALTER TABLE flagged_items ADD COLUMN exif_json TEXT NOT NULL DEFAULT '{}'"),
(6, "ALTER TABLE flagged_items ADD COLUMN full_path TEXT NOT NULL DEFAULT ''"),
(8, "ALTER TABLE flagged_items ADD COLUMN email_count INTEGER NOT NULL DEFAULT 0"),
(9, "ALTER TABLE flagged_items ADD COLUMN phone_count INTEGER NOT NULL DEFAULT 0"),
(7, """CREATE TABLE IF NOT EXISTS schedule_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at REAL NOT NULL,
@ -311,8 +313,9 @@ class ScanDB:
(id, scan_id, name, source, source_type, account_id, folder,
url, drive_id, size_kb, modified, cpr_count, risk,
thumb_b64, thumb_mime, attachments, user_role, transfer_risk,
special_category, face_count, exif_json, full_path, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
special_category, face_count, exif_json, full_path,
email_count, phone_count, scanned_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
card.get("id", ""),
scan_id,
@ -336,6 +339,8 @@ class ScanDB:
card.get("face_count", 0),
json.dumps(card.get("exif", {})),
card.get("full_path", ""),
card.get("email_count", 0),
card.get("phone_count", 0),
now,
),
)

View File

@ -570,6 +570,12 @@
"m365_opt_skip_gps": "Ignorer GPS i billeder",
"m365_opt_skip_gps_hint": "Billeder med GPS-koordinater flagges ikke — nyttigt ved elevscanninger, hvor smartphones indlejrer placering i alle fotos.",
"m365_opt_min_cpr": "Min. CPR-antal pr. fil",
"m365_opt_scan_emails": "Søg efter e-mailadresser",
"m365_opt_scan_emails_hint": "Flagger filer med e-mailadresser. Slået fra som standard — e-mailadresser er meget almindelige og kan give mange resultater.",
"m365_opt_scan_phones": "Søg efter telefonnumre",
"m365_opt_scan_phones_hint": "Flagger filer med danske telefonnumre (8 cifre). Nyttigt til at finde kontaktlister og forældrekorrespondance.",
"m365_badge_emails": "e-mail",
"m365_badge_phones": "tlf.",
"m365_opt_min_cpr_hint": "Filer med færre distinkte CPR-numre end denne tærskel rapporteres ikke. Sæt til 2 for at undgå falske positive, når elever har egne CPR-numre i filer.",
"m365_filter_photo_only": "📷 Billeder / biometrisk",
"m365_filter_all_roles": "Alle roller",

View File

@ -570,6 +570,12 @@
"m365_opt_skip_gps": "GPS in Bildern ignorieren",
"m365_opt_skip_gps_hint": "Bilder mit GPS-Koordinaten werden nicht markiert — nützlich beim Scannen von Schüler-Konten, deren Smartphones Standort in jedes Foto einbetten.",
"m365_opt_min_cpr": "Min. CPR-Anzahl pro Datei",
"m365_opt_scan_emails": "E-Mail-Adressen scannen",
"m365_opt_scan_emails_hint": "Markiert Dateien mit E-Mail-Adressen. Standardmäßig deaktiviert — E-Mail-Adressen sind sehr häufig und können viele Treffer erzeugen.",
"m365_opt_scan_phones": "Telefonnummern scannen",
"m365_opt_scan_phones_hint": "Markiert Dateien mit dänischen Telefonnummern (8 Ziffern). Nützlich zum Auffinden von Kontaktlisten.",
"m365_badge_emails": "E-Mail",
"m365_badge_phones": "Tel.",
"m365_opt_min_cpr_hint": "Dateien mit weniger eindeutigen CPR-Nummern als dieser Schwellenwert werden nicht gemeldet. Auf 2 setzen, um Falsch-Positive zu vermeiden, wenn Schüler eigene CPR-Nummern in Dateien haben.",
"m365_filter_photo_only": "📷 Fotos / biometrisch",
"m365_filter_all_roles": "Alle Rollen",

View File

@ -570,6 +570,12 @@
"m365_opt_skip_gps": "Ignore GPS in images",
"m365_opt_skip_gps_hint": "Images with GPS coordinates are not flagged — useful when scanning students whose smartphones embed location in every photo.",
"m365_opt_min_cpr": "Min. CPR count per file",
"m365_opt_scan_emails": "Scan for email addresses",
"m365_opt_scan_emails_hint": "Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.",
"m365_opt_scan_phones": "Scan for phone numbers",
"m365_opt_scan_phones_hint": "Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.",
"m365_badge_emails": "email",
"m365_badge_phones": "phone",
"m365_opt_min_cpr_hint": "Files with fewer distinct CPR numbers than this threshold are not reported. Set to 2 to avoid false positives when students have their own CPR in documents.",
"m365_filter_photo_only": "📷 Photos / biometric",
"m365_filter_all_roles": "All roles",

View File

@ -141,6 +141,8 @@ def _run_google_scan(options: dict):
scan_body = bool(scan_opts.get("scan_body", True))
scan_att = bool(scan_opts.get("scan_attachments", True))
delta_enabled = bool(scan_opts.get("delta", False))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
from checkpoint import _load_delta_tokens, _save_delta_tokens
_drive_delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@ -212,6 +214,8 @@ def _run_google_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -278,7 +282,11 @@ def _run_google_scan(options: dict):
continue
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
if cprs or (pii_counts and any(pii_counts.values())):
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (pii_counts and any(pii_counts.values())) or _em or _ph:
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts)
except GoogleError as e:
broadcast("scan_error", {"file": f"Gmail/{user_email}", "error": str(e)})
@ -336,7 +344,11 @@ def _run_google_scan(options: dict):
continue
cprs = result.get("cprs", [])
pii_counts = result.get("pii_counts")
if cprs or (pii_counts and any(pii_counts.values())):
_em = list(dict.fromkeys(e["formatted"] for e in result.get("emails", []))) if scan_emails else []
_ph = list(dict.fromkeys(p["formatted"] for p in result.get("phones", []))) if scan_phones else []
if cprs or (pii_counts and any(pii_counts.values())) or _em or _ph:
meta["_email_count"] = len(_em)
meta["_phone_count"] = len(_ph)
_broadcast_card(meta, cprs, pii_counts)
except GoogleError as e:
broadcast("scan_error", {"file": f"Drive/{user_email}", "error": str(e)})

View File

@ -182,6 +182,8 @@ def run_file_scan(source: dict):
scan_photos = bool(source.get("scan_photos", False))
skip_gps_images = bool(source.get("skip_gps_images", False))
min_cpr_count = max(1, int(source.get("min_cpr_count", 1)))
scan_emails = bool(source.get("scan_emails", False))
scan_phones = bool(source.get("scan_phones", False))
max_mb = int(source.get("max_file_mb", 50))
if source_kind == "sftp":
@ -269,6 +271,8 @@ def run_file_scan(source: dict):
continue
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# Photo / biometric scan + EXIF/video/audio metadata extraction
_face_count = 0
@ -285,11 +289,13 @@ def run_file_scan(source: dict):
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
if not (_cpr_qualifies and cprs) and _face_count == 0 and not _exif_has_pii:
if not (_cpr_qualifies and cprs) and not _distinct_emails and not _distinct_phones and _face_count == 0 and not _exif_has_pii:
continue
# Build card metadata
@ -325,6 +331,8 @@ def run_file_scan(source: dict):
"source": label,
"source_type": source_type,
"cpr_count": len(cprs),
"email_count": len(_distinct_emails),
"phone_count": len(_distinct_phones),
"url": "",
"size_kb": meta["size_kb"],
"modified": meta["modified"],
@ -437,6 +445,8 @@ def run_scan(options: dict):
scan_photos = bool(scan_opts.get("scan_photos", False)) # biometric photo scan (#9)
skip_gps_images= bool(scan_opts.get("skip_gps_images", False))
min_cpr_count = max(1, int(scan_opts.get("min_cpr_count", 1)))
scan_emails = bool(scan_opts.get("scan_emails", False))
scan_phones = bool(scan_opts.get("scan_phones", False))
# Delta token state — loaded once, updated per-source, saved on completion
delta_tokens: dict = _load_delta_tokens() if delta_enabled else {}
@ -490,6 +500,8 @@ def run_scan(options: dict):
"source": item_meta.get("_source", ""),
"source_type": item_meta.get("_source_type", ""),
"cpr_count": len(cprs),
"email_count": item_meta.get("_email_count", 0),
"phone_count": item_meta.get("_phone_count", 0),
"url": item_meta.get("webUrl", "") or item_meta.get("_url", ""),
"size_kb": round(item_meta.get("size", 0) / 1024, 1),
"modified": (item_meta.get("lastModifiedDateTime") or item_meta.get("receivedDateTime") or "")[:10],
@ -1057,11 +1069,17 @@ def run_scan(options: dict):
# Scan body — use pre-extracted text (body HTML was stripped at
# collection time to keep work_items memory footprint small)
all_cprs = []
all_emails = []
all_phones = []
body_text = ""
if scan_email_body:
body_text = meta.pop("_precomputed_body", "")
body_result = _scan_text_direct(body_text)
all_cprs = list(body_result.get("cprs", []))
if scan_emails:
all_emails = list(body_result.get("emails", []))
if scan_phones:
all_phones = list(body_result.get("phones", []))
# Scan attachments
uid = meta.get("_account_id", "me")
@ -1084,14 +1102,22 @@ def run_scan(options: dict):
att_result = _scan_bytes(att_bytes, att_name)
att_cprs = att_result.get("cprs", [])
all_cprs.extend(att_cprs)
if scan_emails:
all_emails.extend(att_result.get("emails", []))
if scan_phones:
all_phones.extend(att_result.get("phones", []))
att_results.append({"name": att_name, "cpr_count": len(att_cprs)})
except Exception as att_err:
broadcast("scan_error", {"file": att_name, "error": str(att_err)})
if all_cprs:
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in all_emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in all_phones))
if all_cprs or _distinct_emails or _distinct_phones:
meta["_thumb"] = _placeholder_svg(".eml", subject)
meta["_thumb_is_jpeg"] = False
meta["_attachments"] = att_results
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_email_pii = _get_pii_counts(body_text) if scan_email_body else {}
meta["_transfer_risk"] = _check_transfer_risk(meta)
meta["_special_category"] = _check_special_category(
@ -1121,10 +1147,12 @@ def run_scan(options: dict):
else:
content = conn.download_item(meta)
# CPR scan — skip for video and audio (metadata-only; no text layer)
# CPR/email/phone scan — skip for video and audio (metadata-only; no text layer)
_media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS
result = {"cprs": [], "dates": []} if _media_only else _scan_bytes(content, name)
result = {"cprs": [], "dates": [], "emails": [], "phones": []} if _media_only else _scan_bytes(content, name)
cprs = result.get("cprs", [])
emails = result.get("emails", []) if scan_emails else []
phones = result.get("phones", []) if scan_phones else []
# ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─
_face_count = 0
@ -1141,12 +1169,14 @@ def run_scan(options: dict):
# Apply filters: distinct CPR threshold and GPS suppression
_distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
_cpr_qualifies = len(_distinct_cprs) >= min_cpr_count
_distinct_emails = list(dict.fromkeys(e["formatted"] for e in emails))
_distinct_phones = list(dict.fromkeys(p["formatted"] for p in phones))
_exif_has_pii = _exif.get("has_pii") and (
not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author"))
)
# Flag item if CPRs found (above threshold), faces detected, or EXIF PII found
if (_cpr_qualifies and cprs) or _face_count > 0 or _exif_has_pii:
# Flag item if CPRs/emails/phones found, faces detected, or EXIF PII found
if (_cpr_qualifies and cprs) or _distinct_emails or _distinct_phones or _face_count > 0 or _exif_has_pii:
# Make thumbnail
if ext in {".jpg", ".jpeg", ".png"} and PIL_OK:
thumb = _make_thumb(content, name)
@ -1182,6 +1212,8 @@ def run_scan(options: dict):
meta["_special_category"] = _sc
meta["_face_count"] = _face_count
meta["_exif"] = _exif
meta["_email_count"] = len(_distinct_emails)
meta["_phone_count"] = len(_distinct_phones)
_broadcast_card(meta, cprs, pii_counts=_file_pii)
else:
del content # no hits — free raw bytes immediately

View File

@ -137,6 +137,16 @@ function _applyProfile(profile) {
if (el) el.value = opts.min_cpr_count;
}
if (opts.scan_emails !== undefined) {
const el = document.getElementById('optScanEmails');
if (el) el.checked = opts.scan_emails;
}
if (opts.scan_phones !== undefined) {
const el = document.getElementById('optScanPhones');
if (el) el.checked = opts.scan_phones;
}
// ── Date filter ───────────────────────────────────────────────────────────
const days = opts.older_than_days;
if (days !== undefined) {
@ -417,6 +427,8 @@ function _openEditorForProfile(profile) {
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_photos','Søg efter ansigter i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptPhotos" ${opts.scan_photos ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_skip_gps','Ignorer GPS i billeder')}</span><label class="toggle"><input type="checkbox" id="peOptSkipGps" ${opts.skip_gps_images ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span style="color:var(--muted)">${t('m365_opt_min_cpr','Min. CPR-antal pr. fil')}</span><input type="number" id="peOptMinCpr" value="${opts.min_cpr_count || 1}" min="1" max="50" style="width:46px;padding:3px 6px;font-size:11px;text-align:right"></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_emails','Søg efter e-mailadresser')}</span><label class="toggle"><input type="checkbox" id="peOptEmails" ${opts.scan_emails ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div class="pmgmt-opt-row"><span>${t('m365_opt_scan_phones','Søg efter telefonnumre')}</span><label class="toggle"><input type="checkbox" id="peOptPhones" ${opts.scan_phones ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<hr style="border:none;border-top:1px solid var(--pmgmt-divider);margin:2px 0">
<div class="pmgmt-opt-row"><span>${t('m365_opt_retention','Opbevaringspolitik')}</span><label class="toggle"><input type="checkbox" id="peOptRetention" ${profile.retention_years ? 'checked' : ''}><span class="toggle-slider"></span></label></div>
<div style="padding:7px 8px;background:var(--bg);border-radius:6px">
@ -633,6 +645,8 @@ async function _pmgmtSaveFullEdit() {
scan_photos: document.getElementById('peOptPhotos')?.checked ?? false,
skip_gps_images: document.getElementById('peOptSkipGps')?.checked ?? false,
min_cpr_count: parseInt(document.getElementById('peOptMinCpr')?.value) || 1,
scan_emails: document.getElementById('peOptEmails')?.checked ?? false,
scan_phones: document.getElementById('peOptPhones')?.checked ?? false,
},
retention_years: document.getElementById('peOptRetention')?.checked ? (parseInt(document.getElementById('peOptRetYears')?.value) || 5) : null,
fiscal_year_end: document.getElementById('peOptRetention')?.checked ? (document.getElementById('peOptFiscalYearEnd')?.value || '') : '',

View File

@ -46,6 +46,8 @@ function appendCard(f) {
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span> ${f.source || ''}${f.account_name ? ' · <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === 'student' ? '<span class="role-badge">' + t('role_student','Elev') + '</span>' : f.user_role === 'staff' ? '<span class="role-badge">' + t('role_staff','Ansat') + '</span>' : '') + f.account_name + '</span>' : ''}${f.transfer_risk === 'external-recipient' ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>
${f.email_count > 0 ? '<span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span> ' : ''}
${f.phone_count > 0 ? '<span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span> ' : ''}
${f.face_count > 0 ? '<span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span> ' : ''}
${f.exif && f.exif.gps ? '<span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span> ' : ''}
${f.special_category && f.special_category.length ? '<span class="special-cat-badge">⚠ Art.9 — ' + f.special_category.filter(function(s){return s !== 'gps_location' && s !== 'exif_pii';}).join(', ') + '</span> ' : ''}${f.overdue ? '<span class="overdue-badge">🗓 Overdue</span>' : ''}
@ -58,7 +60,7 @@ function appendCard(f) {
<div class="card-meta">${f.size_kb} KB · ${f.modified || ''}</div>
${f.folder ? `<div class="card-meta" style="font-size:10px" title="${f.folder}">📂 ${f.folder}</div>` : ''}
<div class="card-source"><span class="source-badge ${badgeCls}">${label}</span>${f.account_name ? ' <span class="account-pill" title="' + f.account_name + '">' + (f.user_role === "student" ? '<span class="role-badge">' + t("role_student","Elev") + "</span>" : f.user_role === "staff" ? '<span class="role-badge">' + t("role_staff","Ansat") + "</span>" : "") + f.account_name + '</span>' : ''}${f.transfer_risk === "external-recipient" ? ' <span class="role-pill" style="background:#7B2D00;color:#FFD0B0"> Ext.</span>' : f.transfer_risk ? ' <span class="role-pill" style="background:#003D7B;color:#B0D4FF">🔗</span>' : ''}</div>
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
<span class="cpr-badge">${f.cpr_count} CPR</span>${f.email_count > 0 ? ' <span class="email-badge">' + f.email_count + ' ' + t('m365_badge_emails', 'e-mail') + '</span>' : ''}${f.phone_count > 0 ? ' <span class="phone-badge">' + f.phone_count + ' ' + t('m365_badge_phones', 'tlf.') + '</span>' : ''}${f.face_count > 0 ? ' <span class="photo-face-badge">' + f.face_count + ' ' + t('m365_badge_faces', f.face_count === 1 ? 'face' : 'faces') + '</span>' : ''}${f.exif && f.exif.gps ? ' <span class="photo-face-badge" style="background:#0a3a5a;color:#7ec8d0">🌍 GPS</span>' : ''}${f.overdue ? ' <span class="overdue-badge">🗓 Overdue</span>' : ''}
</div>
${delBtn}`;
}
@ -102,6 +104,8 @@ async function openPreview(f) {
f.size_kb ? `<span>${f.size_kb} KB</span>` : '',
f.modified ? `<span>${f.modified}</span>` : '',
f.cpr_count ? `<span style="color:var(--danger)">${f.cpr_count} CPR</span>` : '',
f.email_count ? `<span style="color:#7ec8f0">${f.email_count} ${t('m365_badge_emails','e-mail')}</span>` : '',
f.phone_count ? `<span style="color:#7eeac0">${f.phone_count} ${t('m365_badge_phones','tlf.')}</span>` : '',
f.url ? `<button class="preview-open-btn" onclick="window.open('${f.url}','_blank')">${t("m365_preview_open","Open in M365 ↗")}</button>` : '',
].filter(Boolean).join('');

View File

@ -127,6 +127,8 @@ function buildScanPayload() {
scan_photos: document.getElementById('optScanPhotos') ? document.getElementById('optScanPhotos').checked : false,
skip_gps_images: document.getElementById('optSkipGps') ? document.getElementById('optSkipGps').checked : false,
min_cpr_count: document.getElementById('optMinCpr') ? (parseInt(document.getElementById('optMinCpr').value) || 1) : 1,
scan_emails: document.getElementById('optScanEmails') ? document.getElementById('optScanEmails').checked : false,
scan_phones: document.getElementById('optScanPhones') ? document.getElementById('optScanPhones').checked : false,
retention_enabled: document.getElementById('optRetention') ? document.getElementById('optRetention').checked : false,
retention_years: parseInt(document.getElementById('optRetentionYears')?.value) || 5,
fiscal_year_end: document.getElementById('optFiscalYearEnd')?.value || '',
@ -588,6 +590,8 @@ function startScan(resume) {
scan_photos: options.scan_photos || false,
skip_gps_images: options.skip_gps_images || false,
min_cpr_count: options.min_cpr_count || 1,
scan_emails: options.scan_emails || false,
scan_phones: options.scan_phones || false,
}))
}).catch(e => { log('File scan error: ' + e, 'err'); });
});

View File

@ -491,6 +491,12 @@
.overdue-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #7c3200; color: #ffb347; font-weight: 600; white-space: nowrap; }
[data-theme="light"] .overdue-badge { background: #fff3e0; color: #c55a00; }
.email-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a3a5c; color: #7ec8f0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .email-badge { background: #d0eaff; color: #004a80; }
.phone-badge { font-size: 9px; padding: 1px 5px; border-radius: 10px;
background: #1a4030; color: #7eeac0; font-weight: 500; white-space: nowrap; }
[data-theme="light"] .phone-badge { background: #d0f5ea; color: #005a3a; }
.badge-email { background: rgba(139,68,173,.2); color: #b87fd8; }
.badge-onedrive { background: rgba(0,120,212,.2); color: #5ba4e8; }
.badge-sharepoint { background: rgba(0,160,100,.2); color: #2ecc71; }

View File

@ -137,6 +137,22 @@ document.addEventListener('DOMContentLoaded', applyI18n);
style="width:46px;padding:3px 6px;font-size:11px;text-align:right">
</div>
<!-- Scan for email addresses -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_emails">Scan for email addresses</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_emails_hint">Flags files that contain email addresses. Off by default — email addresses are very common and may produce many results.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanEmails"><span class="toggle-slider"></span></label>
</div>
<!-- Scan for phone numbers -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">
<span data-i18n="m365_opt_scan_phones">Scan for phone numbers</span><span class="hint-wrap"><span class="hint-icon" onclick="toggleHint(this)">?</span><span class="hint-bubble" data-i18n="m365_opt_scan_phones_hint">Flags files containing Danish phone numbers (8 digits). Useful for finding contact lists and parent correspondence.</span></span>
</span>
<label class="toggle"><input type="checkbox" id="optScanPhones"><span class="toggle-slider"></span></label>
</div>
<!-- Retention policy (suggestion #1) -->
<div class="toggle-row">
<span class="toggle-label" style="flex:1">