Added tests for Video & Audio

feat: video/audio metadata scanning, profile rename fix, route tests

- Scan .mp4/.mov/.avi/.mkv and .mp3/.flac/.ogg/.m4a/.wma (+ 9 more) for GPS coordinates, artist/author, title, comment. Metadata only; no frame or audio analysis. Uses mutagen (added to requirements.txt). GPS-tagged phone recordings are now flagged with gps_location, like photos.
- Fix _extract_audio_metadata silently returning empty results: mutagen.File() first positional arg is `filename`, not `fileobj`, so the BytesIO was being passed as the filename. Fixed by switching to keyword args.
- Fix profile copy rename not reflected in left column until modal reopen: _pmgmtSaveFullEdit called loadProfiles() but never _renderProfileMgmt(). Added re-render and active-row highlight.
- Add TestProfileRoutes (10 tests) covering all profile API endpoints, including a rename regression test. Total: 182 tests.
- generate_fixtures.py now produces 6 audio/video fixtures (14–19): 2 MP3, 2 FLAC, 2 MP4; 3 flagged, 3 negative cases.
2026-04-21 21:26:58 +02:00
commit d42518dc81
parent 2a2d79de90
16 changed files with 476 additions and 21 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,14 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html

 ## [1.6.23] — 2026-04-21

+### Added
+
+- **Video file metadata scanning** — `.mp4`, `.mov`, `.m4v`, `.avi`, `.mkv`, `.wmv`, `.flv`, `.webm` files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime `©xyz` atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the `gps_location` special category, exactly like a geotagged photo. AVI metadata (RIFF INFO `INAM`/`IART`/`ICMT`) is parsed without any external library. Requires `mutagen>=1.47` (added to `requirements.txt`).
+
+- **Audio file metadata scanning** — `.mp3`, `.flac`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.wav`, `.opus`, `.aiff` files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses `mutagen.File(easy=True)` which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with `exif_pii`. Fixed a silent bug in `_extract_audio_metadata` where `mutagen.File(io.BytesIO(content), filename)` was passing the BytesIO as the `filename` positional argument; corrected to `mutagen.File(fileobj=..., filename=...)`.
+
+- **Audio and video test fixtures** — `tests/fixtures/local_files/generate_fixtures.py` now generates 6 new fixtures: `14_audio_artist_pii.mp3`, `15_audio_artist_pii.flac` (artist name → flag), `16_audio_no_pii.mp3`, `17_audio_no_pii.flac` (no tags → no flag), `18_video_gps.mp4` (GPS + artist → flag), `19_video_no_pii.mp4` (no tags → no flag). Total fixtures: 19 (13 flagged, 6 negative).
+
 ### Fixed

 - **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.
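
The `©xyz` atom mentioned in the video entry above holds an ISO 6709 string such as `+55.6763+012.5681+005.000/`. A minimal sketch of parsing it with the standard library follows. `parse_iso6709` is a hypothetical helper name, not a function from this commit, and it only handles the decimal-degrees form shown in the changelog.

```python
import re

# Latitude and longitude are two signed decimal numbers written back to back.
# An optional altitude and the trailing "/" are ignored by this sketch.
_ISO6709 = re.compile(r"([+-]\d+(?:\.\d+)?)([+-]\d+(?:\.\d+)?)")


def parse_iso6709(value: str):
    """Return (lat, lon) as floats, or None if the string does not match."""
    m = _ISO6709.match(value.strip())
    if not m:
        return None
    return float(m.group(1)), float(m.group(2))


print(parse_iso6709("+55.6763+012.5681+005.000/"))  # (55.6763, 12.5681)
print(parse_iso6709("-33.8688+151.2093/"))
print(parse_iso6709("no location"))
```

A negative latitude maps to the southern hemisphere and a negative longitude to the western one, which is how the `lat_ref`/`lon_ref` fields in the diff are derived.
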
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -46,7 +46,7 @@ python -m pytest tests/ -q

 **`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`.

-**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 13 documents for manual/UI-level testing of the file scanner. 10 should be flagged; 3 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx` + `openpyxl`, already in venv) regenerates the binary `.docx`/`.xlsx` files.
+**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 13 should be flagged; 6 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen`; all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. MP3 fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag.

 **`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` builds a single string by concatenating all run texts with no separators between paragraphs. If a CPR value run is immediately followed by text from the next paragraph without a word boundary, `\b` in `CPR_PATTERN` fails and the number is silently missed. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries after concatenation. Do not remove this trailing space — the detection will silently regress.
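
The word-boundary failure described above can be reproduced with the standard library alone. This is a sketch under assumptions: `SIMPLE_CPR` is a simplified stand-in for the real `CPR_PATTERN`, and the sample number is made up (it is not checked against `is_valid_cpr`).

```python
import re

# Simplified stand-in for CPR_PATTERN (assumption): DDMMYY-XXXX between word boundaries.
SIMPLE_CPR = re.compile(r"\b\d{6}-\d{4}\b")

runs_without_space = ["CPR: ", "010190-1234", "Next paragraph"]
runs_with_space = ["CPR: ", "010190-1234 ", "Next paragraph"]

# Runs are joined with no separator, as the note says scan_docx does.
joined_bad = "".join(runs_without_space)
joined_good = "".join(runs_with_space)

print(SIMPLE_CPR.findall(joined_bad))   # []
print(SIMPLE_CPR.findall(joined_good))  # ['010190-1234']
```

In the first string the final digit is immediately followed by a letter, so there is no word boundary after the number and the pattern cannot match. The trailing space in the second string restores the boundary.
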

--- a/README.md
+++ b/README.md
@@ -617,7 +617,7 @@ The test suite should be run before every release and after any change to `docum

 #### Local-file scan fixtures

-`tests/fixtures/local_files/` provides 13 hand-crafted documents for end-to-end testing of the file scanner via the UI or `file_scanner.py`. Drop the folder as a local source and run a scan — all 10 PII-bearing files should be flagged and all 3 negative-case files should produce zero hits.
+`tests/fixtures/local_files/` provides 19 files for end-to-end testing of the file scanner via the UI or `file_scanner.py`. Drop the folder as a local source and run a scan — all 13 PII-bearing files should be flagged and all 6 negative-case files should produce zero hits.

 | File | Format | Expected | Scenario |
 |---|---|---|---|
@@ -634,8 +634,14 @@ The test suite should be run before every release and after any change to `docum
 | `11_false_positive_invoice.txt` | TXT | **No flag** | Invoice: CPR-shaped numbers suppressed by `faktura`/`varenr` context |
 | `12_post2007_no_context.txt` | TXT | **No flag** | Equipment serial that looks like a post-2007 CPR but has no context keyword |
 | `13_cpr_in_xlsx.xlsx` | XLSX | Flag | Excel workbook with two sheets: students + employees |
+| `14_audio_artist_pii.mp3` | MP3 | Flag | ID3 artist/title tags with a personal name → `exif_pii` |
+| `15_audio_artist_pii.flac` | FLAC | Flag | Vorbis comment artist/title tags with a personal name → `exif_pii` |
+| `16_audio_no_pii.mp3` | MP3 | **No flag** | Empty ID3 header — no metadata tags |
+| `17_audio_no_pii.flac` | FLAC | **No flag** | FLAC with no Vorbis comment block |
+| `18_video_gps.mp4` | MP4 | Flag | QuickTime GPS coordinates (Copenhagen) + artist tag → `gps_location` + `exif_pii` |
+| `19_video_no_pii.mp4` | MP4 | **No flag** | Minimal MP4 container with no metadata |

-All CPR numbers are mathematically valid (verified against `is_valid_cpr`). Run `generate_fixtures.py` inside the venv to regenerate the `.docx` and `.xlsx` binary files after any changes.
+All CPR numbers are mathematically valid (verified against `is_valid_cpr`). Run `generate_fixtures.py` inside the venv to regenerate all binary files after any changes. Requires `python-docx`, `openpyxl`, and `mutagen` (all included in `requirements.txt`).

 ### Roadmap

--- a/cpr_detector.py
+++ b/cpr_detector.py
@@ -5,12 +5,14 @@ Provides:
  _scan_bytes(content, filename)         — dispatch to correct scanner by file type
  _scan_text_direct(text)                — scan a plain text string
  _extract_exif(content, filename)       — extract PII-bearing EXIF tags from images
+  _extract_video_metadata(content, fn)   — extract PII-bearing metadata from video files
+  _extract_audio_metadata(content, fn)   — extract PII-bearing tags from audio files
  _detect_photo_faces(content, fn)       — count faces in an image (OpenCV)
  _get_pii_counts(text)                  — NER-based PII type counts
  _make_thumb(content, filename)         — JPEG thumbnail as base64 string
  _placeholder_svg(ext, name)            — SVG file-type icon

-Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
+Globals SCANNER_OK, PIL_OK, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS, SUPPORTED_EXTS, ds, PILImage, LANG,
 and _check_special_category are injected at startup by gdpr_scanner.py via
 `from cpr_detector import *` AFTER those names are defined.  This keeps the
 module cleanly importable in isolation for unit tests (#26) while preserving
@@ -47,11 +49,17 @@ except ImportError:
    PILImage = None  # type: ignore[assignment]
    PIL_OK = False

+VIDEO_EXTS = {
+    ".mp4", ".mov", ".m4v", ".avi", ".mkv", ".wmv", ".flv", ".webm",
+}
+AUDIO_EXTS = {
+    ".mp3", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".wav", ".opus", ".aiff", ".aif",
+}
 SUPPORTED_EXTS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xlsm", ".csv",
    ".txt", ".eml", ".msg",
    ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp",
-}
+} | VIDEO_EXTS | AUDIO_EXTS
 PHOTO_EXTS = {
    ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp", ".heic", ".heif",
 }
@@ -190,6 +198,226 @@ def _extract_exif(content: bytes, filename: str) -> dict:
    return result


+def _extract_video_metadata(content: bytes, filename: str) -> dict:
+    """Extract PII-bearing metadata from a video file.
+
+    Returns the same structure as _extract_exif so callers can treat both
+    identically:
+        gps        — {lat, lon, lat_ref, lon_ref, maps_url} or None
+        pii_fields — {label: value} for title/artist/comment/description
+        author     — str or None
+        datetime   — str or None
+        device     — str or None
+        has_pii    — bool
+
+    MP4/MOV/M4V: reads QuickTime/MPEG-4 tags via mutagen (no system deps).
+    GPS is extracted from the ©xyz QuickTime atom (ISO 6709 string written by
+    iPhones and Android devices: "+55.6763+012.5681+005.000/").
+    AVI: parses the RIFF INFO list chunk without any external library.
+    All other extensions: returns empty result immediately.
+    """
+    result: dict = {"gps": None, "pii_fields": {}, "author": None,
+                    "datetime": None, "device": None, "has_pii": False}
+    ext = Path(filename).suffix.lower()
+
+    if ext in {".mp4", ".mov", ".m4v"}:
+        _extract_mp4_tags(content, result)
+    elif ext == ".avi":
+        _extract_avi_info(content, result)
+
+    return result
+
+
+def _extract_mp4_tags(content: bytes, result: dict) -> None:
+    """Populate result dict from MPEG-4/QuickTime container tags via mutagen."""
+    try:
+        import mutagen.mp4
+        tags = mutagen.mp4.MP4(io.BytesIO(content)).tags
+        if not tags:
+            return
+
+        # Text fields that may contain personal data
+        _tag_label = {
+            "©nam": "Title",
+            "©cmt": "Comment",
+            "©des": "Description",
+            "desc": "Description",
+            "©lyr": "Lyrics",
+        }
+        for tag, label in _tag_label.items():
+            val = tags.get(tag)
+            if val:
+                text = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
+                if len(text) >= _EXIF_PII_MIN_LEN:
+                    result["pii_fields"][label] = text
+                    result["has_pii"] = True
+
+        # Author — prefer ©ART (artist), fall back to album artist
+        for tag in ("©ART", "aART"):
+            val = tags.get(tag)
+            if val:
+                author = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
+                if len(author) >= _EXIF_PII_MIN_LEN:
+                    result["author"] = author
+                    result["pii_fields"]["Artist"] = author
+                    result["has_pii"] = True
+                break
+
+        # Recording date
+        val = tags.get("©day")
+        if val:
+            result["datetime"] = str(val[0]).strip() if isinstance(val, list) else str(val).strip()
+
+        # Device (QuickTime-specific tags written by iPhones)
+        make  = tags.get("©mak")
+        model = tags.get("©mod")
+        if make or model:
+            result["device"] = " ".join(
+                str(v[0] if isinstance(v, list) else v).strip()
+                for v in (make, model) if v
+            )
+
+        # GPS — QuickTime ©xyz atom: "+55.6763+012.5681+005.000/" (ISO 6709)
+        import re as _re
+        for gps_tag in ("©xyz", "com.apple.quicktime.location.ISO6709"):
+            val = tags.get(gps_tag)
+            if val:
+                gps_str = str(val[0] if isinstance(val, list) else val).strip()
+                m = _re.match(r'([+-]\d+\.?\d*)([+-]\d+\.?\d*)', gps_str)
+                if m:
+                    lat = round(float(m.group(1)), 7)
+                    lon = round(float(m.group(2)), 7)
+                    result["gps"] = {
+                        "lat":      lat,
+                        "lon":      lon,
+                        "lat_ref":  "N" if lat >= 0 else "S",
+                        "lon_ref":  "E" if lon >= 0 else "W",
+                        "maps_url": f"https://www.google.com/maps?q={lat},{lon}",
+                    }
+                    result["has_pii"] = True
+                break
+    except Exception:
+        pass
+
+
+def _extract_avi_info(content: bytes, result: dict) -> None:
+    """Populate result dict from RIFF INFO list chunk in an AVI file."""
+    try:
+        import struct
+        if len(content) < 12 or content[:4] != b"RIFF":
+            return
+        # Walk top-level RIFF chunks looking for the INFO LIST
+        i = 12
+        while i + 8 <= len(content):
+            chunk_id   = content[i:i+4]
+            chunk_size = struct.unpack_from("<I", content, i + 4)[0]
+            if chunk_id == b"LIST" and content[i+8:i+12] == b"INFO":
+                _parse_riff_info(content, i + 12, i + 8 + chunk_size, result)
+                break
+            i += 8 + chunk_size + (chunk_size & 1)  # RIFF chunks are word-aligned
+    except Exception:
+        pass
+
+
+def _parse_riff_info(content: bytes, start: int, end: int, result: dict) -> None:
+    import struct
+    _info_labels = {
+        b"INAM": "Title",
+        b"IART": "Artist",
+        b"ICMT": "Comment",
+        b"ISBJ": "Subject",
+        b"ICRD": "Date",
+    }
+    i = start
+    while i + 8 <= end and i + 8 <= len(content):
+        sub_id   = content[i:i+4]
+        sub_size = struct.unpack_from("<I", content, i + 4)[0]
+        label    = _info_labels.get(sub_id)
+        if label:
+            raw = content[i+8 : i+8+sub_size]
+            val = raw.decode("utf-8", errors="replace").strip("\x00 ")
+            if val and len(val) >= _EXIF_PII_MIN_LEN:
+                result["pii_fields"][label] = val
+                result["has_pii"] = True
+                if label == "Artist" and not result["author"]:
+                    result["author"] = val
+                if label == "Date" and not result["datetime"]:
+                    result["datetime"] = val
+        i += 8 + sub_size + (sub_size & 1)
+
+
+def _extract_audio_metadata(content: bytes, filename: str) -> dict:
+    """Extract PII-bearing tags from an audio file.
+
+    Returns the same structure as _extract_exif / _extract_video_metadata.
+    No GPS extraction — GPS is not embedded in audio containers in practice.
+
+    Uses mutagen.File(easy=True) which normalises tags to lowercase keys for
+    MP3 (ID3), M4A/AAC (MPEG-4), FLAC, OGG Vorbis, and AIFF.  WMA/ASF tags
+    use mixed-case keys (e.g. "Title", "Author") — these are lowercased during
+    normalisation so the same extraction logic covers all formats.
+    """
+    result: dict = {"gps": None, "pii_fields": {}, "author": None,
+                    "datetime": None, "device": None, "has_pii": False}
+    try:
+        import mutagen
+        f = mutagen.File(fileobj=io.BytesIO(content), filename=filename, easy=True)
+        if not f or not f.tags:
+            return result
+
+        # Normalise all tags to {lowercase_key: str_value} regardless of format
+        def _strval(v):
+            return str(v[0] if isinstance(v, list) and v else v).strip()
+
+        tags: dict[str, str] = {
+            k.lower(): _strval(v) for k, v in f.tags.items()
+        }
+
+        # Fields that may contain personal names or descriptions
+        _pii_keys = {
+            "title":           "Title",
+            "artist":          "Artist",
+            "albumartist":     "Album Artist",
+            "composer":        "Composer",
+            "lyricist":        "Lyricist",
+            "conductor":       "Conductor",
+            "author":          "Author",
+            "copyright":       "Copyright",
+            "comment":         "Comment",
+            "description":     "Description",
+            # WMA/ASF mixed-case keys survive as lowercase after normalisation
+            "wm/albumartist":  "Album Artist",
+            "wm/composer":     "Composer",
+            "wm/conductor":    "Conductor",
+            "wm/lyrics":       "Lyrics",
+        }
+        seen: set[str] = set()  # avoid duplicate label entries
+        for key, label in _pii_keys.items():
+            val = tags.get(key, "")
+            if val and len(val) >= _EXIF_PII_MIN_LEN and label not in seen:
+                result["pii_fields"][label] = val
+                result["has_pii"] = True
+                seen.add(label)
+
+        # Author — most specific personal name field wins
+        for key in ("artist", "author", "albumartist", "wm/albumartist", "composer"):
+            val = tags.get(key, "")
+            if val and len(val) >= _EXIF_PII_MIN_LEN:
+                result["author"] = val
+                break
+
+        # Recording / release date
+        for key in ("date", "year", "wm/year"):
+            val = tags.get(key, "")
+            if val:
+                result["datetime"] = val
+                break
+
+    except Exception:
+        pass
+
+    return result
+

    """Detect faces in an image file using OpenCV Haar cascades.

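
The RIFF walk used by `_extract_avi_info` and `_parse_riff_info` in this file can be tried on a hand-built sample without any AVI on disk. The sketch below is a condensed, standalone variant, not the committed code: `riff_chunk` and `read_info` are hypothetical names, and the sample bytes contain only an INFO list (no video streams).

```python
import struct


def riff_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Build one RIFF chunk: 4-byte id, little-endian size, data, pad to even length."""
    pad = b"\x00" if len(data) & 1 else b""
    return chunk_id + struct.pack("<I", len(data)) + data + pad


def read_info(content: bytes) -> dict:
    """Return {sub-chunk id: text} from the first top-level LIST/INFO chunk."""
    found = {}
    if len(content) < 12 or content[:4] != b"RIFF":
        return found
    i = 12  # skip "RIFF", the total size, and the form type (e.g. "AVI ")
    while i + 8 <= len(content):
        chunk_id = content[i:i + 4]
        size = struct.unpack_from("<I", content, i + 4)[0]
        if chunk_id == b"LIST" and content[i + 8:i + 12] == b"INFO":
            j, end = i + 12, i + 8 + size
            while j + 8 <= end:
                sub_id = content[j:j + 4]
                sub_size = struct.unpack_from("<I", content, j + 4)[0]
                raw = content[j + 8:j + 8 + sub_size]
                text = raw.decode("utf-8", "replace").strip("\x00 ")
                found[sub_id.decode("latin-1")] = text
                j += 8 + sub_size + (sub_size & 1)  # sub-chunks are word-aligned
            break
        i += 8 + size + (size & 1)  # top-level chunks are word-aligned too
    return found


info = riff_chunk(b"INAM", b"Staff meeting\x00") + riff_chunk(b"IART", b"Jane Doe\x00")
body = b"AVI " + riff_chunk(b"LIST", b"INFO" + info)
sample = b"RIFF" + struct.pack("<I", len(body)) + body

print(read_info(sample))
```

The `IART` value is 9 bytes long, so the sample also exercises the odd-length padding step that both the builder and the reader must agree on.
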
--- a/gdpr_scanner.py
+++ b/gdpr_scanner.py
@@ -260,8 +260,8 @@ import sse as _sse_mod  # for _current_scan_id access at call time
 from cpr_detector import (
    _scan_bytes, _scan_bytes_timeout, _scan_text_direct, _html_esc, _get_pii_counts,
    _make_thumb, _placeholder_svg,
-    _extract_exif, _detect_photo_faces,
-    SUPPORTED_EXTS, PHOTO_EXTS,
+    _extract_exif, _extract_video_metadata, _extract_audio_metadata, _detect_photo_faces,
+    SUPPORTED_EXTS, PHOTO_EXTS, VIDEO_EXTS, AUDIO_EXTS,
    _EXIF_PII_TAGS,
 )
 # Inject runtime deps into cpr_detector
@@ -285,12 +285,16 @@ _se.FILE_SCANNER_OK  = FILE_SCANNER_OK
 _se.CONNECTOR_OK     = CONNECTOR_OK
 _se.DB_OK            = DB_OK
 _se.PHOTO_EXTS       = PHOTO_EXTS
+_se.VIDEO_EXTS       = VIDEO_EXTS
+_se.AUDIO_EXTS       = AUDIO_EXTS
 _se.SUPPORTED_EXTS   = SUPPORTED_EXTS
 # cpr helpers
 _se._scan_bytes              = _scan_bytes
 _se._scan_bytes_timeout      = _scan_bytes_timeout
 _se._detect_photo_faces      = _detect_photo_faces
 _se._extract_exif            = _extract_exif
+_se._extract_video_metadata  = _extract_video_metadata
+_se._extract_audio_metadata  = _extract_audio_metadata
 _se._make_thumb              = _make_thumb
 _se._placeholder_svg         = _placeholder_svg
 _se._check_special_category  = _check_special_category
--- a/requirements.txt
+++ b/requirements.txt
@@ -13,10 +13,11 @@ pdfplumber>=0.11       # PDF text extraction
 python-docx>=1.1       # Word document scanning
 openpyxl>=3.1          # Excel scanning + export

-# ── Image processing ──────────────────────────────────────────────────────────
+# ── Image / video processing ─────────────────────────────────────────────────
 Pillow>=10.0           # Image thumbnails + EXIF extraction (always-on)
 opencv-python>=4.9     # Face detection (opt-in — Scan photos for faces)
 numpy>=1.26            # Required by opencv-python
+mutagen>=1.47          # Video/audio tag extraction (MP4/MOV/M4V + audio tags: GPS, author, title)

 # ── NER / PII detection ───────────────────────────────────────────────────────
 # spaCy 3.7 supports Python 3.8–3.12. Do NOT upgrade past Python 3.12.
--- a/scan_engine.py
+++ b/scan_engine.py
@@ -99,6 +99,8 @@ except ImportError:
 # Stubs for standalone import — overwritten by gdpr_scanner.py injections
 LANG: dict = {}
 PHOTO_EXTS: set = set()
+VIDEO_EXTS: set = set()
+AUDIO_EXTS: set = set()
 SUPPORTED_EXTS: set = set()

 # cpr_detector helpers — injected by gdpr_scanner.py
@@ -106,6 +108,8 @@ def _scan_bytes(content, filename, poppler_path=None): return {"cprs": [], "date
 def _scan_bytes_timeout(content, filename, timeout=60): return {"cprs": [], "dates": []}  # type: ignore[misc]
 def _detect_photo_faces(content, filename): return 0  # type: ignore[misc]
 def _extract_exif(content, filename): return {}  # type: ignore[misc]
+def _extract_video_metadata(content, filename): return {}  # type: ignore[misc]
+def _extract_audio_metadata(content, filename): return {}  # type: ignore[misc]
 def _make_thumb(content, filename): return ""  # type: ignore[misc]
 def _placeholder_svg(ext, name): return ""  # type: ignore[misc]
 def _check_special_category(text, cprs): return []  # type: ignore[misc]
@@ -227,9 +231,9 @@ def run_file_scan(source: dict):

            ext = Path(rel_path).suffix.lower()

-            # CPR scan — skip for images (no text layer; EXIF/face detection handles them)
+            # CPR scan — skip for images, video and audio (no text layer)
            result: dict = {"cprs": [], "dates": []}
-            if ext not in PHOTO_EXTS:
+            if ext not in PHOTO_EXTS and ext not in VIDEO_EXTS and ext not in AUDIO_EXTS:
                try:
                    result = _scan_bytes_timeout(content, rel_path)
                except Exception as e:
@@ -238,13 +242,17 @@ def run_file_scan(source: dict):

            cprs = result.get("cprs", [])

-            # Photo / biometric scan + EXIF extraction
+            # Photo / biometric scan + EXIF/video/audio metadata extraction
            _face_count = 0
            _exif       = {}
            if ext in PHOTO_EXTS:
                if scan_photos:
                    _face_count = _detect_photo_faces(content, rel_path)
                _exif = _extract_exif(content, rel_path)
+            elif ext in VIDEO_EXTS:
+                _exif = _extract_video_metadata(content, rel_path)
+            elif ext in AUDIO_EXTS:
+                _exif = _extract_audio_metadata(content, rel_path)

            # Apply filters: distinct CPR threshold and GPS suppression
            _distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs))
@@ -1084,16 +1092,23 @@ def run_scan(options: dict):
                    content = conn.download_drive_item_for(uid, item_id)
                else:
                    content = conn.download_item(meta)
-                result  = _scan_bytes(content, name)
+
+                # CPR scan — skip for video and audio (metadata-only; no text layer)
+                _media_only = ext in VIDEO_EXTS or ext in AUDIO_EXTS
+                result = {"cprs": [], "dates": []} if _media_only else _scan_bytes(content, name)
                cprs   = result.get("cprs", [])

-                # ── Biometric photo scan (#9) + EXIF (#18) ───────────────
+                # ── Biometric photo scan (#9) + EXIF/video/audio metadata (#18) ─
                _face_count = 0
                _exif       = {}
                if ext in PHOTO_EXTS:
                    if scan_photos:
                        _face_count = _detect_photo_faces(content, name)
                    _exif = _extract_exif(content, name)
+                elif ext in VIDEO_EXTS:
+                    _exif = _extract_video_metadata(content, name)
+                elif ext in AUDIO_EXTS:
+                    _exif = _extract_audio_metadata(content, name)

                # Apply filters: distinct CPR threshold and GPS suppression
                _distinct_cprs   = list(dict.fromkeys(c["formatted"] for c in cprs))
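
The local-file loop in this file now routes each file by extension: documents go through the CPR text scan, photos through EXIF, and video/audio through the new metadata extractors. A condensed sketch of that routing, modelled on `run_file_scan`, is shown below. The `fake_*` functions and the `route` helper are hypothetical placeholders, and the extension sets are shortened for the example.

```python
from pathlib import Path

PHOTO_EXTS = {".jpg", ".png"}
VIDEO_EXTS = {".mp4", ".mov", ".avi"}
AUDIO_EXTS = {".mp3", ".flac"}


def fake_text_scan(content, name):   # placeholder for _scan_bytes_timeout
    return {"cprs": [], "dates": []}


def fake_exif(content, name):        # placeholder for _extract_exif
    return {"source": "exif"}


def fake_video_meta(content, name):  # placeholder for _extract_video_metadata
    return {"source": "video"}


def fake_audio_meta(content, name):  # placeholder for _extract_audio_metadata
    return {"source": "audio"}


def route(content: bytes, name: str) -> tuple:
    """Return (text_scanned, metadata) for one file."""
    ext = Path(name).suffix.lower()
    # The text scan is skipped for photos, video and audio (no text layer).
    text_scanned = ext not in (PHOTO_EXTS | VIDEO_EXTS | AUDIO_EXTS)
    if text_scanned:
        fake_text_scan(content, name)
    if ext in PHOTO_EXTS:
        meta = fake_exif(content, name)
    elif ext in VIDEO_EXTS:
        meta = fake_video_meta(content, name)
    elif ext in AUDIO_EXTS:
        meta = fake_audio_meta(content, name)
    else:
        meta = {}
    return text_scanned, meta


print(route(b"", "clip.MP4"))
print(route(b"", "notes.txt"))
```

Because all three extractors return the same dictionary shape, the code after the branch does not need to know which kind of file produced the metadata.
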
--- a/tests/fixtures/local_files/09_cpr_in_docx.docx
+++ b/tests/fixtures/local_files/09_cpr_in_docx.docx
--- a/tests/fixtures/local_files/13_cpr_in_xlsx.xlsx
+++ b/tests/fixtures/local_files/13_cpr_in_xlsx.xlsx
--- a/tests/fixtures/local_files/14_audio_artist_pii.mp3
+++ b/tests/fixtures/local_files/14_audio_artist_pii.mp3
--- a/tests/fixtures/local_files/15_audio_artist_pii.flac
+++ b/tests/fixtures/local_files/15_audio_artist_pii.flac
--- a/tests/fixtures/local_files/16_audio_no_pii.mp3
+++ b/tests/fixtures/local_files/16_audio_no_pii.mp3
--- a/tests/fixtures/local_files/17_audio_no_pii.flac
+++ b/tests/fixtures/local_files/17_audio_no_pii.flac
--- a/tests/fixtures/local_files/18_video_gps.mp4
+++ b/tests/fixtures/local_files/18_video_gps.mp4
--- a/tests/fixtures/local_files/19_video_no_pii.mp4
+++ b/tests/fixtures/local_files/19_video_no_pii.mp4
--- a/tests/fixtures/local_files/generate_fixtures.py
+++ b/tests/fixtures/local_files/generate_fixtures.py
@@ -4,7 +4,26 @@ Generate binary fixture files for the local-file GDPR scan test suite.
 Run from repo root:
    source venv/bin/activate
    python tests/fixtures/local_files/generate_fixtures.py
+
+Fixtures produced
+─────────────────
+Document fixtures (require python-docx + openpyxl):
+  09_cpr_in_docx.docx   — Word document with 2 CPR numbers          → Flag
+  13_cpr_in_xlsx.xlsx   — Excel workbook with CPR numbers            → Flag
+
+Audio fixtures (require mutagen):
+  14_audio_artist_pii.mp3  — MP3 with artist/title tags (personal name)    → Flag
+  15_audio_artist_pii.flac — FLAC with artist/title Vorbis comments        → Flag
+  16_audio_no_pii.mp3      — MP3 with no metadata tags                     → No flag
+  17_audio_no_pii.flac     — FLAC with no metadata                         → No flag
+
+Video fixtures (require mutagen):
+  18_video_gps.mp4      — MP4 with GPS coordinates + artist tag       → Flag
+  19_video_no_pii.mp4   — MP4 with no metadata tags                   → No flag
 """
+import struct
+import tempfile
+import os
 from pathlib import Path
 import sys

@@ -19,6 +38,7 @@ def _require(pkg):

 openpyxl = _require("openpyxl")
 docx = _require("docx")
+_require("mutagen")

 from openpyxl import Workbook
 from openpyxl.styles import Font, PatternFill, Alignment
@@ -148,7 +168,180 @@ def make_xlsx():
    print(f"Written: {out.name}")


+# ── Audio / video helpers ─────────────────────────────────────────────────────
+
+# Two silent MPEG1 Layer3 frames (128 kbps / 44100 Hz / stereo).
+# mutagen needs at least 2 consecutive frame headers to confirm sync.
+# 4-byte header + 413 bytes frame body = 417 bytes × 2 = 834 bytes total.
+_MPEG_FRAMES = (b'\xff\xfb\x90\x00' + b'\x00' * 413) * 2
+
+
+def _flac_block_header(block_type: int, data_len: int, last: bool = False) -> bytes:
+    first = (0x80 if last else 0x00) | block_type
+    return bytes([first, (data_len >> 16) & 0xFF, (data_len >> 8) & 0xFF, data_len & 0xFF])
+
+
+def _vorbis_comment_block(comments: dict) -> bytes:
+    vendor = b'GDPRScanner fixture'
+    data = struct.pack('<I', len(vendor)) + vendor
+    data += struct.pack('<I', len(comments))
+    for key, value in comments.items():
+        entry = f'{key}={value}'.encode('utf-8')
+        data += struct.pack('<I', len(entry)) + entry
+    return data
+
+
+def _minimal_flac(comments: dict) -> bytes:
+    """Return bytes for a minimal FLAC file; a Vorbis comment block is added only if comments is non-empty."""
+    # STREAMINFO (34 bytes): 44100 Hz, mono, 16-bit, 0 samples, zero MD5.
+    si = bytearray(34)
+    si[0:2] = struct.pack('>H', 4096)   # min block size
+    si[2:4] = struct.pack('>H', 4096)   # max block size
+    # bytes 4-9: min/max frame sizes = 0 (unknown)
+    # Bits 80-99: sample_rate=44100 (0xAC44 in 20-bit field)
+    # Bits 100-102: channels-1 = 0 (mono)
+    # Bits 103-107: bits_per_sample-1 = 15 (16-bit)
+    # Bits 108-143: total_samples = 0; bytes 14-17 remain zero
+    si[10] = 0x0A   # 0000_1010 — top 8 of 44100 in 20-bit field
+    si[11] = 0xC4   # 1100_0100
+    si[12] = 0x40   # bottom 4 of sample_rate | channels(000) | bps_msb(0)
+    si[13] = 0xF0   # bps remaining 4 bits (1111) | top 4 of total_samples (0)
+
+    vc = _vorbis_comment_block(comments)
+    return (
+        b'fLaC'
+        + _flac_block_header(0, 34, last=not comments)  # STREAMINFO
+        + bytes(si)
+        + (_flac_block_header(4, len(vc), last=True) + vc if comments else b'')
+    )
+
+
+def _mp4_atom(name: bytes, data: bytes) -> bytes:
+    return struct.pack('>I', 8 + len(data)) + name + data
+
+
+def _minimal_mp4_base() -> bytes:
+    """Return bytes for the smallest valid MPEG-4 container mutagen can tag."""
+    # ftyp — identifies the file as M4A
+    ftyp = _mp4_atom(
+        b'ftyp',
+        b'M4A ' + struct.pack('>I', 0) + b'M4A ' + b'mp42' + b'isom',
+    )
+    # mvhd version 0 — 100 bytes of content (ISO 14496-12 §8.2.2)
+    mvhd = bytearray(100)
+    mvhd[0:4] = b'\x00\x00\x00\x00'                          # version + flags
+    struct.pack_into('>IIII', mvhd, 4, 0, 0, 1000, 0)        # creation, modification, timescale, duration
+    struct.pack_into('>I', mvhd, 20, 0x00010000)              # rate = 1.0 (follows the 4-byte duration)
+    struct.pack_into('>H', mvhd, 24, 0x0100)                  # volume = 1.0
+    # bytes 26-35: reserved (10 bytes, already zero)
+    struct.pack_into('>9i', mvhd, 36,                         # unity matrix
+        0x00010000, 0, 0, 0, 0x00010000, 0, 0, 0, 0x40000000)
+    # bytes 72-95: pre-defined (24 bytes, already zero)
+    struct.pack_into('>I', mvhd, 96, 0xFFFFFFFF)              # next_track_ID
+
+    return ftyp + _mp4_atom(b'moov', _mp4_atom(b'mvhd', bytes(mvhd)))
+
+
+def _mp4_with_tags(tags: dict) -> bytes:
+    """Return bytes for a minimal MP4 with the given mutagen tag dict."""
+    import mutagen.mp4
+    tmp = tempfile.mktemp(suffix='.mp4')
+    try:
+        with open(tmp, 'wb') as fh:
+            fh.write(_minimal_mp4_base())
+        f = mutagen.mp4.MP4(tmp)
+        f.add_tags()
+        for key, value in tags.items():
+            f.tags[key] = [value]
+        f.save()
+        with open(tmp, 'rb') as fh:
+            return fh.read()
+    finally:
+        if os.path.exists(tmp):
+            os.unlink(tmp)
+
+
+# ── 14_audio_artist_pii.mp3 ───────────────────────────────────────────────────
+def make_mp3_pii():
+    from mutagen.easyid3 import EasyID3
+    tmp = tempfile.mktemp(suffix='.mp3')
+    try:
+        t = EasyID3()
+        t['artist'] = ['Emma Slot Henriksen']
+        t['title']  = ['Fortrolig optagelse — personalemøde']
+        t['date']   = ['2026-04-21']
+        t.save(tmp)
+        with open(tmp, 'rb') as fh:
+            id3_bytes = fh.read()
+    finally:
+        if os.path.exists(tmp):
+            os.unlink(tmp)
+
+    out = HERE / '14_audio_artist_pii.mp3'
+    out.write_bytes(id3_bytes + _MPEG_FRAMES)
+    print(f"Written: {out.name}")
+
+
+# ── 15_audio_artist_pii.flac ──────────────────────────────────────────────────
+def make_flac_pii():
+    out = HERE / '15_audio_artist_pii.flac'
+    out.write_bytes(_minimal_flac({
+        'ARTIST': 'Emma Slot Henriksen',
+        'TITLE':  'Fortrolig optagelse — personalemøde',
+        'DATE':   '2026-04-21',
+    }))
+    print(f"Written: {out.name}")
+
+
+# ── 16_audio_no_pii.mp3 ───────────────────────────────────────────────────────
+def make_mp3_no_pii():
+    from mutagen.easyid3 import EasyID3
+    tmp = tempfile.mktemp(suffix='.mp3')
+    try:
+        EasyID3().save(tmp)  # empty ID3 header, no tags
+        with open(tmp, 'rb') as fh:
+            id3_bytes = fh.read()
+    finally:
+        if os.path.exists(tmp):
+            os.unlink(tmp)
+
+    out = HERE / '16_audio_no_pii.mp3'
+    out.write_bytes(id3_bytes + _MPEG_FRAMES)
+    print(f"Written: {out.name}")
+
+
+# ── 17_audio_no_pii.flac ──────────────────────────────────────────────────────
+def make_flac_no_pii():
+    out = HERE / '17_audio_no_pii.flac'
+    out.write_bytes(_minimal_flac({}))   # no Vorbis comment block
+    print(f"Written: {out.name}")
+
+
+# ── 18_video_gps.mp4 ─────────────────────────────────────────────────────────
+def make_mp4_gps():
+    out = HERE / '18_video_gps.mp4'
+    out.write_bytes(_mp4_with_tags({
+        '©xyz': '+55.6761+012.5683+000.000/',   # Copenhagen
+        '©ART': 'Emma Slot Henriksen',
+        '©nam': 'Optagelse fra skolegården',
+    }))
+    print(f"Written: {out.name}")
+
+
+# ── 19_video_no_pii.mp4 ──────────────────────────────────────────────────────
+def make_mp4_no_pii():
+    out = HERE / '19_video_no_pii.mp4'
+    out.write_bytes(_minimal_mp4_base())   # no moov/udta/meta/ilst — no tags
+    print(f"Written: {out.name}")
+
+
 if __name__ == "__main__":
    make_docx()
    make_xlsx()
+    make_mp3_pii()
+    make_flac_pii()
+    make_mp3_no_pii()
+    make_flac_no_pii()
+    make_mp4_gps()
+    make_mp4_no_pii()
    print("Done.")
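
Two byte-layout details in the fixture generator can be checked by arithmetic: the 417-byte MP3 frame length behind `_MPEG_FRAMES`, and the four STREAMINFO bytes written at offsets 10 to 13 in `_minimal_flac`. The sketch below recomputes both with the standard library. `decode_mp3_header` and `pack_streaminfo_bits` are hypothetical helpers, and the lookup tables cover only the values used by the fixtures.

```python
# MPEG-1 Layer III lookup tables (only the entries needed for the fixture header).
BITRATES_KBPS = {0b1001: 128}
SAMPLE_RATES = {0b00: 44100}
CHANNEL_MODES = {0b00: "stereo", 0b01: "joint stereo", 0b10: "dual channel", 0b11: "mono"}


def decode_mp3_header(header: bytes) -> dict:
    """Decode the fields of a 4-byte MPEG-1 Layer III frame header."""
    if header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        raise ValueError("missing frame sync")
    b2, b3 = header[2], header[3]
    bitrate = BITRATES_KBPS[b2 >> 4] * 1000
    sample_rate = SAMPLE_RATES[(b2 >> 2) & 0b11]
    padding = (b2 >> 1) & 1
    return {
        "bitrate": bitrate,
        "sample_rate": sample_rate,
        "channel_mode": CHANNEL_MODES[b3 >> 6],
        # Layer III frame length in bytes, header included.
        "frame_bytes": 144 * bitrate // sample_rate + padding,
    }


def pack_streaminfo_bits(sample_rate: int, channels: int, bits_per_sample: int) -> bytes:
    """Pack sample rate (20 bits), channels-1 (3), bps-1 (5), and 4 zero bits."""
    word = (sample_rate << 12) | ((channels - 1) << 9) | ((bits_per_sample - 1) << 4)
    return word.to_bytes(4, "big")


print(decode_mp3_header(b"\xff\xfb\x90\x00"))
print(pack_streaminfo_bits(44100, 1, 16).hex())  # 0ac440f0
```

The frame length comes out as 417 bytes, matching the 4-byte header plus 413-byte body used for each silent frame, and the packed FLAC word matches the `0x0A 0xC4 0x40 0xF0` sequence in the generator.
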