From 66986a16f9f963a5f21f89865de53dbd6f592c25 Mon Sep 17 00:00:00 2001
From: StyxX65 <150797939+StyxX65@users.noreply.github.com>
Date: Thu, 28 May 2026 17:53:53 +0200
Subject: [PATCH] ※ recap: Extended in-place CPR redaction to Google Drive, SFTP, SMB, and local PDFs, then updated CLAUDE.md and both manuals. Everything is committed and all 201 tests pass. (disable recaps in /config)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 CLAUDE.md                 | 10 ++++++++--
 docs/manuals/MANUAL-DA.md | 20 ++++++++++++++++++--
 docs/manuals/MANUAL-EN.md | 20 ++++++++++++++++++--
 3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 6162762..d958ebb 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -130,8 +130,14 @@ Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
 - **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
 - **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
 - **Role-filtered exports** — `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them.
-- **`POST /api/redact_item`** — rewrites a local file in-place with CPR numbers replaced by `██████-████` / `█` blocks, then removes the card from the grid and logs a `"redacted"` disposition. Supported extensions: `.docx`, `.xlsx`, `.csv`, `.txt`, `.pdf` (`_REDACT_EXTS`). The file is written to a temp path in the **same directory** as the original before `shutil.move` — this avoids cross-device rename failures on mounted volumes. Uses existing `document_scanner` functions (`redact_docx`, `redact_xlsx`, `redact_csv`, `find_pii_spans_in_text`, `scan_pdf`, `redact_pdf_secure`). Only works for `source_type == "local"` — SMB/cloud files are not supported (button is hidden on those cards). The button (`✂`, class `card-redact-btn`) appears in `appendCard` when `_redactable(f)` is true; hidden in viewer mode and for resolved items.
-- **PDF redaction** — `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` which physically removes text data from the PDF stream (not just an overlay). Falls back to `redact_pdf` (reportlab overlay) if PyMuPDF is absent. Text-based pages use `find_cpr_char_bboxes`; scanned pages render via OCR at 200 DPI and use `find_cpr_image_bboxes`. Raises `RuntimeError` if both backends are unavailable. Do not add `.pdf` to `_redactExts` in `results.js` without also handling it in `export.py` — the button and the route must stay in sync.
+- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, and logs a `"redacted"` disposition. Supported source types and extensions:
+  - **`local`** — DOCX, XLSX, CSV, TXT, PDF. File is written to a temp path in the same directory then `shutil.move`d (avoids cross-device rename).
+  - **`onedrive` / `sharepoint` / `teams`** — DOCX, XLSX, PDF. Downloaded via Graph, redacted locally, re-uploaded via `put_drive_item_content()` (PUT with `Content-Type: application/octet-stream`).
+  - **`gdrive`** — DOCX, XLSX, PDF. MIME type checked first — Google-native Docs/Sheets (exported as DOCX during scan) are refused with a clear message. Downloaded via `download_drive_file_by_id()`, redacted, uploaded back via `update_drive_file()` (`files().update()`). Requires `drive` scope (not `drive.readonly`) on the service-account delegation.
+  - **`sftp`** — DOCX, XLSX, CSV, TXT, PDF. Source config matched from `_load_file_sources()` by `sftp_host` + `sftp_user` parsed from `item_meta["account_name"]` (the `sftp://user@host/root` URI). Requires the item to still be in `state.flagged_items` — `account_name` is not persisted to the DB. Read/write via `SFTPScanner.read_file()` / `write_file()`.
+  - **`smb`** — DOCX, XLSX, CSV, TXT, PDF. Host + share parsed from `full_path` (`//host/share/…`); source config matched from `_load_file_sources()`. Written back via `file_scanner.write_smb_file()` with `CreateDisposition.FILE_SUPERSEDE`.
+  - The ✂ button (`card-redact-btn`) appears in `appendCard` via `_redactable` logic in `results.js`; hidden in viewer mode and for resolved items. **Keep `_redactExts` / `_cloudRedactExts` in `results.js` and `_REDACT_EXTS` / `_GDRIVE_MIME_MAP` / `_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
+- **PDF redaction** — `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` which physically removes text data from the PDF stream (not just an overlay). Falls back to `redact_pdf` (reportlab overlay) if PyMuPDF is absent. Text-based pages use `find_cpr_char_bboxes`; scanned pages render via OCR at 200 DPI and use `find_cpr_image_bboxes`. Raises `RuntimeError` if both backends are unavailable.
 
 ## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py
 
diff --git a/docs/manuals/MANUAL-DA.md b/docs/manuals/MANUAL-DA.md
index f7cc665..36a2a44 100644
--- a/docs/manuals/MANUAL-DA.md
+++ b/docs/manuals/MANUAL-DA.md
@@ -292,12 +292,28 @@ Hvert element har en **Disposition**-rullemenu i forhåndsvisningspanelet. Vælg
 
 Klik på **Gem** efter valget. En lille **✓ Gemt**-bekræftelse vises.
 
-### Redigér en lokal fil
+### Redigér en fil på stedet
 
-For lokale DOCX-, XLSX-, CSV-, TXT- og PDF-filer vises en **✂**-knap på kortet. Klikker du på den, overskrives filen på stedet, og alle CPR-numre erstattes med `██████-████`-blokke. Kortet fjernes fra gitteret, og handlingen registreres som en `"redacted"`-disposition. Brug denne mulighed, når du ønsker at anonymisere en fil frem for at slette den helt. Knappen er ikke tilgængelig for e-mails, cloud-filer eller SFTP-filer.
+En **✂**-knap vises på resultatkort, hvor scanneren kan overskrive filen direkte. Klikker du på den, erstattes alle CPR-numre med `██████-████`-blokke, kortet fjernes fra gitteret, og handlingen registreres som en `"redacted"`-disposition. Brug denne mulighed, når du ønsker at anonymisere en fil frem for at slette den helt.
+
+Knappen er tilgængelig for følgende kildetyper og formater:
+
+| Kilde | Understøttede formater |
+|---|---|
+| Lokale filer | DOCX, XLSX, CSV, TXT, PDF |
+| Netværksdrev (SMB) | DOCX, XLSX, CSV, TXT, PDF |
+| SFTP | DOCX, XLSX, CSV, TXT, PDF |
+| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
+| Google Drev | DOCX, XLSX, PDF |
+
+Knappen er **ikke** tilgængelig for e-mail-elementer (Exchange/Gmail) eller i visningsmode. Google Docs og Sheets, der er eksporteret som DOCX/XLSX under scanning, kan ikke redigeres på stedet — eksportér filen manuelt fra Google først og redigér derefter den hentede kopi.
 
 > **PDF-sikkerhedsnote:** PDF-redigering sker fysisk — CPR-nummerteksten slettes fra PDF-datastrømmen og er ikke blot dækket over med en sort boks. En læser kan ikke gendanne den oprindelige tekst ved at markere under redigeringen eller ved programmatisk inspektion af filen. Billedbaserede (scannede) PDF-filer understøttes også: scanneren lokaliserer CPR-nummeret på sidebilledet via OCR og overskriver det pågældende område fysisk.
 
+> **Google Drev-note:** Redigering i Google Drev kræver `drive`-scopet på servicekontoens domain-wide delegation (ikke blot `drive.readonly`). Hvis redigeringen fejler med en rettighedsfejl, bedes du kontakte din Google Workspace-administrator for at tilføje scopet `https://www.googleapis.com/auth/drive` til servicekontoens delegation i Admin Console.
+
+> **SFTP-note:** SFTP-redigering er kun tilgængelig for elementer fundet i den aktuelle scansession. Gennemfør en ny scanning, hvis du gennemser historiske resultater.
+
 ### Massemarkering af flere elementer på én gang
 
 Hvis du skal anvende den samme disposition på mange elementer, kan du bruge **Vælg-tilstand** i stedet for at åbne hvert kort enkeltvis.
diff --git a/docs/manuals/MANUAL-EN.md b/docs/manuals/MANUAL-EN.md
index 2f1cf6e..24293d6 100644
--- a/docs/manuals/MANUAL-EN.md
+++ b/docs/manuals/MANUAL-EN.md
@@ -292,12 +292,28 @@ Every item has a **Disposition** dropdown in the preview panel. Choose one of:
 
 After choosing, click **Save**. A small **✓ Saved** confirmation appears.
 
-### Redacting a local file
+### Redacting a file in-place
 
-For local DOCX, XLSX, CSV, TXT, and PDF files a **✂** button appears in the card. Clicking it rewrites the file in-place, replacing all CPR numbers with `██████-████` blocks. The card is removed from the grid and the action is logged as a `"redacted"` disposition. This is useful when you want to sanitise a file rather than delete it entirely. The button is not available for email items, cloud files, or SFTP files.
+A **✂** button appears on result cards where the scanner can overwrite the file directly. Clicking it replaces all CPR numbers with `██████-████` blocks, removes the card from the grid, and logs the action as a `"redacted"` disposition. This is useful when you want to sanitise a file rather than delete it entirely.
+
+The button is available for the following source types and formats:
+
+| Source | Supported formats |
+|---|---|
+| Local files | DOCX, XLSX, CSV, TXT, PDF |
+| Network share (SMB) | DOCX, XLSX, CSV, TXT, PDF |
+| SFTP | DOCX, XLSX, CSV, TXT, PDF |
+| OneDrive / SharePoint / Teams | DOCX, XLSX, PDF |
+| Google Drive | DOCX, XLSX, PDF |
+
+The button is **not** available for email items (Exchange/Gmail) or viewer mode. Google Docs and Sheets that were exported as DOCX/XLSX during scanning cannot be redacted in-place — export the file from Google manually first, then redact the downloaded copy.
 
 > **PDF security note:** PDF redaction uses physical removal — the CPR number text is erased from the PDF data stream, not just painted over with a black box. A reader cannot recover the original text by selecting under the redaction or inspecting the file programmatically. Image-based (scanned) PDFs are also supported: the scanner locates the CPR number on the page image via OCR and physically overwrites that region.
 
+> **Google Drive note:** Drive redaction requires the `drive` scope on the service account's domain-wide delegation grant (not just `drive.readonly`). If redaction fails with a permission error, ask your Google Workspace admin to add the `https://www.googleapis.com/auth/drive` scope to the service account delegation in the Admin Console.
+
+> **SFTP note:** SFTP redaction is only available for items found in the current scan session. If you are browsing historical results, re-run the scan first.
+
 ### Bulk tagging multiple items at once
 
 If you need to apply the same disposition to many items, use **Select mode** instead of opening each card individually.
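
For reference, the `local` path described in the CLAUDE.md hunk (write to a temp file in the same directory, then `shutil.move` it over the original) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: `mask_cpr`, `rewrite_in_place`, and the CPR regex are hypothetical, and only plain-text files are handled here, whereas the real route dispatches to the `document_scanner` functions per file type.

```python
import os
import re
import shutil
import tempfile

# Assumption: a CPR number is six digits, an optional hyphen, then four digits.
# The project's real detection lives in document_scanner and may be stricter.
CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")


def mask_cpr(text: str) -> str:
    """Replace each CPR-like match with the block pattern shown in the manuals."""
    return CPR_RE.sub("██████-████", text)


def rewrite_in_place(path: str, transform=mask_cpr) -> None:
    """Write the transformed text to a temp file next to `path`, then move it over."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src:
        redacted = transform(src.read())
    # Same directory as the original, so the final move stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst:
            dst.write(redacted)
        shutil.move(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the move completed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temp file beside the target, rather than in the system temp directory, is what keeps the final move from crossing a filesystem boundary on mounted volumes.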