GDPRScanner/CONTRIBUTING.md
2026-04-12 14:49:28 +02:00

4.0 KiB

Contributing to GDPR Scanner

Thank you for considering a contribution. This project helps organisations find and manage personal data across Microsoft 365 (Exchange, OneDrive, SharePoint, Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems. Contributions that improve compliance coverage, reliability, and usability are very welcome.


Before You Start

  • Check the open issues to see if your idea is already tracked
  • For large features, open an issue first to discuss the approach — this avoids wasted effort if the direction doesn't fit
  • Security vulnerabilities: see SECURITY.md — do not file public issues

Development Setup

# Clone and set up a virtual environment
git clone https://github.com/your-org/gdpr-scanner.git
cd gdpr-scanner
python3 -m venv venv
source venv/bin/activate          # macOS / Linux
venv\Scripts\activate             # Windows

pip install -r requirements.txt

# Danish NER model (optional — needed for name/address detection)
python -m spacy download da_core_news_lg

# Start the scanner (serves on http://0.0.0.0:5100)
python gdpr_scanner.py

# Run the test suite
python -m pytest tests/ -q

To test against a real M365 tenant you will need a Microsoft Azure app registration with the permissions described in the README. A free developer tenant is available via the Microsoft 365 Developer Program.


What We Welcome

  • Bug fixes
  • Improved CPR false-positive reduction
  • New language files (see lang/en.json for the key list)
  • Performance improvements for large tenants
  • Docker / deployment improvements
  • Documentation fixes

Code Style

Python

  • Follow PEP 8 with a line length of 100
  • Use type hints for function signatures
  • No external formatters are enforced — just keep it consistent with the surrounding code
  • All personal data (CPR numbers) must be SHA-256 hashed before storage — never store or log raw CPR values
  • Wrap Graph API calls in try/except and handle M365PermissionError gracefully

JavaScript (static/js/*.js — ES modules)

  • const / let — no var
  • async/await over .then() chains
  • All user-visible strings must have a data-i18n key so translations work

SQL

  • Use parameterised queries — never string-format SQL
  • New columns on existing tables must have a corresponding migration in _MIGRATIONS in gdpr_db.py

Adding a Language

  1. Copy lang/en.json to lang/xx.json (ISO 639-1 code)
  2. Translate all values — keys must stay identical
  3. Test by writing xx to ~/.gdprscanner/lang and restarting

Pull Request Process

  1. Fork the repository and create a branch: git checkout -b feature/my-feature
  2. Make your changes and test them
  3. Run the test suite: python -m pytest tests/ -q
  4. Run a syntax check on the modules you touched, e.g.: python -m py_compile gdpr_scanner.py scan_engine.py app_config.py gdpr_db.py
  5. Update README.md if your change adds or changes user-visible behaviour
  6. Open a pull request with a clear description of what it does and why
  7. Link to the relevant issue if applicable

We aim to review pull requests within one week.


Personal Data in Tests and Examples

Do not include real CPR numbers, email addresses, or names in test data, example output, or documentation. Use clearly fictional values:

# Good
test_cpr = "010101-1234"   # fictional — fails Modulus 11 check

# Bad
test_cpr = "150385-1234"   # could be a real person

If you are testing with a real Microsoft 365 tenant, ensure you have appropriate authorisation to access that data.


Contributor License Agreement

By submitting a pull request you confirm that:

  • You wrote the contribution yourself or have the right to submit it
  • You license your contribution under the same AGPL-3.0 terms as this project
  • You understand the disclaimer in LICENSE — this is a compliance tool, not legal advice

Code of Conduct

Be respectful. Harassment of any kind will not be tolerated.