[Tier 5] DSPM-lite: sensitive-data classification + risk-score boost

## Problem

`SecurityAsset.containsSensitiveData: boolean` exists in the schema and is shown in the UI on the OAuth-grants card, but **nothing populates it**. It's seeded by the demo script and otherwise empty. The asset model has `DATA_RESOURCE`, `WORKSPACE`, `VAULT`, `REPOSITORY` types and `dataAssetTypes` filtering in `internal/bootstrap/security_overview.go` — the **scaffolding is there**, the **classifier is missing**.

DSPM (Data Security Posture Management) is the fastest-growing SSPM adjacency: every buyer wants to know *which findings touch sensitive data* and *which OAuth grants can exfiltrate it*. Even a lightweight version is differentiating.

## Goals

1. **Lightweight data classifier** that samples content in Drive / SharePoint / Box / GitHub for well-known sensitive-data patterns (PII, PHI, PCI, secrets).
2. **Asset-level enrichment**: populate `SecurityAsset.containsSensitiveData` + a new `sensitivityScore` and `dataLabels` set.
3. **Risk-aware overlay**: every existing finding's risk score boosts when the affected asset carries sensitive data; OAuth-grant risk likewise.
4. **Data-flow heatmap**: where in the SaaS estate does sensitive data live, who can read it, and who already has?
5. **Findings filter**: "show me everything touching customer PII."

## Non-goals

- Not building a full DSPM with policy-as-code, fine-grained classification taxonomies, or ML-based content classification — start with regex/dictionary classifiers + provider-native labels where available.
- Not scanning every file on every sync — sampling + provider-native classification labels (Google DLP, M365 sensitivity labels) where they exist.
- Not retaining file contents outside the customer's tenant: classification runs server-side in Aperio's runtime, and only hit metadata and hashes are persisted.

## Proposed design

### Schema extensions

```prisma
enum DataLabel {
  PII
  PHI
  PCI
  SECRET
  CONFIDENTIAL_INTERNAL
  REGULATED_OTHER
}

model SecurityAsset {
  // ... existing ...
  sensitivityScore  Int        @default(0) @map("sensitivity_score")  // 0-100, normalized
  dataLabels        DataLabel[] @default([]) @map("data_labels")
  classifiedAt      DateTime?  @map("classified_at")
  classifierVersion String?    @map("classifier_version") @db.VarChar(40)
}

model DataClassificationSample {
  id              String   @id @default(cuid())
  organizationId  String   @map("organization_id")
  assetId         String   @map("asset_id")
  source          String   @db.VarChar(60)        // "google_drive" | "github" | etc.
  pattern         String   @db.VarChar(120)       // matched classifier ID
  hitCount        Int      @map("hit_count")
  contentExcerptHash String? @map("content_excerpt_hash") @db.VarChar(64) // hash only, never raw content
  sampledAt       DateTime @default(now()) @map("sampled_at")
  organization    Organization @relation(...)
  asset           SecurityAsset @relation(fields: [assetId], references: [id], onDelete: Cascade)
  @@index([organizationId, assetId, sampledAt])
  @@map("data_classification_samples")
}
```

### Classifier pipeline

`internal/dspm/` Go package:

1. **Sampling worker** — for each `SecurityAsset` of type `DATA_RESOURCE`/`WORKSPACE`/`VAULT`/`REPOSITORY`, sample N files (configurable) per scan cycle.
2. **Provider-native first** — if Google DLP / M365 Sensitivity Labels / Box Shield labels already exist on the object, take them as authoritative and skip regex.
3. **Regex/dictionary classifier** with built-in patterns:
   - PII: SSN, NI number, EU national IDs, passport numbers, IBAN/SWIFT
   - PHI: ICD-10 codes, MRN patterns, NDC drug codes
   - PCI: PAN (Luhn-validated), CVV, magnetic stripe
   - Secrets: AWS access keys, GCP service-account JSON, GitHub PATs, Stripe keys, OpenAI keys, Slack tokens, Twilio tokens (reuse `trufflehog`-style detector list)
   - Confidential internal: dictionary terms ("acquisition", "M&A", "salary band") — operator-tunable
4. **Hash, don't store** — only persist sample-hit metadata + content excerpt hash, never the raw matched content (data-residency safe).
5. **Score computation** — `sensitivityScore = f(unique labels, hit density, asset exposure)`.
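
Steps 3 and 4 above can be sketched in Go roughly as follows. This is a minimal illustration, not the `internal/dspm/` implementation: the `Hit` type, the `pci.pan` classifier ID, `classifyPAN`, and the per-tenant HMAC key are all assumptions made for the example. It covers only the PAN case: a loose regex proposes candidates, a Luhn check discards most false positives, and only a keyed hash of the first match is kept.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// Hit is a hypothetical in-memory shape for one classifier result.
// It carries only metadata and a hash, never the matched text.
type Hit struct {
	Pattern     string // classifier ID (assumed naming, e.g. "pci.pan")
	Count       int    // number of validated matches in the sample
	ExcerptHash string // hex HMAC-SHA256 of the first validated match
}

// panCandidate matches runs of 13-19 digits, optionally separated by
// single spaces or dashes. It is deliberately loose; Luhn does the filtering.
var panCandidate = regexp.MustCompile(`\b\d(?:[ -]?\d){12,18}\b`)

var stripSeparators = strings.NewReplacer(" ", "", "-", "")

// luhnValid reports whether a digit string passes the Luhn checksum.
func luhnValid(digits string) bool {
	if digits == "" {
		return false
	}
	sum, double := 0, false
	for i := len(digits) - 1; i >= 0; i-- {
		c := digits[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

// classifyPAN scans text for Luhn-valid card numbers. tenantKey is an
// assumed per-tenant secret used to key the excerpt hash.
func classifyPAN(text string, tenantKey []byte) (Hit, bool) {
	hit := Hit{Pattern: "pci.pan"}
	for _, m := range panCandidate.FindAllString(text, -1) {
		digits := stripSeparators.Replace(m)
		if !luhnValid(digits) {
			continue // regex matched, checksum failed: treat as noise
		}
		if hit.Count == 0 {
			mac := hmac.New(sha256.New, tenantKey)
			mac.Write([]byte(digits))
			hit.ExcerptHash = hex.EncodeToString(mac.Sum(nil))
		}
		hit.Count++
	}
	return hit, hit.Count > 0
}

func main() {
	text := "order 4111 1111 1111 1111 shipped; ref 1234 5678 9012 3456"
	hit, ok := classifyPAN(text, []byte("demo-tenant-key"))
	fmt.Println(ok, hit.Pattern, hit.Count, len(hit.ExcerptHash))
}
```

A keyed hash is used in this sketch rather than a bare SHA-256 because card numbers carry little entropy and an unkeyed digest could be reversed by enumeration. The hex output is 64 characters, which fits the `content_excerpt_hash` column defined above.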

### Connectors with sampling

- **Google Drive / Docs** — `drive.files.list` with size cap; `files.export` for Docs to plain text; abide by Drive API rate limits.
- **Microsoft 365 (SharePoint / OneDrive)** — Graph search API with file filter; Sensitivity Labels read.
- **Box** — Box Shield label read; sampled file content.
- **GitHub** — repo content sampling for secrets (already a partial story via secret scanning if enabled — reuse #48).
- **Dropbox** — file metadata + sample content via API.
- **Notion / Confluence** — page content sampling.

### Risk boost integration

Existing finding risk-score calculators (`packages/shared/src/risk-scoring.ts`) get a multiplier when the linked `SecurityAsset.sensitivityScore > threshold`:

```ts
const adjustedRiskScore = baseRiskScore * (1 + Math.min(1.0, sensitivityScore / 100));
```

OAuth grant risk likewise: an app holding a CRITICAL scope on a `containsSensitiveData = true` asset bumps to MAX risk.
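
To make the arithmetic concrete, here is a small worked sketch of both rules. It is written in Go, the language of the proposed `internal/dspm/` package, although the calculator itself lives in `packages/shared/src/risk-scoring.ts`. The function names, the threshold parameter, and especially the `maxRisk = 100` ceiling are assumptions; the issue does not state the risk-score scale.

```go
package main

import (
	"fmt"
	"math"
)

// maxRisk is an assumed ceiling for risk scores; the real scale may differ.
const maxRisk = 100.0

// adjustedRisk applies the sensitivity multiplier only when the asset's
// sensitivityScore exceeds threshold, then clamps to maxRisk (the clamp
// is an assumption, since the multiplier can double the base score).
func adjustedRisk(base float64, sensitivityScore, threshold int) float64 {
	if sensitivityScore <= threshold {
		return base
	}
	boost := math.Min(1.0, float64(sensitivityScore)/100)
	return math.Min(maxRisk, base*(1+boost))
}

// oauthGrantRisk bumps a grant to maxRisk when it holds a CRITICAL scope
// on an asset flagged containsSensitiveData; otherwise base is unchanged.
func oauthGrantRisk(base float64, criticalScope, containsSensitiveData bool) float64 {
	if criticalScope && containsSensitiveData {
		return maxRisk
	}
	return base
}

func main() {
	fmt.Println(adjustedRisk(40, 50, 30))  // 40 * 1.5
	fmt.Println(adjustedRisk(40, 20, 30))  // below threshold, unchanged
	fmt.Println(adjustedRisk(70, 100, 30)) // 70 * 2.0, clamped to maxRisk
	fmt.Println(oauthGrantRisk(35, true, true))
}
```

With these assumptions, a base score of 40 on an asset scoring 50 becomes 60, while the same finding on an asset scoring 20 (below a threshold of 30) stays at 40.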

### UI surface

- **`/security/data`** — sensitivity heatmap across the SaaS estate, by provider + asset type + data label.
- **Per-asset detail** — show the data labels detected, the hit count, the sample timestamp, the affected findings.
- **Findings filter** — "touches sensitive data" toggle.
- **Shadow-IT overlay** — OAuth-grants card already renders `containsSensitiveData`; once the classifier runs, those flags become real.

### Phasing

| Phase | Scope |
|---|---|
| **P1** | Schema additions; regex/dictionary classifier; Google Drive sampler; risk-score boost wiring; `containsSensitiveData` actually populated |
| **P2** | M365 / SharePoint sampler; provider-native label ingestion (M365 Sensitivity Labels, Google DLP); `/security/data` heatmap |
| **P3** | GitHub secret sampling reuse; Box / Dropbox / Notion samplers; operator-tunable dictionary terms |
| **P4** | Data-flow tracking (file shared from Drive → Slack channel → Notion page) — relies on cross-connector event correlation from #11 |

## Open questions

- Sampling budget per scan — files-per-asset cap balanced against API quota; needs per-provider tuning.
- False-positive control on regex classifiers (PCI Luhn check is great, SSN regex is terrible) — start strict, expand cautiously.
- Should custom regex patterns be a per-tenant configurable list? (Almost certainly yes — every customer has bespoke confidential terms.)
- Data flow (P4) is its own correlation problem — leans on #11.

## References

- Reuses: `SecurityAsset` (already has `containsSensitiveData`, `type` enum, `dataAssetTypes` filtering); `internal/bootstrap/security_overview.go`'s data-asset awareness; `packages/shared/src/risk-scoring.ts`.
- Depends on: **#11** for the data-flow story (P4 only); rest is independent.
- Unlocks: **#14** chain analysis becomes much sharper ("this chain terminates in a sensitive-data asset"); **#46** AI narratives gain a powerful new dimension; **#5** compliance evidence pack carries data-class proof.

