[Tier 5] DSPM-lite: sensitive-data classification + risk-score boost #59

@dcoln25-writer

Problem

SecurityAsset.containsSensitiveData: boolean exists in the schema and is shown in the UI on the OAuth-grants card, but nothing populates it. It's seeded by the demo script and otherwise empty. The asset model has DATA_RESOURCE, WORKSPACE, VAULT, REPOSITORY types and dataAssetTypes filtering in internal/bootstrap/security_overview.go — the scaffolding is there, the classifier is missing.

DSPM (Data Security Posture Management) is the fastest-growing SSPM adjacency: every buyer wants to know which findings touch sensitive data and which OAuth grants can exfiltrate it. Even a lightweight version is differentiating.

Goals

  1. Lightweight data classifier that samples content in Drive / SharePoint / Box / GitHub for well-known sensitive-data patterns (PII, PHI, PCI, secrets).
  2. Asset-level enrichment: populate SecurityAsset.containsSensitiveData + a new sensitivityScore and dataLabels set.
  3. Risk-aware overlay: every existing finding's risk score boosts when the affected asset carries sensitive data; OAuth-grant risk likewise.
  4. Data-flow heatmap: where in the SaaS estate does sensitive data live, who can read it, and who has actually read it?
  5. Findings filter: "show me everything touching customer PII."

Non-goals

  • Not building a full DSPM with policy-as-code, fine-grained classification taxonomies, or ML-based content classification — start with regex/dictionary classifiers + provider-native labels where available.
  • Not scanning every file on every sync — sampling + provider-native classification labels (Google DLP, M365 sensitivity labels) where they exist.
  • Not transferring file contents off the customer's tenant — all classification happens server-side in Aperio's runtime.

Proposed design

Schema extensions

enum DataLabel {
  PII
  PHI
  PCI
  SECRET
  CONFIDENTIAL_INTERNAL
  REGULATED_OTHER
}

model SecurityAsset {
  // ... existing ...
  sensitivityScore  Int        @default(0) @map("sensitivity_score")  // 0-100, normalized
  dataLabels        DataLabel[] @default([]) @map("data_labels")
  classifiedAt      DateTime?  @map("classified_at")
  classifierVersion String?    @map("classifier_version") @db.VarChar(40)
}

model DataClassificationSample {
  id              String   @id @default(cuid())
  organizationId  String   @map("organization_id")
  assetId         String   @map("asset_id")
  source          String   @db.VarChar(60)        // "google_drive" | "github" | etc.
  pattern         String   @db.VarChar(120)       // matched classifier ID
  hitCount        Int      @map("hit_count")
  contentExcerptHash String? @map("content_excerpt_hash") @db.VarChar(64) // hash only, never raw content
  sampledAt       DateTime @default(now()) @map("sampled_at")
  organization    Organization @relation(...)
  asset           SecurityAsset @relation(fields: [assetId], references: [id], onDelete: Cascade)
  @@index([organizationId, assetId, sampledAt])
  @@map("data_classification_samples")
}

Classifier pipeline

internal/dspm/ Go package:

  1. Sampling worker — for each SecurityAsset of type DATA_RESOURCE/WORKSPACE/VAULT/REPOSITORY, sample N files (configurable) per scan cycle.
  2. Provider-native first — if Google DLP / M365 Sensitivity Labels / Box Shield labels already exist on the object, take them as authoritative and skip regex.
  3. Regex/dictionary classifier with built-in patterns:
    • PII: SSN, NI number, EU national IDs, passport numbers, IBAN/SWIFT
    • PHI: ICD-10 codes, MRN patterns, NDC drug codes
    • PCI: PAN (Luhn-validated), CVV, magnetic stripe
    • Secrets: AWS access keys, GCP service-account JSON, GitHub PATs, Stripe keys, OpenAI keys, Slack tokens, Twilio tokens (reuse trufflehog-style detector list)
    • Confidential internal: dictionary terms ("acquisition", "M&A", "salary band") — operator-tunable
  4. Hash, don't store — only persist sample-hit metadata + content excerpt hash, never the raw matched content (data-residency safe).
  5. Score computation: sensitivityScore = f(unique labels, hit density, asset exposure).
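
Steps 3 and 4 can be sketched together for the PCI case. This is a minimal illustration, not the planned internal/dspm code: the Hit type, the "pci.pan" classifier ID, and the function names are hypothetical, and SHA-256 is assumed only because its hex digest fits the 64-character content_excerpt_hash column.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// Hit carries only the metadata that DataClassificationSample persists.
type Hit struct {
	Pattern     string // classifier ID; "pci.pan" is a hypothetical name
	HitCount    int
	ExcerptHash string // hex SHA-256 (64 chars), never the raw match
}

// panCandidate finds 13-19 digit runs with optional space or hyphen separators.
var panCandidate = regexp.MustCompile(`\b(?:\d[ -]?){12,18}\d\b`)

// onlyDigits strips separators so the Luhn check sees digits only.
func onlyDigits(s string) string {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		if s[i] >= '0' && s[i] <= '9' {
			out = append(out, s[i])
		}
	}
	return string(out)
}

// luhnValid applies the Luhn checksum: double every second digit from the right.
func luhnValid(digits string) bool {
	sum, double := 0, false
	for i := len(digits) - 1; i >= 0; i-- {
		d := int(digits[i] - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return len(digits) > 0 && sum%10 == 0
}

// ClassifyPAN counts Luhn-valid candidates and hashes the first one.
// The matched text itself is discarded when the function returns.
func ClassifyPAN(content string) (Hit, bool) {
	hit := Hit{Pattern: "pci.pan"}
	for _, m := range panCandidate.FindAllString(content, -1) {
		digits := onlyDigits(m)
		if !luhnValid(digits) {
			continue
		}
		if hit.HitCount == 0 {
			sum := sha256.Sum256([]byte(digits))
			hit.ExcerptHash = hex.EncodeToString(sum[:])
		}
		hit.HitCount++
	}
	return hit, hit.HitCount > 0
}
```

Calling ClassifyPAN on "card 4111 1111 1111 1111 and order 1234567890123" yields one hit: the 13-digit order number matches the regex but fails the Luhn check. One caveat worth weighing: a plain hash of a card number can be brute-forced because the input space is small, so a keyed hash (for example HMAC with a per-tenant key) may be the safer choice for this column.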

Connectors with sampling

  • Google Drive / Docs: drive.files.list with size cap; files.export for Docs to plain text; abide by Drive API rate limits.
  • Microsoft 365 (SharePoint / OneDrive) — Graph search API with file filter; Sensitivity Labels read.
  • Box — Box Shield label read; sampled file content.
  • GitHub: repo content sampling for secrets (already a partial story via secret scanning if enabled; reuse #48, "Finish the connector rule packs (Okta, GitHub, Slack, M365, Atlassian, 1Password)").
  • Dropbox — file metadata + sample content via API.
  • Notion / Confluence — page content sampling.
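
Whatever the provider, each sampler has to reduce a file listing to a bounded sample. A minimal sketch of that selection step follows; the FileMeta type, the PickSample name, and the "most recently modified first" ordering are assumptions for illustration, since the issue only specifies N files per scan cycle and a size cap.

```go
package main

import "sort"

// FileMeta is a hypothetical provider-neutral view of one listed file.
type FileMeta struct {
	ID           string
	SizeBytes    int64
	ModifiedUnix int64
}

// PickSample returns at most n files whose size does not exceed maxBytes,
// ordered most recently modified first (an assumed policy, not a requirement).
func PickSample(files []FileMeta, n int, maxBytes int64) []FileMeta {
	eligible := make([]FileMeta, 0, len(files))
	for _, f := range files {
		if f.SizeBytes <= maxBytes {
			eligible = append(eligible, f)
		}
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		return eligible[i].ModifiedUnix > eligible[j].ModifiedUnix
	})
	if n < 0 {
		n = 0
	}
	if len(eligible) > n {
		eligible = eligible[:n]
	}
	return eligible
}
```

Keeping the selection separate from the provider API calls means the per-provider budget (n, maxBytes) can be tuned without touching connector code.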

Risk boost integration

Existing finding risk-score calculators (packages/shared/src/risk-scoring.ts) get a multiplier when the linked SecurityAsset.sensitivityScore > threshold:

adjustedRiskScore = baseRiskScore * (1 + min(1.0, sensitivityScore / 100))

OAuth grant risk likewise: an app holding a CRITICAL scope on a containsSensitiveData = true asset bumps to MAX risk.
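
A worked sketch of the boost, written in Go to match the classifier examples even though the real calculators live in packages/shared/src/risk-scoring.ts. The function name and the explicit threshold parameter are assumptions; the issue does not fix a threshold value.

```go
package main

import "math"

// AdjustedRiskScore applies the proposed multiplier only when the asset's
// sensitivityScore (0-100) exceeds the threshold; otherwise the base score
// is returned unchanged. The threshold is a hypothetical parameter.
func AdjustedRiskScore(base float64, sensitivityScore, threshold int) float64 {
	if sensitivityScore <= threshold {
		return base
	}
	boost := math.Min(1.0, float64(sensitivityScore)/100.0)
	return base * (1 + boost)
}
```

With a threshold of 20, a base score of 60 on an asset with sensitivityScore 50 becomes 60 * 1.5 = 90, while the same finding on an asset scoring 10 stays at 60. The formula can at most double the base score, so a finding that already sits near the top of the scale can exceed it; the issue does not say whether the result is clamped.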

UI surface

  • /security/data — sensitivity heatmap across the SaaS estate, by provider + asset type + data label.
  • Per-asset detail — show the data labels detected, the hit count, the sample timestamp, the affected findings.
  • Findings filter — "touches sensitive data" toggle.
  • Shadow-IT overlay — OAuth-grants card already renders containsSensitiveData; once the classifier runs, those flags become real.

Phasing

| Phase | Scope |
| --- | --- |
| P1 | Schema additions; regex/dictionary classifier; Google Drive sampler; risk-score boost wiring; containsSensitiveData actually populated |
| P2 | M365 / SharePoint sampler; provider-native label ingestion (M365 Sensitivity Labels, Google DLP); /security/data heatmap |
| P3 | GitHub secret sampling reuse; Box / Dropbox / Notion samplers; operator-tunable dictionary terms |
| P4 | Data-flow tracking (file shared from Drive → Slack channel → Notion page); relies on cross-connector event correlation from #11 |

Open questions

  • Sampling budget per scan — files-per-asset cap balanced against API quota; needs per-provider tuning.
  • False-positive control on regex classifiers (PCI Luhn check is great, SSN regex is terrible) — start strict, expand cautiously.
  • Should custom regex patterns be a per-tenant configurable list? (Almost certainly yes — every customer has bespoke confidential terms.)
  • Data flow (P4) is its own correlation problem; it leans on #11 ("Add Go ConnectRPC API foundation").
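
On the per-tenant pattern question above, one possible shape is a validated list that is compiled once at load time. The TenantPattern type and CompileTenantPatterns function below are hypothetical; nothing like them exists in the codebase yet.

```go
package main

import (
	"fmt"
	"regexp"
)

// TenantPattern is a hypothetical per-tenant classifier entry.
type TenantPattern struct {
	ID    string // e.g. "acme.codename"
	Label string // a DataLabel value such as "CONFIDENTIAL_INTERNAL"
	Expr  string // regular expression in Go's RE2 syntax
}

// CompileTenantPatterns compiles every expression up front and rejects the
// whole list on the first invalid one, so a bad pattern fails at save time
// rather than in the middle of a scan.
func CompileTenantPatterns(ps []TenantPattern) (map[string]*regexp.Regexp, error) {
	compiled := make(map[string]*regexp.Regexp, len(ps))
	for _, p := range ps {
		re, err := regexp.Compile(p.Expr)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", p.ID, err)
		}
		compiled[p.ID] = re
	}
	return compiled, nil
}
```

Go's regexp package guarantees matching time linear in the input size, which matters when the expressions come from tenants rather than from the built-in list.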


Labels: dspm (Data security posture / classification), tier-5-surface-expansion (Tier 5: expand the surfaces Aperio covers)
