Personally identifiable information (PII)

Definition

The PII test detects personally identifiable information (PII) in your data and checks the result against a threshold you set. It supports a broad range of PII types, including financial information, government identifiers, contact details, and location data, across multiple countries and regions. You can check for one or more PII types and set the threshold on either the absolute count or the percentage of rows containing PII.
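To make the two threshold measurements concrete, here is a minimal Python sketch of how a row count and a row percentage could be computed. The regular expression is a deliberately simple stand-in for email detection and is an assumption for illustration; it is not the detector the test uses. The helper name `pii_row_stats` is hypothetical. The percentage is on a 0-100 scale, matching the configuration example in which a value of 5.0 means 5% of rows.

```python
import re

# Stand-in detector: a simple email pattern used only for illustration.
# The real test's detection logic is not shown in this page.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pii_row_stats(rows):
    """Return (number of rows containing PII, percentage of rows containing PII)."""
    # A row counts once, no matter how many matches it contains.
    flagged = sum(1 for text in rows if EMAIL_RE.search(text))
    percentage = 100.0 * flagged / len(rows) if rows else 0.0
    return flagged, percentage


rows = [
    "Contact me at jane.doe@example.com",
    "No personal data here",
    "Reach bob@example.org or alice@example.org",
    "Order shipped on Tuesday",
]
count, pct = pii_row_stats(rows)
print(count, pct)  # 2 50.0
```

Note that the third row contains two addresses but is counted once, because both measurements are defined per row rather than per match.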

Taxonomy

  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: and .

Why it matters

  • Data privacy compliance: Helps verify that your data meets privacy regulations such as GDPR and CCPA
  • Security: Prevents accidental exposure of sensitive personal information
  • Model safety: LLMs are prone to memorizing and potentially leaking PII from training data
  • Audit trail: Provides documentation of PII detection for compliance reporting

Supported PII types

General PII Types

Type             Description
CREDIT_CARD      Credit card numbers (various formats)
EMAIL_ADDRESS    Email addresses
PHONE_NUMBER     Phone numbers (various formats)
IP_ADDRESS       IP addresses
URL              Web URLs
DATE_TIME        Date and time information
LOCATION         Geographic locations
PERSON           Person names
CRYPTO           Cryptocurrency addresses
MEDICAL_LICENSE  Medical license numbers
NRP              Nationality, religious, or political group
IBAN_CODE        International Bank Account Numbers

United States

Type               Description
US_SSN             Social Security Numbers
US_BANK_NUMBER     US bank account numbers
US_DRIVER_LICENSE  US driver’s license numbers
US_ITIN            Individual Taxpayer Identification Numbers
US_PASSPORT        US passport numbers

United Kingdom

Type     Description
UK_NHS   National Health Service numbers
UK_NINO  National Insurance numbers

European Union

Type                       Description
ES_NIF                     Spanish tax identification numbers
ES_NIE                     Spanish foreigner identification numbers
IT_FISCAL_CODE             Italian tax codes
IT_DRIVER_LICENSE          Italian driver’s licenses
IT_VAT_CODE                Italian VAT codes
IT_PASSPORT                Italian passport numbers
IT_IDENTITY_CARD           Italian identity cards
FI_PERSONAL_IDENTITY_CODE  Finnish personal identity codes
PL_PESEL                   Polish personal identification numbers

Asia-Pacific

Type                     Description
SG_NRIC_FIN              Singapore NRIC/FIN numbers
SG_UEN                   Singapore Unique Entity Numbers
AU_ABN                   Australian Business Numbers
AU_ACN                   Australian Company Numbers
AU_TFN                   Australian Tax File Numbers
AU_MEDICARE              Australian Medicare numbers
IN_PAN                   Indian Permanent Account Numbers
IN_AADHAAR               Indian Aadhaar numbers
IN_VEHICLE_REGISTRATION  Indian vehicle registration
IN_VOTER                 Indian voter ID numbers
IN_PASSPORT              Indian passport numbers

South America

Type     Description
BR_CPF   Brazilian individual taxpayer registry
BR_CNPJ  Brazilian national registry of legal entities
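Since the type names are plain strings, a typo is easy to make when listing them in a configuration. A minimal Python sketch of a pre-check follows, assuming the tables above are the complete list of accepted names. The helper `unknown_pii_types` is hypothetical and not part of any library.

```python
# Type names copied from the tables above; assumed to be the complete list.
SUPPORTED_PII_TYPES = {
    # General
    "CREDIT_CARD", "EMAIL_ADDRESS", "PHONE_NUMBER", "IP_ADDRESS", "URL",
    "DATE_TIME", "LOCATION", "PERSON", "CRYPTO", "MEDICAL_LICENSE", "NRP",
    "IBAN_CODE",
    # United States
    "US_SSN", "US_BANK_NUMBER", "US_DRIVER_LICENSE", "US_ITIN", "US_PASSPORT",
    # United Kingdom
    "UK_NHS", "UK_NINO",
    # European Union
    "ES_NIF", "ES_NIE", "IT_FISCAL_CODE", "IT_DRIVER_LICENSE", "IT_VAT_CODE",
    "IT_PASSPORT", "IT_IDENTITY_CARD", "FI_PERSONAL_IDENTITY_CODE", "PL_PESEL",
    # Asia-Pacific
    "SG_NRIC_FIN", "SG_UEN", "AU_ABN", "AU_ACN", "AU_TFN", "AU_MEDICARE",
    "IN_PAN", "IN_AADHAAR", "IN_VEHICLE_REGISTRATION", "IN_VOTER", "IN_PASSPORT",
    # South America
    "BR_CPF", "BR_CNPJ",
}


def unknown_pii_types(requested):
    """Return the requested names that are not in the documented list, in input order."""
    return [name for name in requested if name not in SUPPORTED_PII_TYPES]


print(unknown_pii_types(["EMAIL_ADDRESS", "US_SSN", "UK_NIN"]))  # ['UK_NIN']
```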

Test configuration examples

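If you are writing a tests.json file, here are two example configurations for the PII test. The // comments are annotations for this page only; standard JSON does not allow comments, so remove them before using these entries: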
[
  {
    "name": "No financial PII in model outputs",
    "description": "Ensures no credit cards or bank numbers appear in model outputs",
    "type": "integrity",
    "subtype": "containsPii",
    "thresholds": [
      {
        "insightName": "containsPii",
        "insightParameters": [
          {
            "name": "pii_type",
            "value": ["CREDIT_CARD", "US_BANK_NUMBER", "IBAN_CODE"] // Check multiple PII types
          },
          {
            "name": "column_name",
            "value": "model_output"
          }
        ],
        "measurement": "containsPIIRowCount",
        "operator": "<=",
        "value": 0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "Limited contact information leakage",
    "description": "Allows up to 5% of rows to contain contact information",
    "type": "integrity",
    "subtype": "containsPii",
    "thresholds": [
      {
        "insightName": "containsPii",
        "insightParameters": [
          {
            "name": "pii_type",
            "value": ["EMAIL_ADDRESS", "PHONE_NUMBER"] // Multiple types in array
          },
          {
            "name": "column_name",
            "value": "generated_text"
          }
        ],
        "measurement": "containsPIIRowPercentage", // Use percentage measurement
        "operator": "<=",
        "value": 5.0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]
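If you prefer to generate these entries rather than write them by hand, the following Python sketch builds one entry with the same fields as the examples above and serializes it as strict JSON, which contains no comments. The helper `make_pii_test` is hypothetical, and generating `syncId` with `uuid.uuid4()` is an assumption based on the example IDs looking like UUIDs; the source only says the value should be unique.

```python
import json
import uuid


def make_pii_test(name, description, pii_types, column_name, measurement, value):
    """Build one PII test entry mirroring the fields shown in the examples above."""
    return {
        "name": name,
        "description": description,
        "type": "integrity",
        "subtype": "containsPii",
        "thresholds": [
            {
                "insightName": "containsPii",
                "insightParameters": [
                    {"name": "pii_type", "value": list(pii_types)},
                    {"name": "column_name", "value": column_name},
                ],
                "measurement": measurement,
                "operator": "<=",
                "value": value,
            }
        ],
        "subpopulationFilters": None,  # serialized as JSON null
        "mode": "development",
        "usesValidationDataset": True,
        "usesTrainingDataset": False,
        "usesMlModel": False,
        # Assumption: any unique string is acceptable; a random UUID is used here.
        "syncId": str(uuid.uuid4()),
    }


tests = [
    make_pii_test(
        name="No financial PII in model outputs",
        description="Ensures no credit cards or bank numbers appear in model outputs",
        pii_types=["CREDIT_CARD", "US_BANK_NUMBER", "IBAN_CODE"],
        column_name="model_output",
        measurement="containsPIIRowCount",
        value=0,
    )
]

text = json.dumps(tests, indent=2)
print(text)
```

The resulting string can be written to tests.json as is. Building the entries from a dictionary avoids hand-editing mistakes such as trailing commas or leftover comments.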