Skip to content

fix(attributes): migrating several numeric fields to pii=maybe#228

Merged
constantinius merged 6 commits intomainfrom
constantinius/fix/attributes/change-pii-maybe-for-numbers
Jan 23, 2026
Merged

fix(attributes): migrating several numeric fields to pii=maybe#228
constantinius merged 6 commits intomainfrom
constantinius/fix/attributes/change-pii-maybe-for-numbers

Conversation

@constantinius
Copy link
Contributor

@constantinius constantinius commented Jan 22, 2026

Description

In a recent discussion it was concluded that numeric fields (unless they are IDs of some kind) should be migrated to pii=maybe. This also removes the test that was enforcing the wrong behavior of not allowing it for non-string fields.

Contributes to https://linear.app/getsentry/issue/TET-1760/ttft-relay-treats-it-as-pii

PR Checklist

  • I have run yarn test and verified that the tests pass.
  • I have run yarn generate && yarn format to generate and format code and docs.

If an attribute was added:

  • I have used the correct value for pii (i.e. maybe or true. Use false only for values that should never be scrubbed such as IDs)

If an attribute was deprecated:

Extra Info

This is the total list of updates and omissions:

============================================================
CHANGED ATTRIBUTES (PII set to 'maybe'):
============================================================
  - ai.completion_tokens.used
  - ai.frequency_penalty
  - ai.presence_penalty
  - ai.prompt_tokens.used
  - ai.temperature
  - ai.top_k
  - ai.top_p
  - ai.total_cost
  - ai.total_tokens.used
  - browser.script.source_char_position
  - cache.item_size
  - cache.ttl
  - client.port
  - cloudflare.d1.duration
  - cloudflare.d1.rows_read
  - cloudflare.d1.rows_written
  - code.line.number
  - code.lineno
  - frames.delay
  - frames.frozen
  - frames.slow
  - frames.total
  - gen_ai.cost.input_tokens
  - gen_ai.cost.output_tokens
  - gen_ai.cost.total_tokens
  - gen_ai.request.frequency_penalty
  - gen_ai.request.max_tokens
  - gen_ai.request.presence_penalty
  - gen_ai.request.temperature
  - gen_ai.request.top_k
  - gen_ai.request.top_p
  - gen_ai.response.tokens_per_second
  - gen_ai.usage.completion_tokens
  - gen_ai.usage.input_tokens
  - gen_ai.usage.input_tokens.cache_write
  - gen_ai.usage.input_tokens.cached
  - gen_ai.usage.output_tokens
  - gen_ai.usage.output_tokens.reasoning
  - gen_ai.usage.prompt_tokens
  - gen_ai.usage.total_tokens
  - http.decoded_response_content_length
  - http.request.connect_start
  - http.request.connection_end
  - http.request.domain_lookup_end
  - http.request.domain_lookup_start
  - http.request.fetch_start
  - http.request.redirect_end
  - http.request.redirect_start
  - http.request.request_start
  - http.request.resend_count
  - http.request.response_end
  - http.request.response_start
  - http.request.secure_connection_start
  - http.request.time_to_first_byte
  - http.request.worker_start
  - http.response.body.size
  - http.response.size
  - http.response.status_code
  - http.response_content_length
  - http.response_transfer_size
  - http.status_code
  - lcp.size
  - mcp.progress.current
  - mcp.progress.percentage
  - mcp.progress.total
  - mcp.prompt.result.message_count
  - mcp.protocol.ready
  - mcp.tool.result.content_count
  - messaging.message.body.size
  - messaging.message.envelope.size
  - messaging.message.receive.latency
  - messaging.message.retry.count
  - nel.elapsed_time
  - nel.sampling_function
  - net.host.port
  - net.peer.port
  - net.sock.host.port
  - net.sock.peer.port
  - network.local.port
  - network.peer.port
  - process.pid
  - rpc.grpc.status_code
  - sentry.client_sample_rate
  - sentry.exclusive_time
  - sentry.server_sample_rate
  - sentry.status_code
  - server.port
  - url.port
  - vercel.proxy.response_byte_size
  - vercel.proxy.status_code
  - vercel.proxy.timestamp
  - vercel.status_code

============================================================
OMITTED ATTRIBUTES (ID fields, not changed):
============================================================
  - event.id
  - thread.id

Total changed: 92
Total omitted: 2

@linear
Copy link

linear bot commented Jan 22, 2026

@constantinius
Copy link
Contributor Author

This was the vibe coded script to update the fields:

#!/usr/bin/env python3
"""
Script to update PII to 'maybe' for numeric attributes that have PII set to false,
excluding ID fields.
"""

import json
import os
from pathlib import Path


NUMERIC_TYPES = {"integer", "int", "double", "float", "long", "number"}
ATTRIBUTES_DIR = Path("model/attributes")


def is_id_field(key: str) -> bool:
    """Check if the attribute key looks like an ID field."""
    key_lower = key.lower()
    # Check for common ID patterns
    if key_lower.endswith("_id") or key_lower.endswith(".id"):
        return True
    if "_id." in key_lower or "_id_" in key_lower:
        return True
    # Check for specific ID-like suffixes
    parts = key_lower.split(".")
    if parts and parts[-1] in ("id", "request_id", "trace_id", "span_id", "session_id"):
        return True
    return False


def process_attribute_file(file_path: Path) -> tuple[str | None, str | None]:
    """
    Process a single attribute file.

    Returns:
        (changed_key, omitted_key) - one will be set, the other None
    """
    try:
        with open(file_path, "r") as f:
            data = json.load(f)
    except (json.JSONDecodeError, IOError) as e:
        print(f"Error reading {file_path}: {e}")
        return None, None

    key = data.get("key", "")
    attr_type = data.get("type", "")
    pii = data.get("pii", {})
    pii_value = pii.get("key") if isinstance(pii, dict) else None

    # Check if it's a numeric type with pii=false
    if attr_type not in NUMERIC_TYPES:
        return None, None

    if pii_value != "false":
        return None, None

    # Check if it's an ID field (should be omitted)
    if is_id_field(key):
        return None, key

    # Update the PII to "maybe"
    data["pii"]["key"] = "maybe"

    with open(file_path, "w") as f:
        json.dump(data, f, indent=2)
        f.write("\n")  # Add trailing newline

    return key, None


def main():
    changed = []
    omitted = []

    # Find all JSON files in the attributes directory
    for json_file in ATTRIBUTES_DIR.rglob("*.json"):
        changed_key, omitted_key = process_attribute_file(json_file)
        if changed_key:
            changed.append(changed_key)
        if omitted_key:
            omitted.append(omitted_key)

    # Sort for consistent output
    changed.sort()
    omitted.sort()

    print("=" * 60)
    print("CHANGED ATTRIBUTES (PII set to 'maybe'):")
    print("=" * 60)
    if changed:
        for key in changed:
            print(f"  - {key}")
    else:
        print("  (none)")

    print()
    print("=" * 60)
    print("OMITTED ATTRIBUTES (ID fields, not changed):")
    print("=" * 60)
    if omitted:
        for key in omitted:
            print(f"  - {key}")
    else:
        print("  (none)")

    print()
    print(f"Total changed: {len(changed)}")
    print(f"Total omitted: {len(omitted)}")


if __name__ == "__main__":
    main()

Copy link
Member

@Dav1dde Dav1dde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only checked the changed attribute names, 👍

@lcian
Copy link
Member

lcian commented Jan 22, 2026

Looks like not all of the can be sent by the SDK, but I think these should remain pii: "false":

- sentry.client_sample_rate
- sentry.exclusive_time
- sentry.server_sample_rate
- sentry.status_code

Do you agree @Dav1dde?

Thank you!

@Dav1dde
Copy link
Member

Dav1dde commented Jan 22, 2026

Good callout, the sample rates make sense to be false and the client one is also sent by the SDK

In practice exclusive time and status code I think don't matter and are fine either way. I'd just generally default to maybe unless there is a really good reason no to, which for sample rates definitely makes sense.

@constantinius
Copy link
Contributor Author

This comment was raised:

sentry.status_code is populated for Insights and is copied from the http response status code field - I think that would be safer as maybe since we don't control the source data

What's the verdict on that one? @lcian @Dav1dde

@lcian
Copy link
Member

lcian commented Jan 22, 2026

@constantinius Then I would say let's please revert just both sample rates to false, the other two we can keep as maybe.

@constantinius constantinius requested a review from Lms24 as a code owner January 23, 2026 08:08
@constantinius
Copy link
Contributor Author

@lcian Done!
I still need an approval from either code owner

@constantinius constantinius merged commit b39c051 into main Jan 23, 2026
10 checks passed
@constantinius constantinius deleted the constantinius/fix/attributes/change-pii-maybe-for-numbers branch January 23, 2026 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants