Skip to content

Cosmos Diagnostics Logging Filters and optimizations.#39897

Merged
bambriz merged 16 commits intoAzure:mainfrom
bambriz:cosmos-diagnostics-logger-filter
Apr 18, 2025
Merged

Cosmos Diagnostics Logging Filters and optimizations.#39897
bambriz merged 16 commits intoAzure:mainfrom
bambriz:cosmos-diagnostics-logger-filter

Conversation

@bambriz
Copy link
Copy Markdown
Member

@bambriz bambriz commented Feb 28, 2025

Description:

This PR adds the ability to use Filters from the Logging Module to filter Cosmos Diagnostics Logs based on the same filterable parameters as the Cosmos Diagnostics Handler. The Cosmos Diagnostics Filterable parameters are added to the Log Record of the Logger used to log diagnostics.

Each log can be filtered by the following attributes:

  • status_code
  • sub_status_code
  • duration
  • verb
  • database_name
  • collection_name
  • operation_type
  • url
  • resource_type
  • is_request

example of using Filters to filter diagnostics logs:

import logging
# Configure logging
logger = logging.getLogger("azure.cosmos")
logging.basicConfig(filename='cosmosdiagnostics.log', level=logging.INFO)
#49.7
class CosmosStatusCodeFilter(logging.Filter):
    def filter(self, record):
        ret = (hasattr(record, 'status_code') and record.status_code > 400
           and not (record.status_code in [404, 409, 412] and getattr(record, 'sub_status_code', None) in [0, None])
           and hasattr(record, 'duration') and record.duration > 1000)
        return ret
 logger.addFilter(CosmosStatusCodeFilter())  
 client = CosmosClient(url, Key, logger=logger, enable_diagnostics_logging=True)

Additionally, this PR adds some optimizations to significantly reduce the time spent on logging diagnostics. The improvement is more noticeable when large amounts of requests are logged. As a reference If using the old diagnostics for logging 5000 requests, the time spent on logging diagnostics was over 10000 ms, while with the new optimizations, the time spent is reduced to between 3000-5000ms if using the same filterable parameters.

See the following profiles of old diagnostics logging and new diagnostics logging while making 3600 requests:

  • Old Diagnostics Logging(Current beta version):
Log Everything
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 16728           | 632          |
| on_request  | 3601        | 12029            | 821           |

Not Logging
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 10097             | 18           |
| on_request  | 3601        | 5435            | 16           |

With Filter
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 15527           | 662           |
| on_request  | 7202        | 6403            | 595           |
  • New Diagnostics Logging( From this PR):
Log Everything
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 3489             | 359          |
| on_request  | 3601        | 3188            | 543          |

Not Logging
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 1503             | 15           |
| on_request  | 3601        | 1147            | 14          |

With Filter
| Name        | Call Count | Total Time (ms) | Own Time (ms) |
|-------------|-------------|-----------------|---------------|
| on_response | 3601        | 3241             | 327          |
| on_request  | 7202        | 1833            | 338           |

The logs have also change in format to make the error messages more readable, as well as added some more diagnostic information. Formerly redacted headers are also now simply omitted to optimize logging.

Example:

INFO:cosmos_logger:
Client Settings: 
	Client Preferred Regions: []
	Client Available Read Regions: ['South Central US']
	Client Available Write Regions: ['South Central US']

Database Account Settings: 
	Consistency Level: Session
	Writable Locations: [{'name': 'South Central US', 'databaseAccountEndpoint': 'https://127.0.0.1:8081/'}]
	Readable Locations: [{'name': 'South Central US', 'databaseAccountEndpoint': 'https://127.0.0.1:8081/'}]
	Multi-Region Writes: False

Request URL: 'https://127.0.0.1:8081/dbs/example_db/colls/example_container/docs/itemA1/'
Request method: 'GET'
Request headers:
    'Cache-Control': 'no-cache'
    'x-ms-version': '2020-07-15'
    'x-ms-documentdb-query-iscontinuationexpected': 'False'
    'x-ms-consistency-level': 'Session'
    'x-ms-session-token': '0:-1#14040'
    'x-ms-documentdb-partitionkey': '["itemA1"]'
    'x-ms-date': 'Mon, 24 Mar 2025 21:41:15 GMT'
    'Accept': 'application/json'
    'x-ms-cosmos-intended-collection-rid': '8M9lAOgPGWY='
    'x-ms-thinclient-proxy-resource-type': 'docs'
    'x-ms-thinclient-proxy-operation-type': 'Read'
    'Content-Length': '0'
    'User-Agent': 'azsdk-python-cosmos-async/4.10.0b1 Python/3.10.11 (Windows-10-10.0.26100-SP0)'
No body was attached to the request
INFO:cosmos_logger:
Response status: 404
Response headers:
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json'
    'Content-Location': 'https://127.0.0.1:8081/dbs/example_db/colls/example_container/docs/itemA1/'
    'Server': 'Microsoft-HTTPAPI/2.0'
    'x-ms-activity-id': '3ec6641c-a2aa-4baa-8de4-655762a78291'
    'x-ms-last-state-change-utc': 'Thu, 13 Mar 2025 02:59:44.048 GMT'
    'x-ms-schemaversion': '1.18'
    'lsn': '14040'
    'x-ms-request-charge': '1'
    'x-ms-quorum-acked-lsn': '14040'
    'x-ms-current-write-quorum': '1'
    'x-ms-current-replica-set-size': '1'
    'x-ms-documentdb-partitionkeyrangeid': '0'
    'x-ms-xp-role': '0'
    'x-ms-global-Committed-lsn': '-1'
    'x-ms-number-of-read-regions': '0'
    'x-ms-transport-request-id': '3'
    'x-ms-cosmos-llsn': '14040'
    'x-ms-cosmos-quorum-acked-llsn': '14040'
    'x-ms-session-token': '0:-1#14040'
    'x-ms-request-duration-ms': '0.6'
    'x-ms-serviceversion': 'version=2.14.0.0'
    'x-ms-cosmos-physical-partition-id': '3'
    'x-ms-gatewayversion': 'version=2.14.0'
    'Date': 'Mon, 24 Mar 2025 21:41:15 GMT'
Elapsed time in seconds: 0.0006
Response error message: 
	Code: NotFound
	Message: {"Errors":["Resource Not Found. Learn more: https:\/\/aka.ms\/cosmosdb-tsg-not-found"]}
		ActivityId: 3ec6641c-a2aa-4baa-8de4-655762a78291,
		 Request URI: /apps/DocDbApp/services/DocDbServer19/partitions/a4cb495f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/,
		 RequestStats: 
		RequestStartTime: 2025-03-24T21:41:15.0345786Z,
		 RequestEndTime: 2025-03-24T21:41:15.0506864Z,
		  Number of regions attempted:1
		{"systemHistory":[{"dateUtc":"2025-03-24T21:40:17.2658379Z",
		"cpu":27.457,
		"memory":11921940.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.226,
		"availableThreads":32766,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1},
		{"dateUtc":"2025-03-24T21:40:27.2699942Z",
		"cpu":21.270,
		"memory":12007768.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.1075,
		"availableThreads":32766,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1},
		{"dateUtc":"2025-03-24T21:40:37.2768191Z",
		"cpu":33.535,
		"memory":12081832.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.1004,
		"availableThreads":32766,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1},
		{"dateUtc":"2025-03-24T21:40:47.2816776Z",
		"cpu":39.899,
		"memory":12078540.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.0878,
		"availableThreads":32764,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1},
		{"dateUtc":"2025-03-24T21:40:57.2901438Z",
		"cpu":48.223,
		"memory":12009504.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.0152,
		"availableThreads":32766,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1},
		{"dateUtc":"2025-03-24T21:41:07.3029936Z",
		"cpu":34.653,
		"memory":11946860.000,
		"threadInfo":{"isThreadStarving":"False",
		"threadWaitIntervalInMs":0.0818,
		"availableThreads":32766,
		"minThreads":8,
		"maxThreads":32767},
		"numberOfOpenTcpConnection":1}]}
		RequestStart: 2025-03-24T21:41:15.0490499Z; ResponseTime: 2025-03-24T21:41:15.0506864Z; StoreResult: StorePhysicalAddress: rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer19/partitions/a4cb495f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/,
		 LSN: 14040,
		 GlobalCommittedLsn: -1,
		 PartitionKeyRangeId: 0,
		 IsValid: True,
		 StatusCode: 404,
		 SubStatusCode: 0,
		 RequestCharge: 1,
		 ItemLSN: -1,
		 SessionToken: -1#14040,
		 UsingLocalLSN: True,
		 TransportException: null,
		 BELatencyMs: 0.6,
		 ActivityId: 3ec6641c-a2aa-4baa-8de4-655762a78291,
		 RetryAfterInMs: ,
		 ReplicaHealthStatuses: [(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM),
		(port: 10253 | status: Unknown | lkt: 3/24/2025 5:01:40 PM)],
		 TransportRequestTimeline: {"requestTimeline":[{"event": "Created",
		 "startTimeUtc": "2025-03-24T21:41:15.0488112Z",
		 "durationInMs": 0.02},
		{"event": "ChannelAcquisitionStarted",
		 "startTimeUtc": "2025-03-24T21:41:15.0488312Z",
		 "durationInMs": 0.003},
		{"event": "Pipelined",
		 "startTimeUtc": "2025-03-24T21:41:15.0488342Z",
		 "durationInMs": 0.1936},
		{"event": "Transit Time",
		 "startTimeUtc": "2025-03-24T21:41:15.0490278Z",
		 "durationInMs": 1.2215},
		{"event": "Received",
		 "startTimeUtc": "2025-03-24T21:41:15.0502493Z",
		 "durationInMs": 0.1598},
		{"event": "Completed",
		 "startTimeUtc": "2025-03-24T21:41:15.0504091Z",
		 "durationInMs": 0}],
		"serviceEndpointStats":{"inflightRequests":1,
		"openConnections":1},
		"connectionStats":{"waitforConnectionInit":"False",
		"callsPendingReceive":0,
		"lastSendAttempt":"2025-03-24T21:41:14.9828567Z",
		"lastSend":"2025-03-24T21:41:14.9830833Z",
		"lastReceive":"2025-03-24T21:41:14.9849080Z"},
		"requestSizeInBytes":511,
		"responseMetadataSizeInBytes":186,
		"responseBodySizeInBytes":87};
		 ResourceType: Document,
		 OperationType: Read
		,
		 SDK: Microsoft.Azure.Documents.Common/2.14.0

Adds the ability to use logging.Filters on diagnostics logs based on the same filterable parameters as the cosmos diagnostics handler. Also adds some optimizations to reduce time spent on logging diagnostics. Lastly improves the formatting of the diagnostics log to improve readabilty especially when errors are logged.
@azure-sdk
Copy link
Copy Markdown
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

@bambriz bambriz marked this pull request as ready for review March 5, 2025 04:47
@bambriz bambriz requested review from a team and annatisch as code owners March 5, 2025 04:47
Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks Bryan!

Main thing I'd like though, is to add something similar to your PR description to the README section that talks about diagnostics in the SDK since this is highly relevant and probably the only way that customers will realistically use the logging since otherwise it's too verbose. Adding a section below with explanations on the two different ways to filter as well as links to the relevant samples would be great.

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One suggestion would be to lower the amount of headers we log by default. For example these headers, I don't how much value they have for debugging.

'Server': 'Microsoft-HTTPAPI/2.0'
'Access-Control-Allow-Origin': ''
'Access-Control-Allow-Credentials': 'REDACTED'
'Transfer-Encoding': 'chunked'

Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thik the perf results in the PR description are not very helpful - to understand the impact it would be required to udnerstand how much of the improvement is because logs are filtered out now? So, instead having two perf tests - one with high percentage of requetss really being logged. And one where the percentage of records being actually logged is extremely low (could even be zero) would be important.

Last-but-not-least - throughput would be good - Like how many point reads per second can be done with each of the models - with reference to CPU - to ensure that better throughput is not explained by just higher CPU.

Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment on expanding the test but this looks great, thanks Bryan!

@FabianMeiswinkel FabianMeiswinkel self-requested a review April 17, 2025 22:53
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...

@FabianMeiswinkel FabianMeiswinkel dismissed their stale review April 17, 2025 23:00

No need to block on me

Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bambriz, looks good to me.

@bambriz
Copy link
Copy Markdown
Member Author

bambriz commented Apr 18, 2025

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@bambriz bambriz merged commit 9712d82 into Azure:main Apr 18, 2025
18 checks passed
cRui861 pushed a commit that referenced this pull request May 14, 2025
* Updates on Cosmos Diagnostics

Adds the ability to use logging.Filters on diagnostics logs based on the same filterable parameters as the cosmos diagnostics handler. Also adds some optimizations to reduce time spent on logging diagnostics. Lastly improves the formatting of the diagnostics log to improve readabilty especially when errors are logged.

* Update CHANGELOG.md

* Update CHANGELOG.md

* updates to cosmos diagnostics

* update to logging policy and tests

* pylint updates

* Update README.md

* Remove diagnostics handler

This removes the custom diagnostics handler and instead allows the use of logging filters.

* Update CHANGELOG.md

* Update _cosmos_http_logging_policy.py

* update recommended changes

* updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants