Description:
Stats such as upstream_rq, which are broken out by both response code (e.g. 404) and response code class (e.g. 4xx), lead to duplicated data in metrics.
For example:
cluster.FOO.upstream_rq_403: 2
cluster.FOO.upstream_rq_404: 1726
cluster.FOO.upstream_rq_4xx: 1728
This currently leads to the following in Prometheus format:
cluster_upstream_rq{cluster_name="FOO",envoy_response_code="403"} 2
cluster_upstream_rq{cluster_name="FOO",envoy_response_code="404"} 1726
cluster_upstream_rq{cluster_name="FOO",envoy_response_code_class="4xx"} 1728
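This means that aggregates over the cluster_upstream_rq metric are meaningless, because each request is counted once by response code and once by response code class (e.g. sum(cluster_upstream_rq) = 3456, although only 1728 requests were made).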
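To make the double counting concrete, here is a minimal Python sketch. The `samples` list and the two helper functions are hypothetical and only mirror the three series shown above; they are not part of Envoy or Prometheus.

```python
# Hypothetical in-memory copy of the three series above: (labels, value).
samples = [
    ({"cluster_name": "FOO", "envoy_response_code": "403"}, 2),
    ({"cluster_name": "FOO", "envoy_response_code": "404"}, 1726),
    ({"cluster_name": "FOO", "envoy_response_code_class": "4xx"}, 1728),
]


def naive_sum(series):
    """Add up every series, the way sum(cluster_upstream_rq) would."""
    return sum(value for _, value in series)


def per_code_sum(series):
    """Add up only the series that carry an exact response code."""
    return sum(
        value for labels, value in series if "envoy_response_code" in labels
    )


print(naive_sum(samples))     # 3456: each request is counted twice
print(per_code_sum(samples))  # 1728: the real request count
```

With the current output, a correct aggregate requires filtering on which label is present, which is easy to forget in a dashboard query.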
The output should instead be metrics that carry both an envoy_response_code and an envoy_response_code_class label:
cluster_upstream_rq{cluster_name="FOO",envoy_response_code="403",envoy_response_code_class="4xx"} 2
cluster_upstream_rq{cluster_name="FOO",envoy_response_code="404",envoy_response_code_class="4xx"} 1726
Or, alternatively, the envoy_response_code_class labels could be dropped entirely, since Prometheus can derive them from the envoy_response_code labels if needed.
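As a rough illustration of why the class label is redundant, the sketch below derives the class from the exact code and rebuilds the per-class total. The function names and the `per_code` dictionary are hypothetical, and the sketch assumes every code is a three-digit HTTP status string.

```python
from collections import defaultdict


def response_code_class(code: str) -> str:
    """Map an exact status code such as "404" to its class, "4xx"."""
    return code[0] + "xx"


def totals_by_class(per_code_counts):
    """Rebuild per-class totals from per-code counts alone."""
    totals = defaultdict(int)
    for code, count in per_code_counts.items():
        totals[response_code_class(code)] += count
    return dict(totals)


# Per-code counts taken from the example above.
per_code = {"403": 2, "404": 1726}
print(totals_by_class(per_code))  # {'4xx': 1728}
```

The rebuilt 4xx total matches the existing upstream_rq_4xx value, so no information is lost by emitting only per-code series. On the Prometheus side, a similar mapping should be possible with the PromQL label_replace function followed by a sum grouped by the derived label.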