feat: incident.io Notifier#4372

Merged
gotjosh merged 6 commits into prometheus:main from incident-io:rorymalcolm/incidentio-notifier
Sep 22, 2025

Conversation

@rorymalcolm
Contributor

@rorymalcolm rorymalcolm commented Apr 27, 2025

Closes #4367

  • Adds the technical implementation, and tests, for the incident.io notifier

  • Configured through the following config:

```yaml
receivers:
  - name: 'incidentio-notifications'
    incidentio_configs:
      - url: '$alert_source_url'
        alert_source_token: '$alert_source_token'
```
@rorymalcolm rorymalcolm changed the title from "- Adds the technical implementation, and tests, for the incident.io n…" to "incident.io Notifier" on Apr 27, 2025
@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch 2 times, most recently from 98ed08f to 416cac1 on April 27, 2025 at 22:09
@rorymalcolm rorymalcolm changed the title from "incident.io Notifier" to "feat: incident.io Notifier" on Apr 27, 2025
@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch 3 times, most recently from 118a310 to 4bc00a7 on April 28, 2025 at 09:49
```go
}

// IncidentioConfig configures notifications via incident.io.
type IncidentioConfig struct {
```

What do you think about adding a Metadata field to this struct? Something similar to the Details field in the OpsgenieConfig struct.
This would enable users to define additional data (see the incident.io API definition for reference).

@grobinson-grafana
Collaborator

grobinson-grafana commented May 3, 2025

Looks good, a couple comments. It also needs docs here and here 👍

@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch 3 times, most recently from 17d6a6f to 255befe on May 7, 2025 at 07:06
@rorymalcolm
Contributor Author

Looks good, a couple comments. It also needs docs here and here 👍

All done - I think? 🙏

@grobinson-grafana
Collaborator

You have some lint failures in notify/incidentio/incidentio_test.go

@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch 2 times, most recently from 8f16fcd to bbc39ce on May 9, 2025 at 16:14
@grobinson-grafana
Collaborator

You still have some failing tests I'm afraid https://github.com/prometheus/alertmanager/actions/runs/14933150149/job/41982648209?pr=4372

--- FAIL: TestIncidentIORetry (0.01s)
    incidentio_test.go:48: 
        	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:48
        	Error:      	Received unexpected error:
        	            	one of alert_source_token or alert_source_token_file must be configured
        	Test:       	TestIncidentIORetry
--- FAIL: TestIncidentIORedactedURL (0.01s)
    incidentio_test.go:69: 
        	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:69
        	Error:      	Received unexpected error:
        	            	one of alert_source_token or alert_source_token_file must be configured
        	Test:       	TestIncidentIORedactedURL
--- FAIL: TestIncidentIOURLFromFile (0.01s)
    incidentio_test.go:91: 
        	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:91
        	Error:      	Received unexpected error:
        	            	one of alert_source_token or alert_source_token_file must be configured
        	Test:       	TestIncidentIOURLFromFile
--- FAIL: TestIncidentIONotify (0.01s)
    incidentio_test.go:143: 
        	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:143
        	Error:      	Received unexpected error:
        	            	one of alert_source_token or alert_source_token_file must be configured
        	Test:       	TestIncidentIONotify
--- FAIL: TestIncidentIORetryScenarios (0.03s)
    --- FAIL: TestIncidentIORetryScenarios/success_response (0.01s)
        incidentio_test.go:223: 
            	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:223
            	Error:      	Received unexpected error:
            	            	one of alert_source_token or alert_source_token_file must be configured
            	Test:       	TestIncidentIORetryScenarios/success_response
    --- FAIL: TestIncidentIORetryScenarios/rate_limit_response (0.01s)
        incidentio_test.go:223: 
            	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:223
            	Error:      	Received unexpected error:
            	            	one of alert_source_token or alert_source_token_file must be configured
            	Test:       	TestIncidentIORetryScenarios/rate_limit_response
    --- FAIL: TestIncidentIORetryScenarios/server_error_response (0.01s)
        incidentio_test.go:223: 
            	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:223
            	Error:      	Received unexpected error:
            	            	one of alert_source_token or alert_source_token_file must be configured
            	Test:       	TestIncidentIORetryScenarios/server_error_response
    --- FAIL: TestIncidentIORetryScenarios/client_error_response (0.01s)
        incidentio_test.go:223: 
            	Error Trace:	/__w/alertmanager/alertmanager/notify/incidentio/incidentio_test.go:223
            	Error:      	Received unexpected error:
            	            	one of alert_source_token or alert_source_token_file must be configured
            	Test:       	TestIncidentIORetryScenarios/client_error_response

@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch from bbc39ce to 07604e3 on May 11, 2025 at 21:03
@rorymalcolm
Contributor Author

You still have some failing tests I'm afraid https://github.com/prometheus/alertmanager/actions/runs/14933150149/job/41982648209?pr=4372


Ah - apols; fixed!

@rorymalcolm
Contributor Author

Hey @grobinson-grafana - just confirming this is good to go?

@grobinson-grafana
Collaborator

I haven't had time to review the latest changes, but I hope to be able to do that over the weekend. If the docs are looking good and the code and tests still look good, I'll merge it then.

@axdotl

axdotl commented Jun 10, 2025

Hello @grobinson-grafana - any chance to get this merged soon? Looking forward to making use of it.
Thanks in advance!

@ankitdh7

Any chance to get this merged soon? It will be a great improvement over the webhook.

@grobinson-grafana
Collaborator

Just a heads up: we have an issue with Circle CI being broken for Alertmanager, and we are figuring out with other Prometheus contributors how to fix it.

@rorymalcolm rorymalcolm force-pushed the rorymalcolm/incidentio-notifier branch from 7f60f95 to 34eac10 on July 14, 2025 at 08:50
@grobinson-grafana
Collaborator

Hi! Our CI is fixed, so will do a final review and then merge this. Thanks for your patience, I know this has been open since April.

Collaborator

@grobinson-grafana grobinson-grafana left a comment

Fixed the lint error, but I think there is a bug where it doesn't inherit the global HTTP config from the configuration file, and some of the truncation doesn't work.

```go
}

// encodeMessage encodes the message and truncates content if it exceeds maxPayloadSize.
func (n *Notifier) encodeMessage(msg *Message) (bytes.Buffer, error) {
```
Collaborator

I think the step-based reduction is clever, but I am a bit worried about the number of times we marshal the message when reducing its final size. JSON marshaling in Go is expensive in terms of both CPU and allocations. I appreciate that in the worst case it's effectively O(log N) marshals, as you halve the number of alerts each time, but I'm also thinking about this from the perspective of multi-tenant projects like Cortex and Mimir that might have 1000s or 10,000s of tenants "sharing" an Alertmanager. It's hard to say this is fine without any data to prove it.

Have you tried keeping a running estimate of the size instead, to avoid re-marshaling? By contrast, adding ints is one of the fastest operations a CPU can do, so it would make a massive difference in overhead. If you want to account for the byte overhead of the JSON syntax (braces, commas, etc.), you can estimate it as 5% of the final payload size.

Contributor

I had a chat with Rory and decided to just go for a much simpler implementation here: if the full message is too large, drop all but the first alert and try that.

I've written up the full reasoning in the commit, but I think this is both more efficient (generally 1 encoding, occasionally 2), and a better experience for the humans involved: no strange mangled alerts coming through, just some additional info dropped.
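The simplified approach described above can be sketched roughly as follows. The Message shape, function signature, and error messages are assumptions for illustration, not the merged code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// maxPayloadSize is the 512kB limit discussed in the thread.
const maxPayloadSize = 512 * 1024

// Message is a simplified stand-in for the notifier's payload type.
type Message struct {
	Alerts []map[string]string `json:"alerts"`
}

// encodeMessage tries at most two encodings: the full message, and if that is
// too large, the same message with all but the first alert dropped.
func encodeMessage(msg *Message) (bytes.Buffer, error) {
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(msg); err != nil {
		return buf, err
	}
	if buf.Len() <= maxPayloadSize {
		return buf, nil
	}
	if len(msg.Alerts) <= 1 {
		return buf, errors.New("message exceeds limit and cannot be truncated further")
	}
	// Too large: keep only the first alert and re-encode (second and final attempt).
	truncated := &Message{Alerts: msg.Alerts[:1]}
	buf.Reset()
	if err := json.NewEncoder(&buf).Encode(truncated); err != nil {
		return buf, err
	}
	if buf.Len() > maxPayloadSize {
		return buf, errors.New("message too large even after truncation")
	}
	return buf, nil
}

func main() {
	msg := &Message{Alerts: []map[string]string{{"alertname": "HighCPU"}}}
	buf, err := encodeMessage(msg)
	fmt.Println(err == nil, buf.Len() > 0)
}
```

In the common case this costs a single JSON encoding; only oversized groups pay for a second one.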

- Adds the technical implementation, and tests, for the incident.io notifier

- Configured through the following config:

```yaml
receivers:
  - name: 'incidentio-notifications'
    incidentio_configs:
      - url: '$alert_source_url'
        alert_source_token: '$alert_source_token'
```

- Add documentation for the incidentio_config

Signed-off-by: Isaac Seymour <i.seymour@oxon.org>
@isaacseymour isaacseymour force-pushed the rorymalcolm/incidentio-notifier branch from d0be1dc to 0973fd3 on August 6, 2025 at 16:48
@isaacseymour
Contributor

Hey @grobinson-grafana - I believe this should be ready to go at this point!

Rather than carefully trying to shrink the size of the payload to fit
the 512kB limit, just try two encodings:
1. The full original message; and
2. Remove all but the first alert in the group and send that

For most configurations, each message creates a single alert in
incident.io, with the details of the alerts which made up that group
contained within being useful but not essential.

This means the code is a lot simpler, and does at most 2 JSON encodings
for each message (although in general it should be pretty rare to need
more than 1!)

Signed-off-by: Isaac Seymour <i.seymour@oxon.org>
@isaacseymour isaacseymour force-pushed the rorymalcolm/incidentio-notifier branch from 0973fd3 to 0d86b5b on August 7, 2025 at 08:20
@isaacseymour
Contributor

👋 Hey @grobinson-grafana, would love to get this merged if you have time to review! 🙏

@paulyehorov

Hey folks,
Just stumbled upon this - we really, really need this. We recently migrated to incident.io on-call and are missing some very useful features due to being limited to webhook usage.

@gotjosh
Member

gotjosh commented Sep 22, 2025

I'm not sure I follow what's going on - but I'm not able to add commits to your pull request directly, despite having "Maintainers are allowed to edit this pull request" enabled.

To speed things up, I've made the changes myself (as some of them are a bit pedantic, e.g. ending comments with dots (.)).

I have attached the diff of both commits I've made - I'd appreciate it if you could incorporate them and let me know your thoughts.

linter-fix.diff.txt
patch.diff.txt

Most of the changes are purely cosmetic but do let me know if you have any questions on them - if you can please incorporate them I'll be happy to merge this.

Signed-off-by: Isaac Seymour <i.seymour@oxon.org>
@isaacseymour isaacseymour force-pushed the rorymalcolm/incidentio-notifier branch from 583bc7e to c22bd50 on September 22, 2025 at 15:27
@isaacseymour
Contributor

That is indeed mysterious: it seems like GitHub allows you to merge main in, but not push changes.

I've applied your patches in c22bd50. Thank you so much @gotjosh!

Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh
Member

gotjosh commented Sep 22, 2025

That is indeed mysterious: it seems like GitHub allows you to merge main in, but not push changes.

It does allow me to do anything through the UI. I'm not sure why it's not allowing me via other means.

Anyway, I've double-checked @grobinson-grafana's comments and they all seem to be addressed now, so this LGTM.

Member

@gotjosh gotjosh left a comment

LGTM

@gotjosh
Member

gotjosh commented Sep 22, 2025

As for the release, I'm aiming to set some time aside next week to review a few PRs, get them in and then do the release with @grobinson-grafana.

@gotjosh gotjosh merged commit 28d3f86 into prometheus:main Sep 22, 2025
11 checks passed
@gotjosh
Member

gotjosh commented Sep 22, 2025

Thank you very much for your contribution @isaacseymour and @rorymalcolm ❤️

Development

Successfully merging this pull request may close these issues: Support an incident.io notifier