bridge: improvements / app: improvements #427

Merged: marcello33 merged 17 commits into develop from mardizzone/improvements on Aug 19, 2025
marcello33 commented on Aug 14, 2025

Description

In this PR we implement the following improvements:

  • When the bridge initialised on a non-synced node whose validator account had only just joined the network, it used to panic. It now keeps polling while the node catches up: heimdalld waits until the account is visible locally, which means the node is synced and past the join height.
  • The bridge can no longer run as a separate process. It is always embedded in heimdalld as a child process, enabled via the --bridge flag. Validators running the bridge need to adapt and start it within the heimdalld service using heimdalld start --bridge --all --rest-server.
  • When the rest-server is not responding, heimdalld no longer crashes unless the context is intentionally canceled. Instead, the service waits for a configured time (restServerTimeOutInMinutes = 30) and then starts printing meaningful logs, so that operators can check the status of their rest server. The logs now look like this:
Aug 14 12:30:56 mardizzone-stg-bor-5 heimdalld[190909]: 12:30PM INF Timed out dur=487.61501 height=144043 module=consensus round=0 step=RoundStepNewHeight
Aug 14 12:30:56 mardizzone-stg-bor-5 heimdalld[190909]: Warning: still waiting for REST server to respond... Something wrong with it, please check!
Aug 14 12:30:57 mardizzone-stg-bor-5 heimdalld[190909]: 12:30PM INF Timed out dur=1000 height=144043 module=consensus round=0 step=RoundStepPropose
Aug 14 12:30:57 mardizzone-stg-bor-5 heimdalld[190909]: Warning: still waiting for REST server to respond... Something wrong with it, please check!
Aug 14 12:30:58 mardizzone-stg-bor-5 heimdalld[190909]: Warning: still waiting for REST server to respond... Something wrong with it, please check!

These warnings repeat until the service is stopped:

Aug 14 12:32:42 mardizzone-stg-bor-5 systemd[1]: heimdalld.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 12:32:42 mardizzone-stg-bor-5 systemd[1]: heimdalld.service: Failed with result 'exit-code'.
Aug 14 12:32:42 mardizzone-stg-bor-5 systemd[1]: Stopped heimdalld.
Aug 14 12:32:42 mardizzone-stg-bor-5 systemd[1]: heimdalld.service: Consumed 2min 21.588s CPU time.

In the happy path (healthy rest-server), no such warnings are printed, e.g.

Aug 14 12:35:58 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF indexed block events height=144238 module=txindex
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Node up to date, starting bridge services module=bridge/service/
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: I[2025-08-14|12:35:59.103] service start                                service=listener msg="Starting listener service" impl=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Registering checkpoint tasks module=checkpoint service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Registering staking related tasks module=staking service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=rootchain service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Registering clerk tasks module=clerk service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Registering fee related tasks module=fee service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=checkpoint service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=staking service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Start polling for no-ack module=checkpoint pollInterval=1010000 service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=clerk service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Start polling for rootChain header blocks module=rootchain pollInterval=60000 service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF starting bor process module=span service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=borchain service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting header process module=rootchain service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Start polling for header blocks module=borchain pollInterval=300000 service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF start polling for span module=span pollInterval=60000 service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting heimdall listener module=heimdall service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Start polling for events module=heimdall pollInterval=60000 service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Self-healing is disabled module=rootchain service=listener
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting module=fee service=processor
Aug 14 12:35:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:35PM INF Starting header process module=borchain service=listener
  • InitRootCmd now uses a custom httpClient with a timeout, instead of DefaultClient, when contacting the rest server.
  • The heimdall chainId is set in the client context at bridge startup. Fetching the chainId while calling the rest-server remains available as a fallback. Example log:
Aug 14 12:34:59 mardizzone-stg-bor-2 heimdalld[194609]: 12:34PM INF ChainID set in clientCtx chainId=heimdall-3278 module=bridge/service/
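  • Fee-transfer events are emitted in the EndBlocker (app/app.go) and appear under finalize_block_events in the block results, e.g.
curl -s "http://127.0.0.1:26657/block_results?height=82339" | jq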
{
  "jsonrpc": "2.0",
  "id": -1,
  "result": {
    "height": "82339",
    "txs_results": [
      {
        "code": 2,
        "data": null,
        "log": "tx parse error",
        "info": "",
        "gas_wanted": "0",
        "gas_used": "0",
        "events": [],
        "codespace": "sdk"
      }
    ],
    "finalize_block_events": [
      {
        "type": "fee-transfer",
        "attributes": [
          {
            "key": "proposer",
            "value": "0x34b647a07c56dcd255298c40e5c652417ffaa0c4",
            "index": true
          },
          {
            "key": "denom",
            "value": "pol",
            "index": true
          },
          {
            "key": "amount",
            "value": "0",
            "index": true
          },
          {
            "key": "mode",
            "value": "EndBlock",
            "index": true
          }
        ]
      }
    ],
    "validator_updates": null,
    "consensus_param_updates": {
      "block": {
        "max_bytes": "22020096",
        "max_gas": "-1"
      },
      "evidence": {
        "max_age_num_blocks": "100000",
        "max_age_duration": "172800000000000",
        "max_bytes": "1048576"
      },
      "validator": {
        "pub_key_types": [
          "secp256k1"
        ]
      },
      "version": {},
      "abci": {
        "vote_extensions_enable_height": "1"
      },
      "blob": {}
    },
    "app_hash": "OY5mSLLTWCyUO9WMz7nLWuQNkoXzFHPlnNq9NYJvzP8="
  }
}

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

This is a breaking change: all validators (more generally, all nodes running the bridge) must now run the bridge within heimdalld (via heimdalld start --bridge --all --rest-server) instead of as a separate bridge CLI process.

Checklist

  • I have added at least two reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments — if any — by pushing each fix in a separate commit and linking the commit hash in the comment reply

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on the local environment
  • I have tested this code manually on a remote devnet using express-cli
  • I have tested this code manually on amoy/mumbai
  • I have created new e2e tests into express-cli

Remotely, this has been tested on a fresh devnet built from the PR's source branch, and also with a staggered upgrade starting from develop. Once the candidate release branch is cut from develop, it is recommended to test everything again with a staggered upgrade on top of the current latest release.

@marcello33 marcello33 requested a review from a team August 14, 2025 15:25
@kamuikatsurgi kamuikatsurgi requested a review from Copilot August 18, 2025 05:08

Copilot AI left a comment

Pull Request Overview

This PR implements significant improvements to bridge service integration and reliability. The bridge is now embedded as a child process within heimdalld instead of running as a separate process, requiring validators to use heimdalld start --bridge --all --rest-server. The PR also improves error handling for unresponsive REST servers with timeout mechanisms and better logging, validates bor_chain_id in checkpoint workflows, and includes dependency updates.

Key changes:

  • Bridge service embedded within heimdalld as a child process instead of standalone operation
  • Enhanced REST server timeout handling with configurable warnings after 30 minutes
  • Chain ID validation in checkpoint processing to prevent mismatched submissions
  • Dependency updates including cosmos-sdk bump and websocket library replacement

Reviewed Changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| bridge/service/bridge.go | New bridge service implementation for embedded operation within heimdalld |
| cmd/heimdalld/cmd/commands.go | Enhanced REST server polling with timeout handling and bridge integration |
| x/checkpoint/keeper/side_msg_server.go | Added bor chain ID validation in checkpoint processing |
| bridge/broadcaster/broadcaster.go | Improved account polling logic with context-aware waiting |
| bridge/util/common.go | Updated GetAccount function to accept context parameter |
| app/app.go | Added fee transfer event emission in EndBlocker |
| go.mod | Updated cosmos-sdk version and websocket library replacement |

@kamuikatsurgi kamuikatsurgi added the squash merge Use squash merge to merge this PR. label Aug 18, 2025
@marcello33 marcello33 merged commit 1d99d2a into develop Aug 19, 2025
8 of 9 checks passed
marcello33 added a commit that referenced this pull request Sep 2, 2025
* Delete spans backfill

* Always vote NO on spans backfill

* Return type assertion for MsgBackfillSpan

* Merge pull request #413 from 0xPolygon/kamui/bump-kurtosis

chore: remove validator test case and bump kurtosis

* chore: side msg and abci handler metrics (#410)

* chore: add side msg metrics

* timer metrics for Pre, Begin, and End blocker funcs

* chore: nits

* abci handler metrics

* chore: use Namespace from metrics package

* misc: migrated from maticnetwork to 0xPolygon (#420)

* misc: migrated from maticnetwork to 0xPolygon

* misc: bump cosmos to v0.2.3-polygon

* misc: bump bor

* misc: addressed comments

* chore: bump kurtosis (#422)

* chore: bump kurtosis

* chore: add milestones test

* fix: test case name

* chore: update test_runner image

* Re-enable voting power check in tally votes (#409)

* app,helper,x/stake: add logic to store penultimate valset

* app: use correct valset in ValidateVoteExtensions

* app: fix if condition

* app: add condition for genesis

* app,x/chainmanager: track initial chain height for UTs

* app,helper,x/milestone: re-enable skipped validator not found errors

* app: fix some tests and refactor a bit

* app,bridge: fix more tests

* app,x/milestone: some cleanup

* app: dedup redundant stubs

* app: more cleanup

* app,helper,x/chainmanager: move setting of initial height outside of x/chainmanager

* app: improved error logging

* app,x/chainmanager: rm unused key + minor nit

* app: deterministically extract max hash and VP from non-rp VE

* x/stake: fix lint

* helper: refactoring

* app: add comment

* bump go to 1.24.6

* helper: rm unnecessary todo

* fix: build (#423)

* Fix generate-keystore command (#424)

* cmd: allow generate-keystore to accept private key

* cmd: add --generate-new flag

* update: cosmos-sdk (#425)

* helper: fix conflicts

* feat: bump kurtosis and migrate to pos-workflows (#431)

* feat: remove matic-cli e2e-tests (#432)

* bridge: improvements / app: improvements (#427)

* improve bridge, remove bridge cli, minor fixes

* validate bor_chain_id in checkpoints flow

* test fee-transfer events emissions

* remove test block

* logs on rest server being unresponsive

* improve initRootCmd

* don't return on ctx done / restServerTimeOutInMinutes to 1m for tests

* improve logs / restore restServerTimeOutInMinutes

* sort imports

* fix comment

* fix tests and linter issues

* bump cosmos-sdk dep

* move const outside of the func

* refactor StartWithCtx function

* address comments

* log anomaly if account not found with node being in sync

* Set voting power and valset check heights for amoy (#435)

* helper: set tallyFixHeight and disableVPCheckHeight for amoy

* helper: set disableValSetCheckHeight for amoy

* helper: set initialHeight for amoy

* api,proto,x/stake: add penultimate valset to genesis export + add some tests (#436)

* set tallyfix mainnet hf heights

---------

Co-authored-by: Angel Valkov <avalkov@polygon.technology>
Co-authored-by: Krishang Shah <109511742+kamuikatsurgi@users.noreply.github.com>
Co-authored-by: Pratik Patil <pratikspatil024@gmail.com>
Co-authored-by: Raneet Debnath <35629432+Raneet10@users.noreply.github.com>
Co-authored-by: Raneet Debnath <raneetdebnath10@gmail.com>