Description
Specification
The testnet 6 deployment (#551) currently under way shows the value of having a single dashboard for tracking analytics and operational metrics of the testnet.
Right now AWS's dashboard and logging are really poor. The CloudWatch dashboard is hard to configure and doesn't update automatically when our infrastructure changes (it's not configured through our infrastructure deployment code). The logging system is also hard to navigate: there are a lot of IDs in AWS that relate to different resources, and it's hard to correlate all of them with the actual nodes that we have deployed.
Of particular note are these pages:
- https://ap-southeast-2.console.aws.amazon.com/ecs/v2/clusters/polykey-testnet/services/polykey-testnet-v270ktdd3cs3mp1r3q3dkmick92bn927mii9or4sgroeogd1peqb0/events?region=ap-southeast-2 - this shows ECS events like what the orchestrator is doing for a particular service. This is useful to know if a deployment event occurred.
- https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#dashboards/dashboard/polykey-testnet - this shows the CloudWatch dashboard of the polykey testnet, which currently only shows memory utilisation, CPU utilisation, and network input and output.
- https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logsV2:log-groups/log-group/$252Fecs$252Fpolykey-testnet - these are the log groups
What we would like instead is to aggregate information and place it on testnet.polykey.com.
Here are some examples.
- https://grafana.com/ - we could embed a Grafana dashboard to customise the kind of data we want to show
- https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/geomap/ and https://stake.rocketpool.net/network - we can show where nodes are connecting to the testnet from, as well as where the testnet nodes themselves are located geographically
Current cloudwatch:
There are some challenges though. Right now we use A records on cloudflare to route testnet.polykey.com:
[nix-shell:~/Projects/Polykey-CLI]$ dig testnet.polykey.com
; <<>> DiG 9.18.16 <<>> testnet.polykey.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61838
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4095
;; QUESTION SECTION:
;testnet.polykey.com. IN A
;; ANSWER SECTION:
testnet.polykey.com. 300 IN A 13.55.53.141
testnet.polykey.com. 300 IN A 3.106.15.126
;; Query time: 15 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
;; WHEN: Mon Oct 23 15:59:03 AEDT 2023
;; MSG SIZE rcvd: 80
You can see here that the 2 A records correspond to the Polykey testnet node container tasks.
If we navigate to testnet.polykey.com, the browser will pick one of those IPs and try to connect on port 80 or 443. We should prefer 443 of course (https by default).
Now browsers will do some sort of resolution across the multiple A records:
- https://webmasters.stackexchange.com/questions/10927/using-multiple-a-records-for-my-domain-do-web-browsers-ever-try-more-than-one
- https://serverfault.com/questions/349964/dns-round-robin-do-browsers-stick-to-one-ip-as-long-as-it-is-online
So this is actually a bit problematic. We can't use those A records for the status page, as they point to the polykey nodes directly. We would want to route to a separate service, potentially a Cloudflare worker, to show the testnet network status page visualisation.
One way to do this is through Cloudflare proxying. You can enable proxying and add rules to Cloudflare so that it can show different DNS records. I think this may not work, and the simplest solution is actually to use a different record type.
So DNS record types that are relevant could be:
- A and AAAA for the web page to show the testnet network status
- TXT or SRV records instead for the polykey nodes - the srv records look like this though:

If we do change to using SRV records, we also need to address the changes for bootstrapping into a private network.
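As a rough illustration of what node discovery over SRV records could look like on the client side, here is a minimal sketch. It assumes the `_polykey_agent._udp` service label from the tasks in this issue and uses Node's built-in `dns.promises.resolveSrv`. The helper names `orderSrvTargets` and `resolveSeedNodes` are hypothetical and not part of Polykey. The ordering is a simplification: RFC 2782 calls for weighted random selection within a priority level, whereas this just sorts by weight.

```typescript
import { promises as dns } from "node:dns";

// Shape of one SRV answer; matches the objects returned by dns.promises.resolveSrv.
interface SrvRecord {
  name: string; // target hostname, e.g. a per-node A record
  port: number;
  priority: number;
  weight: number;
}

// Order candidate targets: lowest priority value first, then highest weight first.
// Simplification (assumption): a strict sort instead of RFC 2782 weighted random selection.
function orderSrvTargets(records: SrvRecord[]): SrvRecord[] {
  return [...records].sort(
    (a, b) => a.priority - b.priority || b.weight - a.weight,
  );
}

// Hypothetical helper: look up the seed nodes for a network domain such as
// "testnet.polykey.com" and return them in the order they should be tried.
async function resolveSeedNodes(domain: string): Promise<SrvRecord[]> {
  const records = await dns.resolveSrv(`_polykey_agent._udp.${domain}`);
  return orderSrvTargets(records);
}
```

Each returned `name` would still need an A or AAAA lookup before connecting, which is where the bootstrapping changes mentioned above come in.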
Also, in terms of setting up the dashboard, we could use a Cloudflare worker, which would not be long-running. Not sure how to set this up. Another way is to always route to a Cloudflare worker, and have the worker do all the routing between the HTTP status page and the actual nodes. Cloudflare workers seem quite flexible: https://developers.cloudflare.com/workers/examples/websockets/
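To make the "always route to a worker" idea a bit more concrete, here is a minimal sketch of the routing decision such a worker might make. It is only an assumption that node traffic would arrive as WebSocket upgrade requests (following the linked WebSockets example) and that everything else should get the status page. The names `chooseBackend`, `route`, and `RequestLike` are hypothetical. `RequestLike` is a structural subset of the Fetch API `Request`, so a real worker's `fetch` handler could pass its request straight in.

```typescript
type Backend = "dashboard" | "node";

// Minimal structural view of a Fetch API Request; only header access is needed here.
interface RequestLike {
  headers: { get(name: string): string | null };
}

// Assumption: a request asking for a WebSocket upgrade is meant for a node,
// anything else is a browser asking for the HTTP status page.
function chooseBackend(upgradeHeader: string | null): Backend {
  return upgradeHeader?.toLowerCase() === "websocket" ? "node" : "dashboard";
}

// Inspect the incoming request and pick the backend it should be forwarded to.
function route(request: RequestLike): Backend {
  return chooseBackend(request.headers.get("Upgrade"));
}
```

Inside a worker, the result of `route` would decide whether to serve the dashboard response or forward the request to a node. How the forwarding itself would work for the agent's actual transport is left open here.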
Additional context
- HTTP status page for Polykey Agent #412 - that issue is about having an HTTP status page for the polykey agent directly, whereas this issue focuses on having a public testnet status page. It will be useful for traction metrics and showing a sense of the community
The above shows a sort of global network status of rocketpool, but I think grafana can show all of that too.
Tasks
- Point A records of testnet.polykey.com and mainnet.polykey.com to Dashboards.
- Point A records of ${nodeId}.testnet.polykey.com to nodes
- Point _polykey_agent._udp.testnet.polykey.com SRV records to ${nodeId}.testnet.polykey.com A records.
- Change testnet.polykey.com and mainnet.polykey.com records to point towards the dashboard
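The record layout described in the tasks above can be sketched as a small function that produces the intended record set for one network. This is illustrative only: `planRecords`, `DnsRecord`, and `NodeInfo` are hypothetical names. The SRV priority and weight are fixed at 0 as an assumption. The agent port is a parameter because this issue does not specify it. A real zone file would write the SRV target as a fully qualified name with a trailing dot.

```typescript
interface DnsRecord {
  type: "A" | "SRV";
  name: string;
  value: string;
}

interface NodeInfo {
  nodeId: string;
  ip: string;
}

// Build the planned records for one network (e.g. "testnet.polykey.com"):
// the apex A record points at the dashboard, each node gets its own A record,
// and one SRV record per node points at that node's A record.
function planRecords(
  network: string,
  dashboardIp: string,
  nodes: NodeInfo[],
  port: number,
): DnsRecord[] {
  const records: DnsRecord[] = [
    { type: "A", name: network, value: dashboardIp },
  ];
  for (const { nodeId, ip } of nodes) {
    const host = `${nodeId}.${network}`;
    records.push({ type: "A", name: host, value: ip });
    // SRV data is "priority weight port target"; 0 0 is an assumed default.
    records.push({
      type: "SRV",
      name: `_polykey_agent._udp.${network}`,
      value: `0 0 ${port} ${host}`,
    });
  }
  return records;
}
```

With this layout a browser visiting the apex domain only ever sees the dashboard address, while agents discover nodes through the SRV records.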

