Requirements
Service view
- GEANT proxy
- Will use test instance to monitor
- Will develop full chain test against this proxy
- InAcademia
- Would like multiple endpoints reported on
- eduroam
- Please re-use data from eduroam technical monitor site
- eduGAIN
- Report on eduGAIN metadata and supporting sites
- DO NOT query eduGAIN metadata itself
- eduTEAMS
- Per T&I service dashboard required
- No endpoints discussed yet
Dashboard
- (MUST) Separate status reports on a per-service basis
- (MUST) Publicly available part: per-service public status pages (also for multiple monitors) with a simple overview dashboard. (Open question: do we need to show more detailed information so that the source of a fault can be identified?)
- (MUST) Login protected parts
- (MUST) Allow further links to specific service components, if listed
- (SHOULD) Allow messages to be presented at a public page
- (SHOULD?) An overall overview for a specific stakeholder (EC) may still have value, even if it is not useful for individual services.
- (COULD?) Consider a representation for complex flows, e.g. ones that include login.
Data sources
- (MUST) Checks: HTTP(S) keyword(s), ping, port
- (SHOULD) API access
- (COULD) Optional: check specific headers
- (COULD) Authenticated access to pages (HTTP basic/digest) (is this just for data collection, or also for dashboard links?)
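The MUST-level checks above can be illustrated with a minimal sketch, assuming Python and only its standard library. The function names (`body_has_keywords`, `check_http_keywords`, `check_port`) and the timeout defaults are hypothetical and are not taken from any of the candidate products. A ping (ICMP) check is left out because it normally needs raw-socket privileges.

```python
import http.client
import socket
import urllib.request


def body_has_keywords(body: str, keywords: list[str]) -> bool:
    """Return True only if every keyword occurs in the response body."""
    return all(keyword in body for keyword in keywords)


def check_http_keywords(url: str, keywords: list[str], timeout: float = 10.0) -> bool:
    """Fetch url and report whether it answered 200 with all keywords present."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and body_has_keywords(body, keywords)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def check_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Report whether a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitor would call these on a schedule, for example `check_port("status.eduroam.org", 443)`, and record each boolean result per service.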
Access and visibility
- (MUST) Specifically branded to the service in terms of URL and use of logos and other branding (colour schemes?); CNAME possible per status page
- (MUST) Publicly available top-level landing for a service-specific status page, redirecting from a generic URL to the specific page at the vendor (either a feature of the service to be procured or perhaps a simple DNS-based redirect), e.g. status.edugain.org and status.eduroam.org.
- "Sub" services, e.g. the eduGAIN access check, might be visible as part of status.edugain, but not as separate pages???
- (SHOULD) Multiple users
- (SHOULD) Separation of duties (multiple users for separate monitors)
- (COULD) 2FA access to the management portal
To be discussed
- Compound states based on several sources and rules?
- Filters? (Example: results can be noisy. During the assessment phase, one test site was working without error, yet a number of the globally distributed testing servers reported a problem with it. No pattern was found for this behaviour.)
- Charts?
- Long term history and trends?
- Usage/load stats?
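To make the "compound states" and "filters" questions above more concrete, here is a small sketch in Python of one possible approach. Everything in it is an assumption for discussion: the `State` values, the `quorum` threshold, the notion of "critical" checks and the check names are hypothetical, and none of it reflects a feature confirmed for any candidate product.

```python
from enum import Enum


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def filter_probe_results(results: list[bool], quorum: float = 0.5) -> bool:
    """Noise filter: a check counts as OK unless more than `quorum`
    of the distributed probes reported a failure."""
    if not results:
        raise ValueError("no probe results")
    failures = results.count(False)
    return failures / len(results) <= quorum


def compound_state(checks_ok: dict[str, bool], critical: set[str]) -> State:
    """Rule: any failed (or missing) critical check means DOWN,
    any other failed check means DEGRADED, otherwise UP."""
    if any(not checks_ok.get(name, False) for name in critical):
        return State.DOWN
    if not all(checks_ok.values()):
        return State.DEGRADED
    return State.UP
```

For example, a single failing probe out of three would be filtered out as noise, and a failing non-critical check would mark the service as degraded rather than down:

```python
ok = filter_probe_results([True, True, False])
state = compound_state({"metadata": ok, "access_check": False}, critical={"metadata"})
```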
Features
Compare features of
- Pingdom https://www.pingdom.com/
- UptimeRobot https://uptimerobot.com/
Some checks from https://docs.google.com/presentation/d/1aeP1R_RK9wNipguSYbcDbyurAkFQD7BV31QcR99f6YQ/edit#slide=id.g7183f71b4a_0_48 onward