Monitoring your validator
Watch the latest entries in your log file for errors. The log moves fast, so you may need to stop the tail, read the output, and then restart it. Here are several ways to run it.
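As one illustration, the sketch below filters the end of the log for warnings and errors so that fast-moving output is easier to read. The log path (~/validator.log), the ERROR|WARN pattern, and the function name recent_log_errors are assumptions for illustration, not values from this guide; adjust them to match your setup.

```shell
#!/usr/bin/env bash
# Sketch: show recent warnings and errors from a validator log.
# LOG_FILE is an assumed path; point it at your own log file.
LOG_FILE="${LOG_FILE:-$HOME/validator.log}"

# Print lines matching ERROR or WARN from the last N lines of a log (default 500).
recent_log_errors() {
  local file="${1:-$LOG_FILE}" lines="${2:-500}"
  tail -n "$lines" "$file" | grep -E 'ERROR|WARN'
}

# Live view: follow the log and show only matching lines (Ctrl+C to stop).
#   tail -f "$LOG_FILE" | grep --line-buffered -E 'ERROR|WARN'
```

Calling recent_log_errors with no arguments reads the default log path; pass a file and a line count to look further back.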
When the validator first starts up, it asks the entrypoint listed in the start script, "What IP address do you see me at?" It then asks the entrypoint whether the ports it needs are open. You can check this part of the startup sequence with the last tail option below, which greps for the search term and prints 10 lines before and 50 lines after it, so you can see whether you are reachable. The output contains a lot of useful connection debugging information. See screenshot below.
Run as xand user
This will restart your validator then grep the log file for initial connection info:
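The context search described above (10 lines before and 50 lines after the match) can be sketched as follows. The log path, the function name connection_info, and the service name in the comments are hypothetical, and the search term is left as a placeholder because it depends on what your validator prints at startup.

```shell
#!/usr/bin/env bash
# Sketch: print the startup connection checks from the validator log.
# LOG_FILE is an assumed path; use the log file from your own setup.
LOG_FILE="${LOG_FILE:-$HOME/validator.log}"

# Print 10 lines before and 50 lines after each line that matches the term.
connection_info() {
  local term="$1" file="${2:-$LOG_FILE}"
  grep -B 10 -A 50 -- "$term" "$file"
}

# Typical use after restarting the validator (the service name is hypothetical):
#   sudo systemctl restart validator.service
#   connection_info "<your search term>"
```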
Create a monitor script to easily run the monitor command
Run as xand user
Create a blank file named mon.sh in your home directory with your editor.
Copy the code block into the file, correcting your ledger path if needed. Save and exit.
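A minimal sketch of what mon.sh might contain is shown here. It assumes the validator binary is named solana-validator and accepts --ledger <path> followed by the monitor subcommand, as upstream Solana validator builds do. The binary name, the default ledger path, and the run_monitor function are assumptions, so substitute the values from your own install.

```shell
#!/usr/bin/env bash
# Sketch of a monitor wrapper. Both defaults below are assumptions.
VALIDATOR_BIN="${VALIDATOR_BIN:-solana-validator}"   # assumed binary name
LEDGER_PATH="${LEDGER_PATH:-$HOME/ledger}"           # correct this to your ledger path

# Check that the ledger directory exists, then start the monitor view.
run_monitor() {
  if [ ! -d "$LEDGER_PATH" ]; then
    echo "ledger directory not found: $LEDGER_PATH" >&2
    return 1
  fi
  "$VALIDATOR_BIN" --ledger "$LEDGER_PATH" monitor
}
```

Adding a final run_monitor line makes the file start the monitor when it is executed as ./mon.sh. The directory check is there so that a wrong ledger path fails with a clear message.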
Make the file executable
Run the monitor from home dir
NOTE: Press Enter to start a new line so you can compare against the old values, and press Ctrl+C to exit the monitor command.
Create a catchup script that compares your machine to the RPC that you are connected to.
Create a blank file named catchup.sh in your home directory with your editor.
Copy the code block into the file. Save and exit.
Note: If you use the ALT method for catchup because localhost is not working for you, be sure to get your validator identity pubkey with solana-keygen pubkey ~/validator-keypair.json
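A minimal sketch of what catchup.sh might contain, covering both the localhost method and the ALT method. It assumes the solana CLI with its catchup subcommand and --our-localhost option is available in your build; the variable names and the run_catchup function are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a catchup wrapper. Binary names and keypair path are assumptions.
SOLANA_BIN="${SOLANA_BIN:-solana}"
KEYGEN_BIN="${KEYGEN_BIN:-solana-keygen}"
KEYPAIR="${KEYPAIR:-$HOME/validator-keypair.json}"

# run_catchup        compares the local node (localhost RPC) to the cluster RPC
# run_catchup alt    ALT method: looks the node up by its identity pubkey
run_catchup() {
  local mode="${1:-local}" pubkey
  if [ "$mode" = "alt" ]; then
    pubkey="$("$KEYGEN_BIN" pubkey "$KEYPAIR")" || return 1
    "$SOLANA_BIN" catchup "$pubkey"
  else
    "$SOLANA_BIN" catchup --our-localhost
  fi
}
```

As with mon.sh, add a final run_catchup (or run_catchup alt) line so the file does its work when executed as ./catchup.sh.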
Make the file executable
Run the catchup script from home dir
NOTE: Press Enter to start a new line so you can compare against the old values, and press Ctrl+C to exit the catchup command.
NOTE: xandeum-watchtower is an optional monitoring system that runs on a separate computer and alerts you in your own Discord server. It requires setup that is not shown in this guide.
Run as xand user without sudo
Watchtower should be run from a remote computer that is running 24/7. It works by asking the RPC node whether your validator passes all the sanity checks. It can be added as a service or run in a tmux window that never closes. You will need the software compiled to the point that xandeum -V works after a reboot. You will need to create a Discord or Slack channel with a webhook to make this work. Telegram, PagerDuty, and Twilio are also supported. This example script checks every <interval> seconds and alerts Discord and Slack if <unhealthy-threshold> failures occur in a row. For example, 3 failures at 300-second intervals means an alert after 900 seconds.
Multiple scripts can run at the same time with different --validator-identity values and send alerts to the same channel. Use --name-suffix to identify which machine is failing.
NOTE: If our RPC node goes down or is unreachable from your location, you will get false positives reporting that your machine is down. To reduce these, you can add this flag if desired:
--ignore-http-bad-gateway
Ignore HTTP 502 Bad Gateway errors from the JSON RPC URL. This flag can help reduce false positives, at the expense of no alerting should a Bad Gateway error be a side effect of the real problem
Add text to file and modify for your needs:
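A minimal sketch of such a script, using only the flags named in this guide plus the interval and threshold values discussed here (300 seconds and 3 failures). The spelling of --interval and --unhealthy-threshold is inferred from the placeholders above and matches upstream solana-watchtower, as does the DISCORD_WEBHOOK environment variable; treat all three as assumptions for your build. The default name suffix and the function names are hypothetical, and RPC endpoint selection is not shown.

```shell
#!/usr/bin/env bash
# Sketch: run watchtower for one validator and alert through a Discord webhook.
WATCHTOWER_BIN="${WATCHTOWER_BIN:-xandeum-watchtower}"
VALIDATOR_IDENTITY="${VALIDATOR_IDENTITY:-}"   # your validator identity pubkey
NAME_SUFFIX="${NAME_SUFFIX:-validator-1}"      # hypothetical label shown in alerts
INTERVAL=300             # seconds between checks
UNHEALTHY_THRESHOLD=3    # consecutive failures before an alert is sent

# Approximate seconds of continuous failure before an alert fires.
alert_delay_seconds() {
  echo $((INTERVAL * UNHEALTHY_THRESHOLD))
}

run_watchtower() {
  if [ -z "$VALIDATOR_IDENTITY" ] || [ -z "${DISCORD_WEBHOOK:-}" ]; then
    echo "set VALIDATOR_IDENTITY and DISCORD_WEBHOOK first" >&2
    return 1
  fi
  export DISCORD_WEBHOOK   # the notifier reads the webhook URL from the environment
  "$WATCHTOWER_BIN" \
    --validator-identity "$VALIDATOR_IDENTITY" \
    --name-suffix "$NAME_SUFFIX" \
    --interval "$INTERVAL" \
    --unhealthy-threshold "$UNHEALTHY_THRESHOLD"
}
```

To tolerate RPC outages, --ignore-http-bad-gateway can be appended to the argument list inside run_watchtower. Add a final run_watchtower line so the file starts watchtower when executed.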
Run the script! You may want this running in a tmux session so it stays active when you close your terminal. Learn Tmux
In the script, 3 errors in a row, checked at 300-second intervals, trigger an alert to our Discord webhook.
Consider next Zabbix Installation