Hi, I would like to set up a validator on two machines in failover mode. Could anyone please guide me on how to do this?
The docs don’t say anything about running multiple validators as a failover, so I’m not sure whether or not it’s possible, but it’s absolutely something I’d be interested in.
Sorry I can’t help, but let me know if you find any info on that anywhere else.
I found some info in pre-release docs here:
But as far as I know it’s not official yet and not supported by versions below 1.8.
I’ve tried to set it up based on that doc, but it’s not clear enough for me, e.g. where exactly to put the certificate files or how to automatically switch validators in case of failure.
If you figure it out, I’d be interested.
Oh, cool, I’ll look into that!
I didn’t know about the pre-release docs. I’ll let you know if I figure anything out.
It looks like it doesn’t matter exactly where you put the certs, because you pass the path to the cert files directly on the command line.
I think you have to bring your own monitoring solution. So you’d need something like a script that continually tries to ping the primary validator, and if that fails, runs the command in Triggering a Failover via Monitoring.
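Something like this is what I have in mind, a minimal watchdog sketch. The `getHealth` JSON-RPC method is a real Solana RPC call, but the RPC URL, the retry threshold, and the `promote_secondary` placeholder are all assumptions you’d adapt:

```shell
#!/usr/bin/env bash
# Minimal failover watchdog sketch (untested assumptions: URL, threshold, action).
RPC_URL="${RPC_URL:-http://127.0.0.1:8899}"
FAILS=0
MAX_FAILS=3   # how many consecutive failed checks before we act

check_health() {
  # Solana's JSON-RPC getHealth returns {"result":"ok"} when the node is healthy.
  curl -s -m 5 "$RPC_URL" -X POST -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' |
    grep -q '"result":"ok"'
}

promote_secondary() {
  # Placeholder: put the actual failover command from the docs here
  # (whatever "Triggering a Failover via Monitoring" says to run).
  echo "primary failed ${MAX_FAILS} checks in a row; triggering failover"
}

watchdog() {
  while true; do
    if check_health; then
      FAILS=0
    else
      FAILS=$((FAILS + 1))
      if [ "$FAILS" -ge "$MAX_FAILS" ]; then
        promote_secondary
        break
      fi
    fi
    sleep 30
  done
}
```

Requiring several consecutive failures before promoting avoids flipping over on a single dropped request.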
It looks like running validators in failover mode is a bit more of an advanced user feature right now.
I have a lot of experience with DevOps/automation stuff, so I might take a crack one day at automating the whole process.
I’ve kind of wanted to get more into how to run validators efficiently and I didn’t know that failover was going to be a possibility, so thanks for pointing this out!
I thought it could be done using tools like Telegraf and InfluxDB. I used them to visualize my validator’s operating parameters.
I followed this guide:
So, the switch could be triggered when the first validator stops voting. But I didn’t find any guide on how to do this. I’m not experienced enough to create such a script, but maybe you are? What do you think?
Yes, I could figure it out. Just not sure when I’ll get time. I’d have to get a local setup going to start testing.
It looks like the trick would be to run a bash script that just loops infinitely and uses the influx query command to execute a query on the voting data, then pipes it to grep to find out when it stops voting, and if that happens, runs the failover command.
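As a rough sketch of the influx-query approach, something like this could pull the latest status out of InfluxDB. The database and measurement names (`metrics`, `validators_info`) are guesses, not taken from the guide, and `-execute`/`-format csv` assume the InfluxDB 1.x CLI:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pull the latest reported validator status from InfluxDB.
# Database/measurement/field names are assumptions -- check your own schema.

latest_status() {
  # -format csv gives a header row followed by one data row like:
  #   name,time,last
  #   validators_info,1700000000000000000,Validating
  influx -database metrics -format csv \
    -execute 'SELECT last(status) FROM validators_info' |
    tail -n 1 | cut -d, -f3
}

# A failover loop could then compare this against "Validating":
#   [ "$(latest_status)" = "Validating" ] || run_failover_command
```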
It might be even simpler than that, though, because instead of using influx to figure out when it stops voting, you might just be able to curl the validator’s RPC endpoint.
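For the curl route, `getVoteAccounts` is a real Solana RPC method that reports each vote account’s `lastVote` slot (and lists non-voting nodes under `delinquent`). Here’s a sketch; the RPC URL and the idea of treating “not in `current`” as unhealthy are my assumptions:

```shell
#!/usr/bin/env bash
# Sketch: read the last vote slot for a given validator identity via RPC.
RPC_URL="${RPC_URL:-http://127.0.0.1:8899}"

last_vote_slot() {
  # $1 = the validator's identity pubkey (caller supplies it).
  # Prints the lastVote slot and exits 0 if the node is voting;
  # exits non-zero if it isn't listed among current voters.
  curl -s -m 5 "$RPC_URL" -X POST -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"getVoteAccounts"}' |
    python3 -c '
import json, sys
ident = sys.argv[1]
result = json.load(sys.stdin)["result"]
for acct in result["current"]:
    if acct["nodePubkey"] == ident:
        print(acct["lastVote"])
        sys.exit(0)
sys.exit(1)  # delinquent or unknown -> treat as not voting
' "$1"
}
```

A loop could call this twice some seconds apart and trigger failover if the slot stops advancing, which avoids needing influx at all.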
Not sure of the details, because I haven’t personally run anything other than
Again, not sure exactly when I’ll get to this or where it sits on my priorities yet, but I’ll definitely get back to you if I figure anything out.
Edit: just found this in that Solana monitoring guide you linked to:
The solution consist of a standard telegraf installation and one bash script “monitor.sh” that will get all server performance and validator performance metrics every 30 seconds and send all the metrics to a local or remote influx database server.
So that means the failover script wouldn’t have to deal with influx at all; it could just use whatever logic the monitor.sh script uses to detect votes. That looks like it should be pretty easy then, but I haven’t tested anything.
That would be great, thanks for your interest in the subject.
I don’t know if this would be helpful, but I found that there are “status” and “pctVote” fields in the Influx database. I have a validator running in the devnet cluster, and when everything is fine, status has the value “Validating” and pctVote (vote percentage) is between 82% and 86%. Maybe those parameters would be good for determining validator health?
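Building on those two fields, a health predicate could be as simple as this. The field names come from the post above; the 80% cutoff is an arbitrary assumption (just below the 82–86% range observed), and pctVote is assumed to be an integer percentage:

```shell
#!/usr/bin/env bash
# Sketch: decide health from the "status" and "pctVote" values reported
# by the monitoring setup. Threshold of 80% is an assumption.

is_healthy() {
  local status="$1" pct_vote="$2"
  [ "$status" = "Validating" ] && [ "$pct_vote" -ge 80 ]
}

# Usage idea: fetch the latest status/pctVote from influx (or monitor.sh's
# own logic), then:
#   is_healthy "$status" "$pct_vote" || trigger_failover
```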