This is a great idea, but only when it's on-prem.
Somewhere, on some thread, there's going to be a round trip between my servers and yours, and once I'm at a scale where this sort of thing matters, I'm going to want this on-prem.
What's the difference between this and checking a local cache before firing the request, then marking the service down in that cache so my other systems can see it?
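Concretely, I mean something like this (a rough sketch; the Redis key scheme and helper names are made up, not anything from the article):

    import Redis from "ioredis";

    // Shared cache on my own network, not a third-party hop.
    const redis = new Redis();

    // Hypothetical key scheme: presence of the key means "service is down".
    const downKey = (svc: string) => `breaker:down:${svc}`;

    async function callWithLocalBreaker<T>(svc: string, fn: () => Promise<T>): Promise<T> {
      // Check the shared cache before firing the request.
      if (await redis.get(downKey(svc))) {
        throw new Error(`${svc} is marked down`);
      }
      try {
        return await fn();
      } catch (err) {
        // Mark the service down with a TTL so every box sees it and it self-heals.
        await redis.set(downKey(svc), "1", "EX", 30);
        throw err;
      }
    }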
I'm also concerned about false positives, or a single bad system tripping the breaker for everyone. If it's a false positive, the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working over none when money is in play.
You also state that "The SDK keeps a local cache of breaker state" -- if I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's in a shared cache like Redis or Memcached, I'm better off using my own network for a "sub microsecond response" than going over the wire to talk to your service.
I've fought huge cascading failures in production at very large social media companies. It takes more than breakers to solve these problems. Backpressure is a critical component, and often turning things off completely isn't the best approach.
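By backpressure I mean something in this direction (a toy sketch, names made up): bound the work in flight and push back on callers, rather than flipping the dependency fully off:

    // Toy sketch of backpressure via a concurrency bound: when at capacity,
    // reject new work immediately instead of queueing it unbounded or
    // killing the dependency outright.
    class BoundedExecutor {
      private inFlight = 0;
      constructor(private readonly limit: number) {}

      async run<T>(task: () => Promise<T>): Promise<T> {
        if (this.inFlight >= this.limit) {
          throw new Error("shedding load: at capacity"); // caller backs off or degrades
        }
        this.inFlight++;
        try {
          return await task();
        } finally {
          this.inFlight--;
        }
      }
    }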
Now I have seen it all... a SaaS solution for making local outages global.
const openfuse = new OpenfuseCloud(...);
What happens when your service goes down?
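Every call site ends up having to guard the guard. Rough sketch; OpenfuseCloud's actual API isn't shown, so the interface and check() here are made up:

    // Hypothetical client shape: true = breaker open.
    interface BreakerClient {
      check(name: string): Promise<boolean>;
    }

    // If the breaker service itself is slow or down, fail open with a tight
    // timeout so a local outage on their side doesn't become a global one on mine.
    async function isOpen(client: BreakerClient, name: string, timeoutMs = 50): Promise<boolean> {
      const timeout = new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("breaker service timeout")), timeoutMs),
      );
      try {
        return await Promise.race([client.check(name), timeout]);
      } catch {
        return false; // treat "can't reach the breaker service" as closed
      }
    }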