Idempotency is the only sustainable answer here. Whether it's at the database level using unique constraints or implementing idempotency keys in your API headers, you have to design for the 'at-least-once' delivery reality. I usually implement a 'processed_requests' table that stores the unique ID of the job. Before the worker executes any side effect (like a payment or email), it checks if that ID exists. If it does, it skips the execution and returns the previous result. It adds a bit of latency, but it's much cheaper than dealing with double-billing or corrupted data.
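A minimal sketch of that 'processed_requests' check, using SQLite so it runs anywhere. The column names and the `init_db` / `run_once` helpers are assumptions for illustration, not a specific library's API. Note this is a plain check-then-act, so on its own it does not protect against two workers handling the same job at the same moment.

```python
import sqlite3


def init_db(conn):
    # job_id is the primary key, so the same job can never be recorded twice.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_requests ("
        "job_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )


def run_once(conn, job_id, side_effect):
    """Run side_effect for job_id unless a result is already stored for it."""
    row = conn.execute(
        "SELECT result FROM processed_requests WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is not None:
        # Already processed: skip the side effect and return the previous result.
        return row[0]

    result = side_effect()
    conn.execute(
        "INSERT INTO processed_requests (job_id, result) VALUES (?, ?)",
        (job_id, result),
    )
    conn.commit()
    return result
```

If the worker crashes after the side effect but before the insert, a redelivery will run the side effect again, which is why the rest of this thread ends up talking about claims and lookups.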
Use something like Temporal
I actually talked to someone at Temporal about this recently. Temporal gives you the primitives to handle it (activities, configurable retries, interceptors), but you still have to implement the deduplication logic yourself for each external integration.
His advice was: Temporal solves orchestration, but making the external API calls idempotent is on you. For simple cases, write the observe activities manually. For complex cases, build an abstraction.
That's what led me down this path - trying to figure out if the abstraction is worth building or if manual is good enough.
Have you used Temporal for this? How do you handle the idempotency of external calls?
You proxy those API calls yourself and add idempotency to cover the APIs that don’t have it. If you architect it right you won’t add more than a millisecond of latency. You can avoid the race condition issues by using atomic records, so if another process tries the same call it sees that one is already in progress and exits.
This is exactly the approach I took. Proxy layer that:
- Uses atomic records (fence tokens) to prevent concurrent execution
- Checks external system first before retrying (the retrieval step)
- Records result for future lookups
The atomic records part is critical - I learned the hard way that just checking a DB flag isn't enough (process can freeze between check and execute, lease expires, another process takes over, both execute).
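For anyone following along, here is roughly what an atomic claim record can look like, sketched with SQLite. The `claims` table and the `try_claim` / `proxy_call` names are made up for this example. The point is that the claim is a single `INSERT OR IGNORE`, so there is no gap between "check" and "mark in progress". Lease expiry and the retrieval step are deliberately left out of this sketch.

```python
import sqlite3


def init_claims(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)"
    )


def try_claim(conn, job_id):
    """Atomically claim job_id; only the caller whose insert wins gets True."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO claims (job_id, status) VALUES (?, 'in_progress')",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def proxy_call(conn, job_id, call_api):
    """Call the external API at most once per job_id (ignoring lease expiry)."""
    if not try_claim(conn, job_id):
        # Someone else owns this job: report its state instead of calling again.
        row = conn.execute(
            "SELECT status, result FROM claims WHERE job_id = ?", (job_id,)
        ).fetchone()
        return (row[0], row[1])

    result = call_api()
    conn.execute(
        "UPDATE claims SET status = 'done', result = ? WHERE job_id = ?",
        (result, job_id),
    )
    conn.commit()
    return ("done", result)
```

A second caller gets back `("in_progress", None)` or `("done", <result>)` and can decide whether to wait or exit.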
How do you handle the case where:
1. Process acquires atomic lock
2. Calls external API successfully
3. Process freezes before releasing lock
4. Lock expires, new process acquires it
5. New process calls API again → duplicate
Do you just accept this edge case (rare but possible)? Or is there a mitigation I'm missing?
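To make the failure concrete, the five steps can be reproduced with a toy lease lock and a fake clock. `LeaseLock` and `simulate_frozen_worker` are hypothetical names for this illustration only; the times are passed in by hand so nothing actually sleeps.

```python
class LeaseLock:
    """Toy lock with a time-to-live; the caller supplies the current time."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, now):
        # The lock is free if nobody holds it or the previous lease has expired.
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False


def simulate_frozen_worker():
    api_calls = []  # stands in for the external API's view of the world
    lock = LeaseLock(ttl=10.0)

    assert lock.acquire("worker-a", now=0.0)   # 1. acquires the lock
    api_calls.append("worker-a")               # 2. external call succeeds
    # 3. worker-a freezes here and never records a result or releases the lock

    if lock.acquire("worker-b", now=11.0):     # 4. lease expired, worker-b takes over
        api_calls.append("worker-b")           # 5. second call: duplicate
    return api_calls
```

The lock behaves exactly as designed and the external system still sees two calls, so the lock alone cannot close this gap.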
I think the answer is probably like most things: it depends.
- If the external service supports idempotent operations, use that option.
- If the external service doesn't, but has a "retrieval" feature (i.e., a way to look up whether the thing already exists, e.g., fetching the refunds on a given payment), use that first.
- If the system has neither, assess how critical it is to avoid duplicates.
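The second option (look up first, then act) can be sketched like this, using the refund example. `FakeRefundAPI` is an in-memory stand-in, not a real client, and the `reference` field is an assumption: it only works if the external service lets you attach and read back your own identifier.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRefundAPI:
    """Stand-in for a service with no idempotency keys but with a lookup."""

    refunds: dict = field(default_factory=dict)  # payment_id -> list of refunds

    def list_refunds(self, payment_id):
        return list(self.refunds.get(payment_id, []))

    def create_refund(self, payment_id, amount, reference):
        record = {"amount": amount, "reference": reference}
        self.refunds.setdefault(payment_id, []).append(record)
        return record


def refund_once(api, payment_id, amount, reference):
    """Observe before acting: reuse an existing refund with the same reference."""
    for existing in api.list_refunds(payment_id):
        if existing["reference"] == reference:
            return existing
    return api.create_refund(payment_id, amount, reference)
```

This narrows the duplicate window rather than closing it: two callers can still both see "no refund yet" and both create one, so it works best combined with a claim record in front of it.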
This matches my thinking. The retrieval/lookup approach is exactly what I built - basically Option C with an observe-before-act pattern.
For APIs that support idempotency keys (Stripe, etc.), I use those. For ones that don't but have retrieval (most do), I check first before retrying.
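For the idempotency-key case, the part that is easy to get wrong is generating a fresh random key on every retry, which defeats the purpose. A small sketch of deriving the key deterministically from the job instead. The namespace string and the helper names are assumptions; `Idempotency-Key` is the header Stripe documents, and other providers may use a different name.

```python
import uuid

# Hypothetical namespace so keys from this app don't collide with other UUIDv5 uses.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def idempotency_key(job_id, operation):
    """Same job and operation always map to the same key, across retries and workers."""
    return str(uuid.uuid5(NAMESPACE, f"{operation}:{job_id}"))


def build_headers(job_id, operation):
    # Attach the key to the outgoing request; the HTTP call itself is omitted here.
    return {"Idempotency-Key": idempotency_key(job_id, operation)}
```

Because the key depends only on the job ID and the operation name, a worker that takes over after a crash sends the same key as the original attempt.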
The question I'm wrestling with: is the extra round-trip for the lookup worth it? Or should I just accept the edge cases where it duplicates?
What's your threshold for "critical enough to avoid duplicates"? Payments obviously yes, but what about notifications, reporting, analytics events?