If you prefer to watch a video, this webhook architecture diagram was discussed by Svix founder and CEO Tom Hacohen at length in this video: https://youtu.be/4jvV75OD620.
You can find the transcript of the call on our blog.
Core Features
There are several core features that we believe are necessary to deliver a world-class developer experience for webhooks. Each one is aimed at making the system more reliable, scalable, and secure.
Retries
Customer endpoints will inevitably fail. Without retries, your users will have to manually retrieve any webhooks that were sent during a service outage. If you instead requeue each failed webhook on an exponential backoff schedule, short outages resolve automatically, and users get time to fix longer-lasting issues before any messages are lost.
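As a rough illustration of exponential backoff, the sketch below computes the wait before each retry. The function names, the base delay, the growth factor, the cap, and the jitter spread are all assumptions chosen for the example, not values prescribed here.

```python
import random


def retry_delays(max_attempts=6, base_seconds=5.0, factor=4.0, cap_seconds=36000.0):
    """Return the wait (in seconds) before each retry.

    Each delay is base_seconds * factor**n, capped at cap_seconds.
    All default values are illustrative assumptions.
    """
    return [min(cap_seconds, base_seconds * factor ** n) for n in range(max_attempts)]


def with_jitter(delay, spread=0.2, rng=random):
    """Randomize a delay by +/- spread so retries to one endpoint don't fire in lockstep."""
    return delay * (1 + rng.uniform(-spread, spread))


if __name__ == "__main__":
    for attempt, delay in enumerate(retry_delays(), start=1):
        print(f"retry {attempt}: wait about {with_jitter(delay):.0f}s")
```

A schedule that grows this way retries quickly after a brief blip and then backs off to hours, which is what gives users time to repair an endpoint.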
Signature Verification
Because webhooks are unauthenticated HTTP requests, your users have no way to confirm that a request comes from you unless you digitally sign it. Sign each request with HMAC-SHA256, and use a unique secret key for every endpoint.
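A minimal sketch of signing and verifying with HMAC-SHA256, using only the Python standard library. The signed-content layout (message ID, timestamp, and body joined by dots), the hex encoding, and the five-minute tolerance are assumptions made for illustration; they are not presented as the exact wire format of any particular service.

```python
import hashlib
import hmac


def sign(secret: bytes, msg_id: str, timestamp: int, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature over id, timestamp, and body (assumed layout)."""
    signed_content = f"{msg_id}.{timestamp}.".encode() + body
    return hmac.new(secret, signed_content, hashlib.sha256).hexdigest()


def verify(secret: bytes, msg_id: str, timestamp: int, body: bytes,
           signature: str, now: int, tolerance_seconds: int = 300) -> bool:
    """Receiver-side check: reject stale timestamps, then compare in constant time."""
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old or too far in the future; guards against replays
    expected = sign(secret, msg_id, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


if __name__ == "__main__":
    endpoint_secret = b"per-endpoint-secret"  # hypothetical; one secret per endpoint
    sig = sign(endpoint_secret, "msg_1", 1700000000, b'{"type":"user.created"}')
    print(verify(endpoint_secret, "msg_1", 1700000000,
                 b'{"type":"user.created"}', sig, now=1700000010))
```

Including the timestamp in the signed content means an attacker cannot replay an old request with a fresh timestamp without invalidating the signature.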
Event Types & Fan Out (multiple endpoints)
These are technically two separate features, but they go hand in hand. Classifying events into event types (e.g. user created vs. user deleted) and letting users create multiple endpoints allows them to build separate webhook rails for separate systems, services, etc. This simplifies implementations, makes it easier to debug any endpoint failures, and avoids having a single point of failure for all webhooks.
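One way to picture how event types and fan out combine is a filter over a customer's endpoints. In this sketch the Endpoint class, the fan_out function, and the convention that an empty subscription means "all event types" are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    url: str
    # Event types this endpoint subscribes to; empty means "all" (assumed convention).
    event_types: frozenset = frozenset()


def fan_out(event_type, endpoints):
    """Return every endpoint that should receive an event of the given type."""
    return [
        ep for ep in endpoints
        if not ep.event_types or event_type in ep.event_types
    ]


if __name__ == "__main__":
    endpoints = [
        Endpoint("https://billing.example/hooks", frozenset({"user.created"})),
        Endpoint("https://audit.example/hooks"),  # receives everything
    ]
    for ep in fan_out("user.deleted", endpoints):
        print(ep.url)
```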
Monitoring and Logging Visibility
Webhooks require more attention to monitoring and logging than REST API services do. The one-way nature of webhook communications means that your customers won't know and can't notify you when something goes wrong. You need to be the one notifying your users that their endpoints are failing. You also want to give your users visibility into delivery attempts and endpoint statuses to make it easier for them to maintain their endpoints.
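To make "notify users that their endpoints are failing" concrete, here is a small sketch that tracks recent delivery attempts per endpoint and decides when a notification is warranted. The EndpointHealth class, the window size, and the 50% threshold are assumptions for illustration only.

```python
from collections import deque


class EndpointHealth:
    """Tracks the most recent delivery attempts for one endpoint."""

    def __init__(self, window=20, alert_ratio=0.5):
        self.attempts = deque(maxlen=window)  # True = delivered, False = failed
        self.alert_ratio = alert_ratio

    def record(self, success: bool) -> None:
        self.attempts.append(success)

    def failure_ratio(self) -> float:
        if not self.attempts:
            return 0.0
        return self.attempts.count(False) / len(self.attempts)

    def should_notify(self) -> bool:
        """Notify only once the window is full, to avoid alerting on one early failure."""
        window_full = len(self.attempts) == self.attempts.maxlen
        return window_full and self.failure_ratio() >= self.alert_ratio
```

The same recorded attempts can back the delivery log you expose to users, so the alerting and the visibility features share one source of truth.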
Security Best Practices
Web Application Firewall
A WAF is a general best practice that helps secure your service against a variety of common attack vectors.
Separate VPCs for internal systems and outgoing proxies
You want to use proxies to send your webhook events to customer endpoints to secure your internal services. You also want to wrap both in VPCs for added security.
HMAC with SHA256
You need to use a secure hash function like SHA256. You also need to use it through HMAC (hash-based message authentication code) rather than hashing the secret and payload directly, which protects against length extension attacks.
Static IPs
Your outgoing proxies should send webhooks to customer endpoints from static IP addresses. This allows customers to whitelist your IPs to satisfy their security protocols while also helping to prevent server-side request forgery attacks.
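From the customer's side, static IPs make a check like the following possible. This is a sketch using Python's standard ipaddress module; the function name and the example networks (taken from ranges reserved for documentation) are placeholders, not real sender addresses.

```python
import ipaddress

# Placeholder networks from documentation-reserved ranges, standing in for
# the static IPs a webhook provider would publish.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/28"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed_sender(remote_ip: str, allowed_networks=ALLOWED_NETWORKS) -> bool:
    """Return True if the connecting IP falls inside one of the published networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in allowed_networks)
```

An IP check like this complements signature verification rather than replacing it, since source addresses alone do not prove that the payload is authentic.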
Webhook dispatching service components:
Webhook dispatching service (big blue box)
This represents what is included in Svix's webhook service offering. If you're looking to implement this type of system internally, we recommend building a microservice modeled after the diagram.
Orange box
A Virtual Private Cloud to protect your internal systems.
Green box
A Virtual Private Cloud to separate outgoing proxies from internal systems.
API Customer
End user of an API.
Customer Endpoints
Endpoints specified by API customers to let the webhook dispatching service know where the webhooks should be sent. Our recommendation is to allow users to specify multiple endpoints instead of limiting them to only one.
Load Balancer/WAF
These are here as scalability and security best practices.
API Servers
Endpoints to which API customers can send their webhook payloads to use your webhook service.
Webhook Delivery Service
The webhook delivery service takes requests received from the API servers, persists them in storage, caches them, and adds them to the task queue.
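The persist, cache, and enqueue sequence can be sketched with in-memory stand-ins. Here a dict plays the role of the database, another dict plays the cache, and queue.Queue plays the task queue; the DeliveryService class, its field names, and the message ID format are all hypothetical.

```python
import json
import queue
import uuid


class DeliveryService:
    def __init__(self):
        self.storage = {}            # stand-in for a durable database table
        self.cache = {}              # stand-in for a cache such as Redis
        self.tasks = queue.Queue()   # stand-in for the task queue

    def accept(self, endpoint_url: str, event_type: str, payload: dict) -> str:
        """Persist the message, cache its body, and enqueue a delivery task."""
        msg_id = f"msg_{uuid.uuid4().hex}"
        body = json.dumps(payload, separators=(",", ":"))
        # 1. Persist first, so the message survives a crash before delivery.
        self.storage[msg_id] = {
            "event_type": event_type,
            "endpoint_url": endpoint_url,
            "body": body,
            "status": "pending",
        }
        # 2. Cache the body so workers and retries can skip the database.
        self.cache[msg_id] = body
        # 3. Enqueue a small task that references the message by ID.
        self.tasks.put({"msg_id": msg_id, "endpoint_url": endpoint_url, "attempt": 0})
        return msg_id
```

Persisting before enqueueing is a deliberate ordering in this sketch: a task should never refer to a message that storage does not know about.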
Customer UI
The customer UI makes it much simpler for customers to manage their endpoints, diagnose issues, and debug failures.
Storage Layer
To ensure deliverability, you must persist all webhook attempts in a storage layer. Not only does this allow you to retry failed attempts, but it also lets you give users visibility into the status of their webhook attempts and endpoints.
Caching Layer
Given the volume of webhooks you will send and the number of retries that failed deliveries generate, we strongly recommend storing webhook payloads in a cache to reduce load on your database.
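A read-through lookup is one simple way to get this effect: check the cache first and fall back to storage only on a miss. In the sketch below, get_payload and load_from_storage are hypothetical names, and a plain dict stands in for the cache.

```python
def get_payload(msg_id, cache, load_from_storage):
    """Return the payload for msg_id, reading storage only on a cache miss."""
    body = cache.get(msg_id)
    if body is None:
        body = load_from_storage(msg_id)  # the expensive database read
        cache[msg_id] = body              # later retries are served from the cache
    return body
```

With this pattern, a message that is retried many times costs at most one database read for its payload while the cache entry lives.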
Task Queue
Not only do you need to requeue failed attempts for retries, but you also want to ensure that your API servers are always available to receive requests instead of being stuck dispatching webhooks.
Webhook Dispatch Workers
The workers take tasks from the queue and route the payload to the specified endpoint.
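The interaction between the task queue and the workers might look like the sketch below. The send and schedule_retry callables are injected so the example stays self-contained; they, the task fields, and the MAX_ATTEMPTS value are assumptions for illustration.

```python
import queue

MAX_ATTEMPTS = 5  # illustrative limit, not a recommended value


def process_one(tasks, send, schedule_retry):
    """Take one task from the queue, attempt delivery, and requeue on failure.

    send(url, body) is expected to return an HTTP status code.
    schedule_retry(task) is expected to put the task back on the queue after a delay.
    """
    task = tasks.get_nowait()
    try:
        status = send(task["endpoint_url"], task["body"])
        delivered = 200 <= status < 300
    except Exception:
        delivered = False  # timeouts and connection errors count as failures
    if delivered:
        return "delivered"
    if task["attempt"] + 1 >= MAX_ATTEMPTS:
        return "failed"  # retries exhausted; surface this to the user
    schedule_retry({**task, "attempt": task["attempt"] + 1})
    return "retrying"
```

Because delivery happens in workers rather than in the request path, a slow customer endpoint delays only its own task and never blocks the API servers.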
Logging and Monitoring
Logging and monitoring are especially important for webhook systems because the communication is only one way. With traditional API services, a customer gets an error response from the endpoint and knows a failure occurred. With webhooks, there is no way for your user to know that they were supposed to receive a webhook and didn't.
VPC Peering
This lets our internal services send webhooks to our proxy servers while isolating the proxies in their own VPC.
Outgoing Proxies
The outgoing proxies receive webhooks from our service and send them to customer endpoints without having access to internal services. They should also be set up to send from a specified set of IP addresses to prevent server-side request forgery and to comply with potential customer requirements.
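Network isolation is the main defense described above. As a complementary check, which is an addition for illustration rather than part of the diagram, a proxy could also refuse destinations that resolve to non-public addresses. The function name is hypothetical, and the sketch assumes the hostname has already been resolved to an IP address.

```python
import ipaddress


def is_safe_destination(resolved_ip: str) -> bool:
    """Return True only for globally routable addresses.

    Rejects private, loopback, and link-local targets (for example cloud
    metadata addresses), which are typical server-side request forgery targets.
    """
    return ipaddress.ip_address(resolved_ip).is_global
```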