PubSub+ with serverless cloud components (esp. Azure) in the data flow: which tech? How?

We are using PubSub+ in an event-carried state transfer pattern to integrate data between a number of enterprise applications. At present, publishers can use a NuGet package I authored to simplify connectivity, incorporating it into their application if that application can raise events natively. For applications which cannot, we generate events at the database level with a technology like SQL Server change data capture or SQL Server change tracking, with a Windows service acting as the bridge between the database polling and the PubSub+ connection.
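The database-level bridge can be sketched as a watermark-driven polling loop. Here is a minimal sketch in Python, with SQLite standing in for SQL Server change tracking (real change tracking would query CHANGETABLE(CHANGES ...) against a SYS_CHANGE_VERSION watermark) and a plain callback standing in for the PubSub+ publish call; the table and function names are hypothetical:

```python
import sqlite3

# Illustrative sketch only: fetch_changes/poll_once are made-up names, and
# SQLite stands in for SQL Server change tracking. The key idea is the
# watermark: only rows whose version exceeds the last synced version are
# published, and the watermark only advances after they are handled.

def fetch_changes(conn, last_version):
    """Return rows changed since last_version, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, version FROM orders "
        "WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    new_version = rows[-1][2] if rows else last_version
    return rows, new_version

def poll_once(conn, last_version, publish):
    """One polling cycle: read changes, publish each, advance the watermark."""
    rows, new_version = fetch_changes(conn, last_version)
    for row_id, payload, version in rows:
        publish({"id": row_id, "payload": payload, "version": version})
    return new_version

# Demo with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, version INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "created", 1), (2, "created", 2), (1, "updated", 3)])

published = []
watermark = poll_once(conn, 0, published.append)   # publishes 3 events
watermark = poll_once(conn, watermark, published.append)  # publishes nothing
print(watermark, len(published))  # 3 3
```

In the real Windows service the loop runs on a timer and the watermark is persisted, so a restart resumes from the last acknowledged change rather than republishing everything.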

Similarly, on the subscriber side there is typically a Windows service acting as the "listener": it subscribes, receives the messages, and ultimately pushes them into some kind of repository on the subscriber side.

Keeping these components outside the native application code makes sense for a lot of reasons: the native application might be a vendor system whose code we cannot change, or it might be something like a typical web application exposing only a standard, slow, HTTP "API" designed purely for the application UI to consume.

For these reasons, the bridge components in my existing integration data flows are implemented as Windows services. That seems like a natural fit for something which runs constantly and is kept alive by the operating system. But as more applications move entirely into Azure, and particularly as the company pushes development towards serverless components, the choice of bridging technology is less clear.

I am interested in hearing from people who are using PubSub+ in combination with serverless cloud components, especially on Azure. Which cloud product did you, or would you, choose to host this kind of bridging code? Azure Functions seem like a poor choice if the idea is to spin up an execution on the arrival of every event: independent executions make it hard to preserve event order, and at that invocation rate the cost seems likely to be very high.
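To make the ordering concern concrete: each invocation runs independently, so per-invocation latency variation (cold starts, retries, scale-out) can reorder delivery. A small simulation, with threads standing in for function instances and artificial delays replacing real latency variance:

```python
import threading
import time

events = [1, 2, 3, 4]   # published in this order
arrivals = []           # order in which the downstream store sees them
lock = threading.Lock()

def handle(event):
    # Earlier events are given longer delays here to force the reordering
    # that independent, concurrently-scaled invocations can produce.
    time.sleep(0.05 * (len(events) - event))
    with lock:
        arrivals.append(event)

threads = [threading.Thread(target=handle, args=(e,)) for e in events]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(arrivals)  # [4, 3, 2, 1] — downstream sees the events out of order
```

Nothing is lost, but without a sequencing key and reordering logic downstream, "last write wins" on the repository can now mean "oldest state wins".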

Comments

  • swenhelge
    Member, Employee Posts: 67 Solace Employee

    Our preferred mechanism is to use a serverless function in the chosen cloud — for example Azure Functions, Huawei FunctionGraph, or AWS Lambda.
    These are triggered via an HTTP POST request through the broker's REST delivery point (RDP) feature.
    The functions do any transformation required, handle authentication to the downstream systems, and so on.
    This blog post explains it a bit (it mostly focuses on automation) - https://solace.com/blog/streaming-asset-sensor-data-azure-datalake-ansible/

    There is an example of such an Azure Function that is triggered by the broker and writes to Azure Blob Storage. I think the ZIP here can be uploaded directly to Azure Functions: https://github.com/solace-iot-team/solace-int-rdp-az-funcs/releases/tag/v0.1.0

    And an example of an Azure Function that goes the other way — triggered by an Azure message (in this example from IoT Hub) and republishing the event to the broker:
    https://github.com/solace-iot-team/azure-iot-hub-integration
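On the receiving side, the RDP described above issues an HTTP POST per queued message and, as I understand it, treats a success response as delivery confirmation. A minimal sketch of such an endpoint in Python (stdlib only; the `/api/ingest` path and the payload shape are made up for illustration — a real deployment would be the Azure Function's HTTP trigger):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []

class RdpHandler(BaseHTTPRequestHandler):
    """Accepts the POST an RDP would make for each queued message."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received.append(json.loads(body))
        self.send_response(200)  # success response acknowledges the message
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RdpHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate the broker's RDP delivering one event.
url = f"http://127.0.0.1:{server.server_port}/api/ingest"
req = urllib.request.Request(url, data=json.dumps({"order": 42}).encode(),
                             headers={"Content-Type": "application/json"})
status = urllib.request.urlopen(req).status
server.shutdown()

print(status, received)  # 200 [{'order': 42}]
```

A non-2xx response (or a timeout) would leave the message on the queue for redelivery, which is what makes the HTTP hop safe even though it is slow.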

  • allmhhuran
    Member Posts: 26
    edited December 2020

    Thanks for your comment, swenhelge.

    In the OP I noted two specific reasons why triggering an Azure Function to receive and process each message seems like a bad idea: you would lose your message order guarantee at the subscriber side, and when Azure Functions are called extremely frequently (as they would be in an ECST scenario) they get quite expensive. I can think of a third reason as well: the introduction of HTTP calls into what should otherwise be a high-performance data flow, where per-message processing time should be as low as possible. In my rather long experience developing data integrations between systems, I have always been loath to introduce per-row (or, equivalently, per-event or per-message) HTTP calls except when absolutely necessary, because they are extremely slow.
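The per-message HTTP cost is easy to put rough numbers on. A back-of-envelope calculation with assumed figures (30 ms round trip, batches of 500 — both illustrative, not measured):

```python
# Why per-message HTTP calls hurt throughput: latency dominates.
# All numbers below are assumptions for illustration only.
events = 1_000_000
rtt_s = 0.030            # assumed 30 ms per HTTP round trip

per_message_hours = events * rtt_s / 3600
batch_size = 500
batched_hours = (events / batch_size) * rtt_s / 3600

print(f"{per_message_hours:.1f} h vs {batched_hours:.3f} h")
# 8.3 h vs 0.017 h — serial per-message calls take hours; batching
# amortizes the round trip across hundreds of events
```

Concurrency can claw some of that back, but concurrency is exactly what breaks ordering, so batching (or a long-lived connection) is usually the better lever here.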

    But again, the main issue with spinning up an Azure Function on receipt of each message is that the independent nature of each processing call could change the eventual message order as the messages are ultimately ingested by the actual subscribing application.

  • swenhelge
    Member, Employee Posts: 67 Solace Employee

    Oh, you are absolutely right with those considerations; this approach is limited to use cases that can live with them.
    I didn't really process that last paragraph.
    An alternative would be to write event-driven microservices and deploy them in Kubernetes (AKS). We chose the functions route because it seemed easier to deploy and monitor. Cost is a consideration, and you'll need to take the full cost of running the solution into account.

  • KenBarr
    Member, Employee Posts: 13 Solace Employee

    High throughput with in-order delivery is not really a sweet spot for most cloud integration services; they are built to hyper-scale and fall down fairly quickly when order matters. "Functions as a service" or even Event Hubs would display these attributes. I would like to know more about what the back-end services are, but without that info a common pattern would be to use something like Databricks to normalize the data for the target platforms and do the insert. If you are looking to homogenize your on-prem and in-cloud solutions to this problem, then this or Swen's K8s approach might be overkill.
    If this solution stays fully Windows/Azure, then in years past I have used IIS/WAS with WCF to integrate Solace into Windows applications; this can provide things like autoscaling while preserving order. Better, more portable solutions on .NET Core or .NET Framework might be worth looking at.