Deduplication of events (business events vs granular events)

sjaak Member Posts: 109 ✭✭✭
edited October 2024 in General Discussions #1

Hi all,

I'm interested in your thoughts on this matter. Our current workaround does not guarantee delivery, so we risk losing messages.

Use Case

How can we eliminate duplicate events with Solace from applications that generate lots of granular events?

What it solves

1. Reduce unnecessary process executions (overhead).

2. Decrease the number of subscriber callbacks to source applications that have strict API limits.

Example

Microsoft ERP platforms can produce up to 10 events for the same item when updating a single inventory entry. In this scenario, you receive one event per table rather than a single "business event" per item.

Our initial recommendation was to implement deduplication logic on the application side. However, this option is often not feasible with cloud applications due to limited customization options. Moreover, some clients may reject this approach, seeing it as an integration issue rather than a problem related to the application itself.

Current Workaround

1. Boomi Process 1 (Scheduled)

  • Use the Boomi GET operation with a batch size of 1,000 messages to consume from the queue.
  • Sort the events in memory by ID and remove duplicates.
  • Send the deduplicated events to a Solace queue.
  • Guaranteed delivery: No; messages will be lost if the Boomi process crashes.

2. Boomi Process 2 (Real-Time Listener)

  • Use the Boomi LISTEN operation with a batch size of 1 message to consume and process from the queue
  • Guaranteed delivery: Yes

Remarks

  • We attempted a "browse-only" GET operation and manually acknowledged the events; however, the Boomi PubSub+ connector does not technically support this functionality. 
  • The question arises: Why doesn't the Boomi GET operation function the same way as the Boomi LISTEN operation? For instance, a listener can retrieve a maximum of 1,000 messages per execution from the queue, and the Solace connector acknowledges them once the process has been completed.
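
The behaviour asked for in the remarks above (read a batch without removing it, then acknowledge only after processing succeeded) can be sketched as follows. This is a minimal illustration, not the Boomi or Solace API: `InMemoryQueue`, `run_batch`, and the `publish` callback are hypothetical names, and the queue is a plain in-memory stand-in.

```python
class InMemoryQueue:
    """Hypothetical stand-in for a broker queue; not the Solace or Boomi API."""

    def __init__(self, messages):
        # Map message id -> payload; dicts preserve arrival order.
        self._messages = dict(enumerate(messages))

    def browse(self, max_messages):
        """Return up to max_messages (msg_id, payload) pairs without removing them."""
        return list(self._messages.items())[:max_messages]

    def ack(self, msg_id):
        """Remove one message after it has been processed successfully."""
        self._messages.pop(msg_id, None)

    def depth(self):
        """Number of messages still on the queue."""
        return len(self._messages)


def run_batch(source, publish, batch_size=1000):
    """Browse a batch, publish one event per ID, then acknowledge the batch."""
    batch = source.browse(batch_size)
    latest = {}
    for _msg_id, event in batch:
        latest[event["id"]] = event  # later events overwrite earlier ones
    for event in latest.values():
        publish(event)  # if this raises, nothing has been acknowledged yet
    for msg_id, _event in batch:
        source.ack(msg_id)  # only reached when every publish succeeded
    return len(latest)
```

If `publish` fails or the process crashes before the acknowledgements, the whole batch stays on the source queue and is read again on the next run. The trade-off is at-least-once delivery: a retry may publish an event downstream a second time.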

Comments

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 664 admin

    Hey @sjaak, how's it going? Hope you're well.

    Is this de-dupe behaviour something you want to implement in Boomi specifically? From a related concept: I built a Solace monitoring environment at a large enterprise customer. They had a problem where if a WAN link flapped, a single appliance could generate hundreds of alerts / support tickets, one for each VPN bridge that disconnected. These alerts were being generated off the brokers' event.log. So I implemented a simple windowing correlator using an app called SEC (Simple Event Correlator) on my centralized Syslog server… if a certain number of similar events from the same broker occurred within 5 seconds, suppress all, gather, and emit one "aggregate" event.

    Tools like Flink are probably what I'd recommend today for building an aggregator / correlator of message streams… have it set up in the same way, with a defined time-bounded window to gather up related events, and then emit/publish just one to your downstream queue.
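
The time-bounded window idea described above can be sketched in a few lines. This is an illustration only, not how SEC or Flink implement it: the function name `correlate`, the `(timestamp, key)` event shape, and the choice to open a window at the first event for a key are all assumptions, and the input is a finished list rather than a live stream.

```python
def correlate(events, window_seconds=5.0):
    """Collapse events that share a key and fall inside one time window.

    events: iterable of (timestamp_seconds, key) tuples, sorted by timestamp.
    Returns a list of (key, window_start, count) aggregates, sorted by window_start.
    """
    open_windows = {}  # key -> [window_start, count]
    aggregates = []
    for ts, key in events:
        window = open_windows.get(key)
        if window is not None and ts - window[0] < window_seconds:
            window[1] += 1  # still inside the window: suppress and count
            continue
        if window is not None:
            # The previous window for this key has expired: emit it.
            aggregates.append((key, window[0], window[1]))
        open_windows[key] = [ts, 1]  # open a new window at this event
    # Flush the windows that are still open at the end of the input.
    for key, (start, count) in open_windows.items():
        aggregates.append((key, start, count))
    aggregates.sort(key=lambda agg: agg[1])
    return aggregates
```

Each aggregate stands for one downstream message. A real stream processor would also need a timer to close windows when no further events arrive, which this sketch leaves out.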

    RE: your other comments about "GET": I'm not sure… is that just a queue browser? Queue browsers can ACK as well. Perhaps this needs to be an enhancement request for the connector?

  • sjaak
    sjaak Member Posts: 109 ✭✭✭

    Hey @Aaron ,

    All is well here! 😊

    We're working with a queue browser, and once the events are successfully processed, we want to consume them per batch. The current max is 1000 messages per Boomi/Solace GET operation.

    To deduplicate events within a specific time frame (like minutes or hours), we need to implement this with a scheduled Boomi process, as a GET request is the only option available.

    Please refer to the help documentation on how the Boomi Pub/Sub connector works.

    I aim to keep our integration patterns as simple and stateless as possible. Ideally, there should be no custom integration components outside of Boomi, and we should avoid stateful components unless necessary.

    We could resolve this quickly with a simple database, but that would increase TCO and add complexity. We want to challenge the Solace community to implement this use case using only Boomi and Solace.

    An alternative approach could be “delayed delivery” or setting a Time-To-Live (TTL) for each message, with a fixed time slot like "wait until 09:00". We might be able to implement this using a calculated TTL, but we could still lose messages due to the Boomi GET operation.
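
A calculated TTL for a fixed time slot such as "wait until 09:00" comes down to simple date arithmetic. The sketch below shows only that calculation; `ttl_until_slot` is a hypothetical helper, it uses naive local datetimes (time zones and daylight saving are ignored), and how the value is applied to a message depends on the connector and is not shown.

```python
from datetime import datetime, time, timedelta


def ttl_until_slot(now, slot=time(9, 0)):
    """Return the milliseconds from `now` until the next occurrence of `slot`."""
    target = datetime.combine(now.date(), slot)
    if target <= now:
        target += timedelta(days=1)  # slot already passed today: use tomorrow
    return int((target - now).total_seconds() * 1000)


# An event published at 08:30 would wait 30 minutes for the 09:00 slot.
print(ttl_until_slot(datetime(2024, 10, 1, 8, 30)))  # 1800000
```

An event published after the slot rolls over to the next day, so one published at 10:00 gets a TTL of 23 hours.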

    Here’s an example of the event stream:

    Item A event 10:00:15
    Item A event 10:00:15
    Item A event 10:00:16
    Item B event 10:00:21
    Item B event 10:00:21
    Item B event 10:00:22
    Item A event 10:01:32

    If we run the deduplication process at 10:05:00, we would expect the following result, reducing the total from seven events to just two distinct events:

    Item A event
    Item B event
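
The expected result above can be reproduced with a small in-memory pass over the example stream. This is a sketch under assumptions: the function name `deduplicate` is hypothetical, and keeping the latest timestamp per item is an illustrative choice, since the example only lists the item names.

```python
# The example stream from above as (item, timestamp) pairs.
stream = [
    ("Item A", "10:00:15"),
    ("Item A", "10:00:15"),
    ("Item A", "10:00:16"),
    ("Item B", "10:00:21"),
    ("Item B", "10:00:21"),
    ("Item B", "10:00:22"),
    ("Item A", "10:01:32"),
]


def deduplicate(events):
    """Keep one entry per item, in order of first appearance.

    The value kept for each item is the last timestamp seen for it.
    """
    latest = {}
    for item, timestamp in events:
        latest[item] = timestamp  # overwrites earlier events for the same item
    return latest


distinct = deduplicate(stream)
for item in distinct:
    print(f"{item} event")
```

Seven events go in and two come out, so five of the seven events in this example are duplicates.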

    For your information, we have use cases where 80% of our events are duplicates. This creates a flood of callbacks to source applications, and cloud platforms, in particular, do not tolerate this well and may throttle traffic due to API limits.

  • Kevin
    Kevin Member Posts: 5

    Hey there!

    Thanks for the detailed explanation! It sounds like you're on the right track with using a scheduled Boomi process for deduplication. I completely agree with keeping our integration patterns simple and stateless—minimizing complexity is crucial for maintainability.

    The example you provided really highlights the need for effective deduplication, especially given the high percentage of duplicate events you’re encountering. I think exploring the "delayed delivery" approach or implementing TTL could be a good way to manage message flow while still leveraging the capabilities of Boomi and Solace.

    Let’s keep the conversation going as we challenge the Solace community! I'm looking forward to seeing how we can tackle this use case effectively.