Help finding reason for Solace Pubsub+ Standard on GCP Kubernetes Autopilot costing so much

CloudGod Member Posts: 24 ✭✭
edited September 2023 in General Discussions

Dear all,

It's a long shot, but this community has pulled off some amazing solutions, so it's worth a try.

We're positioning Solace as part of a business solution for one of our customers.

The solution as a whole is deployed in a single GCP Kubernetes autopilot cluster, including the Solace PubSub+ Standard workload.

Now, we're seeing somewhat odd behavior in the monitoring and metrics console.

When we look at these reports, we see a tremendous difference between the CPU-hours and memory-hours requested by the Solace workload and by all the other components, and the difference is more or less evident depending on the time range of the analysis.

On an hourly analysis, there are no significant differences between Solace and the other workloads. On a daily analysis, significant differences begin to appear, and they grow dramatically as the time range of the analysis widens (monthly is huge).

My guess is that Solace and GCP's Autopilot don't get along, and a ton of CPU-hours and memory-hours are requested but never properly used.
As we're using GCP Autopilot, we have no control over the number and size of the K8s nodes.

The example shown in the picture is from a development environment with little to no real workload.

Has anyone experienced something like this before?

Cheers


Comments

  • marc Member, Administrator, Moderator, Employee Posts: 954 admin

    Hi @CloudGod,

    Just a heads up that I checked with some folks and couldn't find anyone else running into this issue. Were you able to find a resolution?

  • CloudGod Member Posts: 24 ✭✭

    Thanks @marc !

    Still nothing. This is really strange.

    I'll keep you posted.

  • marc Member, Administrator, Moderator, Employee Posts: 954 admin

    Hi @CloudGod,

    I was chatting with one of our broker experts about what you're seeing and this is what he is thinking:

    My best guess is that the other workloads are more spiky, which allows Autopilot to optimize the resources allocated to them. The broker has a fairly constant memory requirement (because it largely allocates its memory statically) and relatively high CPU utilization even at idle. If Autopilot can dial the other workloads down a lot when they are idle, the broker would look like a heavy resource user over longer periods of time, because it cannot really be dialed down.

    He also mentioned that if you aren't specifying memory and CPU requests for the broker in Autopilot, you should try that. (Note: if you are using the Helm chart, this is probably already the case.)
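
    For reference, here's a minimal sketch of what explicit resource requests on the broker pod could look like. The pod name, container name, and CPU/memory values below are illustrative assumptions, not Solace-recommended sizing; replace them with values matching your broker tier. The relevant point is that Autopilot bills by the pod's *requested* resources (and, at least historically, forces limits equal to requests), so pinning the requests makes the reported CPU-hours and memory-hours predictable:

    ```yaml
    # Hypothetical excerpt of a broker pod spec; the sizing values are
    # illustrative assumptions, not official Solace recommendations.
    # GKE Autopilot bills per the pod's requested resources, so these
    # numbers are what drive the CPU-hour / memory-hour cost reports.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pubsubplus-example          # hypothetical name
    spec:
      containers:
        - name: pubsubplus
          image: solace/solace-pubsub-standard:latest
          resources:
            requests:
              cpu: "2"                  # assumed dev-tier sizing; adjust for yours
              memory: 4Gi
    ```

    If you deploy with the PubSub+ Helm chart, it should already render requests like these for you; running `helm template` against your release values is an easy way to confirm what is actually being requested (and therefore billed).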

    Hope that helps!