Try PubSub+
If you haven't already, check out our new Developer Portal! You'll find useful information about Solace PubSub+ as well as handy resources to get you started.

Writing to Solace with Spark

harryharry Member Posts: 10

i have a spark dataframe and i want to write it to solace i.e. i want to send the rows of dataframe as events to solace(publishing basically) in a distributed way . like spark to kafka.

Comments

  • AaronAaron Member, Moderator, Employee Posts: 62 Solace Employee

    Hey @harry. I did some searching and found this, might it help? https://github.com/jksinghpro/spark-jms

    My colleague @Ken Overton did a Spark-JMS integration, but on the other side: https://github.com/koverton/spark-jms-connector I'm wondering if he might have any insight for taking data out of Spark and posting on Solace.

  • Pankaj AroraPankaj Arora Member Posts: 1

    @harry I had created a Structured Streaming with Datasource V2 https://github.com/pankajsa/solace-spark to connect Solace & Spark. The current code has implemented the reader interfaces and you can extend it by implementing the writer interface.

  • harryharry Member Posts: 10

    @Aaron @Pankaj Arora @Ken Overton thanks a lot guys , let me check out those links, although i wrote a fair share of implementation ==> for each spark partition by making a connection object , a message producer and then iterating over the rows of spark dataFrame for that partition and sending them to messageProducer.send() of jms api and my solace on docker received those events which i sent in distributed way.
    although i need to figure out some cases and improve efficiency but this has been a big boost . Will get back shortly to you guys.

  • AaronAaron Member, Moderator, Employee Posts: 62 Solace Employee
    edited May 7

    Awesome @harry . Looking forward to seeing the results!

    A couple thoughts... I don't know Spark, but I do know that creating new Connections and Producers is kind of an "expensive" operation (vs. sending a single message is "cheap"), so hopefully those objects are somewhat long-lived, and you're not creating and destroying Connections constantly... more efficient for the connection to remain up as long as possible, and just send the rows as messages as you iterate over. I think that sounds like what you're doing, but just wanted to be clear.

    Are you sending to a dynamic topic? E.g. pulling out some piece of data from the row (id?) and inserting that into the topic that you're publishing on? E.g. spark/data/$id/update or something..?

  • harryharry Member Posts: 10

    hey @Aaron , yeah the connection and producer is created per partition and closed per partition, so it has the upper limit of number of spark partitions which we can control, which is generally not much and i can introduce some throttling here. and to answer your question , yes! i am iterating over rows per partition and sending the rows as message as i iterate over, and no the topic is not dynamic in my case , so i think we are good. Couple of questions?
    Q1) Does solace (and i am assuming it does) have a connection pool and limit to how many connections any client can make with solace using the jms Api's?
    Q2) Since spark can process large amount of data as it is distributed over several nodes , let's say i read huge data and try to send it over solace , i don't want to bombard solace's queue's with my messages/events. how does solace control it. Like it sotres the incoming events in a buffer? and then pass on to broker ? how does it handle very huge number of events coming to it in a very short span, what can a client do/configure to avoid this?
    thanks.

  • harryharry Member Posts: 10
    edited May 7

    @Ken Overton i looked at the link, i think we both are doing the same thing when it comes to the CORE code with map partitions and iterating over rows. i have couple of issues need to be figured out
    Q1) how would you ensure an event at time t0 ( breach happened) will arrive at solace before t0 +delta( nothing to worry , all clear for breach, system error) when sending rows from a static data frame to solace.
    Q2) the same thing i mentioned above as to how to control flow so as not to bombard solace with huge amount of messages from dataframe as (the data is distributed and it can be millions of rows or even more).

    For question 1 , what if i assign a (sequence number +time) as id and do a logical partition by object id (so that same type of events falls in same partition and arrive almost immediately ) but this might lead to man y partitions , i am not sure?
    Any insights bud?
    @Pankaj Arora @Aaron

Sign In or Register to comment.