Publish unicode string using solclientjs

soham
soham Member Posts: 5

Hi Experts,

I am trying to publish a non-ascii string to a solace topic using the solclientjs library in nodeJs, and consume the same using the golang solace messaging library, and vice-versa (i.e. golang to nodeJs).

Example string: abc-nonascii_ãçï_স

I understand I have three options to do this in solclientjs: as BinaryAttachment, as XMLContent, or as SDT Field.

However, the first option of BinaryAttachment seems to support only latin1 encoding, and the non-ascii characters are not sent correctly. The third option of SDT Field of ByteArray type also seems to have the same issue. The second option of XMLContent seems to be the only option, however I read that it is a legacy type and the golang library also does not have it as an explicit option.

I have tried changing the SolClientFactory Profile to version10_5, which makes the consumer correctly decode the unicode bytes, but the publisher still fails to encode the string correctly. I also see the same mentioned in the docs.

Hence I have two questions:

  1. What is the recommended way to transfer unicode strings in solace client libraries?
  2. Is it expected/desirable for the publisher to behave differently from the consumer when using the same factory profile version?

Thanks

Best Answer

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 662 admin
    edited April 2023 #2 Answer ✓

    Hi there @soham. Thanks for trying to research this out yourself and post a well-thought-out question. 👍🏼

    First: don't use the XMLContent portion of the payload. That's old and legacy.

    Second: I'd suggest using either a regular TextMessage or possibly a BytesMessage / binary. Either works. TextMessage is a special type of SDT, with just a single field (the text). It's (supposed to be!) UTF-8 encoded, so it definitely (should?) handle non-ASCII text. Since JavaScript strings are natively UTF-16, the API should handle the conversion of the text for you. I think! This is not true for a plain binary attachment (more on that later).

    For JS sending / receiving a TextMessage:

    // send
    var message = solace.SolclientFactory.createMessage();
    message.setSdtContainer(solace.SDTField.create(solace.SDTFieldType.STRING, "here is my text."));
    
    // receive
    var payload;
    if (message.getType() == solace.MessageType.TEXT) {  // in case someone sends text message
        try {
            // JSON.parse assumes the text is JSON; for plain text use getValue() directly
            payload = JSON.parse(message.getSdtContainer().getValue());
        } catch(e) {
            subscriber.log(e);
            return;
        }
    }
    

    If trying to send a JS string as just plain binary attachment, you need to convert from UTF-16 to UTF-8. Same with receiving a binary attachment string from another app. I ran into this myself when I had a little JS app that was trying to show my colleague's non-Latin-spelling names (e.g. Chinese, Japanese, ...). These are the little helper methods I ended up finding and using:

        //http://ecmanaut.blogspot.hk/2006/07/encoding-decoding-utf8-in-javascript.html
        function decode_utf8(s) {
            return decodeURIComponent(escape(s));
        }
    
        function encode_utf8(s) {
            return unescape(encodeURIComponent(s));
        }
    

    Then the JS code for sending/receiving a BytesMessage with just a UTF-8 string as attachment should look like:

    // send
    var message = solace.SolclientFactory.createMessage();
    message.setDestination(solace.SolclientFactory.createTopic(topic));
    var jsonPayload = JSON.stringify(payload);  // still UTF-16
    message.setBinaryAttachment(encode_utf8(jsonPayload));  // change to UTF-8
    // or just message.setBinaryAttachment(encode_utf8("my js string"));
    
    // receive
    var payload;
    if (message.getType() == solace.MessageType.BINARY) {
        try {
            payload = JSON.parse(decode_utf8(message.getBinaryAttachment()));
            // or just: var str = decode_utf8(message.getBinaryAttachment());
        } catch(e) {
            subscriber.log(e);
            return;
        }
    }
    

    I think that should run. I just copied/pasted from some old examples I have, so hopefully this just works. Let us know!

    EDIT: let me know if the JS Text/SDT approach works. I'll test it myself eventually if you don't get back to me. You might need to do that encode/decode thing for the TextMessage as well..?

Answers

  • soham
    soham Member Posts: 5

    Hi @Aaron,

    Appreciate your detailed response. Below are my findings:

    • Publishing using the SDTContainer with SDTFieldType.STRING is working for me, and I am able to receive the unicode string correctly as binary attachment at both golang and nodejs consumers (using factoryProfile version10_5).
    • The encode/decode approach may not work for all characters, as discussed in the blog comments (http://disq.us/p/fg74xj). Also, it would require the conversion to be done at both producer and consumer, and any external or non-JavaScript client which does not have the capability will not work.

    Thanks

  • soham
    soham Member Posts: 5
    edited May 2023 #5

    An update:

    In the nodejs --> solace --> nodejs flow, when publishing using SDTContainer and consuming from solace using getBinaryAttachment(), I found that it prefixes a stray inverted comma (') in the message. This may be a bug? @Aaron

    On using message.getSdtContainer().getValue() the issue could be mitigated.

    Hence I had to include conditions in my consumer code to check for all 3 types of messages, for compatibility with diverse clients.
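
    A minimal sketch of what such a consumer check might look like. It assumes the three cases are a TextMessage (SDT), a raw binary attachment, and legacy XML content, and that `getXmlContent()` returns a plain string. `extractText` is a hypothetical helper name, and `MessageType` is expected to be `solace.MessageType`, passed in so the function can be exercised without a broker.

```javascript
// Sketch: pull a text payload out of a message regardless of how it was published.
// `extractText` is a hypothetical helper; `MessageType` should be solace.MessageType.
function extractText(message, MessageType) {
    const utf8 = new TextDecoder("utf-8");

    // Case 1: TextMessage. The string lives in the SDT container,
    // so read it from there instead of from the raw attachment.
    if (message.getType() === MessageType.TEXT) {
        return message.getSdtContainer().getValue();
    }

    // Case 2: raw binary attachment.
    const attachment = message.getBinaryAttachment();
    if (attachment instanceof Uint8Array) {
        // Factory profile version10_5: bytes arrive as a Uint8Array.
        return utf8.decode(attachment);
    }
    if (typeof attachment === "string" && attachment.length > 0) {
        // Older profiles: a Latin1 string with one character per byte,
        // so rebuild the bytes before decoding them as UTF-8.
        const bytes = Uint8Array.from(attachment, (ch) => ch.charCodeAt(0));
        return utf8.decode(bytes);
    }

    // Case 3 (assumption): legacy XML content, returned as-is.
    const xml = message.getXmlContent();
    return xml == null ? null : xml;
}
```

    In a message callback this would be called as `extractText(message, solace.MessageType)`. A `null` result means none of the three cases carried a payload.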

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 662 admin
    edited May 2023 #6

    Yeah, that's because you're serializing the message as an SDT (with extra bytes defining how big the SDT field is) and you're deserializing it as raw binary. You can see this in the dump() of Text messages, like here from a JCSMP sample:

    Destination:                            Topic 'solace/samples/jcsmp/hello/aaron'
    Priority:                               4
    Class Of Service:                       USER_COS_1
    DeliveryMode:                           DIRECT
    Message Id:                             5
    Binary Attachment:                      len=26
      1c 1a 48 65 6c 6c 6f 20    57 6f 72 6c 64 20 66 72    ..Hello.World.fr
      6f 6d 20 41 61 72 6f 6e    21 00                      om.Aaron!.
    

    Note 1c 1a at beginning of text field. That's the SDT encoding of a "text message". If I was to just take a UTF-8 string and stick it as binary payload, it wouldn't have that.
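
    To make the framing concrete, here is a rough sketch that unwraps exactly the layout visible in the dump above. The layout is inferred from those 26 bytes only, not taken from an SDT specification: first byte `0x1c`, second byte the total field length (header included), and a trailing `0x00`. `unwrapSdtString` is a hypothetical helper, and longer strings presumably use a wider length field that this sketch does not handle. In real code, `getSdtContainer().getValue()` is the safer route.

```javascript
// Sketch: unwrap a TextMessage payload that was read as raw bytes.
// ASSUMPTION (inferred from the dump, not from a spec):
//   byte 0    = 0x1c, marks a string field with a one-byte length
//   byte 1    = total field length, including the 2 header bytes
//   last byte = 0x00 terminator
function unwrapSdtString(bytes) {
    if (bytes.length < 3 || bytes[0] !== 0x1c) {
        return null;  // not the layout this sketch understands
    }
    const total = bytes[1];
    if (total < 3 || total > bytes.length || bytes[total - 1] !== 0x00) {
        return null;  // length or terminator does not match the assumed layout
    }
    // Skip the 2 header bytes and drop the trailing terminator.
    return new TextDecoder("utf-8").decode(bytes.subarray(2, total - 1));
}

// The 26 bytes copied from the dump.
const dumped = Uint8Array.from([
    0x1c, 0x1a, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20,
    0x57, 0x6f, 0x72, 0x6c, 0x64, 0x20, 0x66, 0x72,
    0x6f, 0x6d, 0x20, 0x41, 0x61, 0x72, 0x6f, 0x6e,
    0x21, 0x00,
]);
console.log(unwrapSdtString(dumped));  // prints "Hello World from Aaron!"
```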

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 662 admin
    edited March 2024 #7

    Hello hello! I am updating this thread..! 🎉 I recently stumbled onto this particular issue again: a JavaScript publisher was sending a String as a raw binary attachment, and it contained the GBP symbol £. It wasn't getting encoded properly into UTF-8; it was being sent as a single byte that isn't valid UTF-8. I've done some research and thought I'd post an update.
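
    A small illustration of why the £ breaks, using only standard JavaScript (no Solace API involved). The helper name `isValidUtf8` is made up for this sketch. £ is U+00A3, so a Latin1-style conversion emits the single byte 0xA3, whereas UTF-8 needs the two bytes 0xC2 0xA3.

```javascript
const pound = "£";  // U+00A3

// What a one-byte-per-character (Latin1) conversion would send.
const latin1Bytes = Uint8Array.of(pound.charCodeAt(0));  // [0xA3]

// What a proper UTF-8 encoding sends.
const utf8Bytes = new TextEncoder().encode(pound);       // [0xC2, 0xA3]

// Hypothetical helper: a strict decoder throws on invalid UTF-8 input.
function isValidUtf8(bytes) {
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(bytes);
        return true;
    } catch (e) {
        return false;
    }
}

console.log(isValidUtf8(latin1Bytes));  // false: a lone 0xA3 is a stray continuation byte
console.log(isValidUtf8(utf8Bytes));    // true
```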

    I started off using my approach above for converting to a UTF-8 string:

    function encode_utf8(s) {
        return unescape(encodeURIComponent(s));
    }
    

    And it still works great, as expected. But I did some research, and it turns out escape() / unescape() have been deprecated for a LONG time. So while this approach still works, it is not current best practice.

    I found some other posts that talk about the TextEncoder object, and it seems to work well. TextEncoder can either generate a new Uint8Array on each encode() invocation (fine for low-throughput apps), or you can preallocate the array and reuse it with encodeInto(string, array) to save on memory thrashing.
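
    A quick sketch of the two TextEncoder styles, using only standard JavaScript. The function names and the 1024-byte scratch size are arbitrary choices for illustration, not anything from the Solace API.

```javascript
const encoder = new TextEncoder();  // stateless, safe to share and reuse

// Style 1: allocate a fresh Uint8Array per call.
function encodeFresh(text) {
    return encoder.encode(text);
}

// Style 2: reuse one preallocated buffer (size here is an arbitrary assumption).
const scratch = new Uint8Array(1024);
function encodeReused(text) {
    // encodeInto reports UTF-16 code units read and bytes written.
    const { read, written } = encoder.encodeInto(text, scratch);
    if (read < text.length) {
        throw new Error("scratch buffer too small for this string");
    }
    // A view onto the filled part of the buffer, not a copy.
    return scratch.subarray(0, written);
}

const weirdText = "Hello World! £¥→ÐĞ🎅🏼🎉";
console.log(encodeFresh(weirdText).length === encodeReused(weirdText).length);  // true
```

    Note that the view returned by `encodeReused` is overwritten by the next call, so it is only safe if the consumer of the bytes copies them before the buffer is reused. I have not verified whether setBinaryAttachment() makes such a copy.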

    However! I noticed that I could not get my subscriber to properly detect when I was sending a raw array in the binary attachment… it was always returning a type of String. So I checked the Docs, and noticed that for "older" versions of our JavaScript API, it always returns a Latin1 string. To fix this, all I had to do was update the factory profile to the newer 10.5: 👈🏼

    var factoryProps = new solace.SolclientFactoryProperties();
    factoryProps.profile = solace.SolclientFactoryProfiles.version10_5;
    solace.SolclientFactory.init(factoryProps);
    

    Then my subscriber's call to getBinaryAttachment() was returning an array as expected. Unexpectedly, I didn't even have to use the TextDecoder on the other side, JavaScript just knew that it was a UTF-8 String!?

    Hopefully this helps anyone in the future stumbling onto this. The publisher should do something like:

    const weirdText = "Hello World! £¥→ÐĞ🎅🏼🎉";
    const encoder = new TextEncoder();  // probably best to make this global and reuse
    const u8array = encoder.encode(weirdText);
    message.setBinaryAttachment(u8array);
    

    Then on the subscriber side, make sure you're using SolclientFactoryProfiles.version10_5 and the String will pop out properly formatted as expected..! 🙌🏼

    If the factory profile version is left at 10, it looks like this:

    [19:09:36] solace/js/test/topic: Hello World! £¥→ÐĞ🎅🎉