Forum Discussion

dbeavon
Contributor
3 years ago

REST API reliability challenges in Azure

Those of us who are software developers are accustomed to this message:


"The underlying connection was closed: A connection that was expected to be kept alive was closed by the server. This issue occurred after 68 seconds."

When this issue comes up, it is normally because some event has broken TCP connectivity.  Occasionally the cause is a network or load-balancer misconfiguration.

 

This is one of two errors we are seeing after migrating our REST api workloads to onestream's Azure environment.  It happens fairly frequently, albeit in an unpredictable way.  I haven't found a pattern yet.  It doesn't happen on-premise.

 

If anyone has experience deploying a REST api solution to Azure, please let me know if this is an error message you encounter frequently.  It may not even be a problem with onestream's software; perhaps the root cause is with the Azure load-balancer.  I am also looking into other possible explanations, like SSL inspection.  Any tips would be greatly appreciated!
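
In case it helps anyone comparing notes, here is a rough sketch of the kind of client-side retry wrapper that can ride out these dropped connections while the root cause is chased down.  It is plain .NET HttpClient; the names are placeholders and none of it is part of onestream's client API.

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class RestRetrySketch
{
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(120) };

    // Retries a GET a few times when the connection is dropped mid-request.
    public static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (HttpResponseMessage response = await Client.GetAsync(url))
                {
                    response.EnsureSuccessStatusCode();
                    return await response.Content.ReadAsStringAsync();
                }
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts)
            {
                // "The underlying connection was closed" surfaces here (or as a
                // WebException on the older HttpWebRequest stack).  Back off and retry.
                Console.WriteLine($"Attempt {attempt} failed: {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(5 * attempt));
            }
        }
    }
}

A retry obviously only papers over the dropped connections, but it keeps scheduled data pulls alive while the network issue is investigated.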

 


  • We heavily use the REST api to retrieve financial data.  Recently our onestream platform was migrated to Azure and we started receiving failures that never occurred on-premise.

     

    Here is the second type of failure message that has been affecting us since moving to Azure:

    [screenshot of the error message]

    (A payload size constraint?)

    The error indicates a parsing problem at a specific character position within the results.  What we are seeing is that the JSON payload delivered by the REST api is unexpectedly truncated.  As a result, we lose a portion of the data and the JSON reader is unable to interpret the entire document.  (A rough sketch of the kind of check that confirms the truncation is at the end of this post.)

    Please let me know if anyone has encountered this type of REST api issue after migrating to Azure.
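
    For anyone who wants to confirm they are seeing the same thing, a check along these lines is enough to distinguish a truncated payload from a genuinely malformed one.  This is illustrative only: it assumes Newtonsoft.Json and the names are placeholders.

    using System;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Newtonsoft.Json.Linq;

    static class PayloadCheckSketch
    {
        public static async Task<JObject> ReadJsonAsync(HttpResponseMessage response)
        {
            byte[] raw = await response.Content.ReadAsByteArrayAsync();
            long? declared = response.Content.Headers.ContentLength;

            // If the server declared a Content-Length and fewer bytes arrived,
            // something between the client and the application server cut the
            // response short (load balancer, SSL inspection appliance, etc.).
            if (declared.HasValue && raw.Length < declared.Value)
            {
                throw new InvalidOperationException(
                    $"Truncated response: expected {declared.Value} bytes, received {raw.Length}.");
            }

            // A JsonReaderException thrown here reports the line/position where
            // parsing failed, which is how the truncation shows up in the first place.
            return JObject.Parse(Encoding.UTF8.GetString(raw));
        }
    }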

  • We heavily use the REST api to retrieve financial data.  Recently our onestream platform was migrated to Azure and we started receiving failures that never occurred on-premise.

     

    Here is a third error that started affecting us in Azure but never occurred on-premise.


    System.Exception: 'Invalid response from query in SendHttpRequest. Status code is Unauthorized. Content is "Error processing External Provider Sign In information. The remote name could not be resolved: 'login.microsoftonline.com'".   '

     

    This occurs AFTER we have used AAD to authenticate and obtain an OAuth bearer token.  It happens within the subsequent request to the REST api.  What this is essentially saying is that the onestream application server in Azure is not able to contact "login.microsoftonline.com".  Or rather, it can't even resolve the IP address for login.microsoftonline.com.

     

    Obviously it is a problem if/when an application hosted in Azure is unable to contact an identity server that is also hosted in Azure.  I'm assuming the purpose of contacting that identity server is to validate the OAuth bearer token in the request header.  (A quick way to check name resolution from the server itself is sketched at the end of this post.)
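
    For anyone unfamiliar with the sequence, the client side of the flow looks roughly like the standard MSAL client-credentials pattern sketched below.  This is illustrative only: the identifiers are placeholders, the exact grant type may differ from ours, and none of it is onestream's documented API.

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;
    using Microsoft.Identity.Client;

    static class TokenFlowSketch
    {
        public static async Task<string> CallRestApiAsync(
            string tenantId, string clientId, string clientSecret, string scope, string apiUrl)
        {
            // 1) Authenticate against AAD and obtain an OAuth bearer token.
            IConfidentialClientApplication app = ConfidentialClientApplicationBuilder
                .Create(clientId)
                .WithClientSecret(clientSecret)
                .WithAuthority("https://login.microsoftonline.com/" + tenantId)
                .Build();
            AuthenticationResult token = await app.AcquireTokenForClient(new[] { scope }).ExecuteAsync();

            // 2) Call the REST api with that token.  The application server then
            //    presumably contacts login.microsoftonline.com to validate it,
            //    which is the step that fails with the DNS error above.
            using (var client = new HttpClient())
            {
                client.DefaultRequestHeaders.Authorization =
                    new AuthenticationHeaderValue("Bearer", token.AccessToken);
                HttpResponseMessage response = await client.GetAsync(apiUrl);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }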

     

    If anyone is familiar with this error, or with the resolution, please let me know.  Migrating our onestream platform to Azure has been an interesting experience, with quite a few challenges.  We are opening support tickets as well, but new problems seem to be coming up faster than we can fix them.  If onestream support is able to find fixes to these, I will remember to circle back with a reply for the forum.
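
    In the meantime, a small probe along these lines (plain .NET, nothing onestream-specific) is enough to check whether the name resolves from a given machine:

    using System;
    using System.Linq;
    using System.Net;
    using System.Threading.Tasks;

    static class DnsProbe
    {
        public static async Task Main()
        {
            try
            {
                // Same lookup the application server has to perform before it can talk to AAD.
                IPHostEntry entry = await Dns.GetHostEntryAsync("login.microsoftonline.com");
                Console.WriteLine("Resolved: " + string.Join(", ", entry.AddressList.Select(a => a.ToString())));
            }
            catch (Exception ex)
            {
                // "The remote name could not be resolved" shows up here when the
                // machine's configured DNS cannot answer.
                Console.WriteLine("DNS lookup failed: " + ex.Message);
            }
        }
    }

    If that lookup fails intermittently on the application server, the Unauthorized error above is the downstream symptom.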

  • JackLacava
    Honored Contributor

    I took the liberty of condensing your posts into one thread, as they are all effectively about one topic (making the REST api work reliably in Azure).

    • dbeavon
      Contributor

      JackLacava Please don't.  They have entirely different root causes.  A DNS failure is different from a load-balancer constraint, which is different from client-side SSL inspection issues.

      • JackLacava
        Honored Contributor

        They are all precise infrastructure issues that are better discussed with our Support guys; so to start with, seeing that you've already opened tickets, I was tempted to just archive them. However, from the perspective of forum readers, who are mostly application administrators and application developers, they all originate from using the REST api in Azure and the reliability challenges that arise from that scenario - I think you understand that, considering you've effectively linked them together with that "part 1/2/3" in the title. So I thought there was a chance of a useful strategic discussion if we framed it like that. I'm really sorry if it looks harsh; I'm just trying to keep this space looking more like a discussion area than a support-ticket queue.

        Edit: thinking about it a bit more, regardless of the principle, I should have probably discussed it with you before taking action. Sincere apologies, I'll do that in the future.

  • JackLacava The only reason I opened these issues here is that I believe they will frequently be encountered by other customers who migrate an on-prem environment to Azure.  I intended to save some of them a lot of time, and to save your support organization from a lot of unnecessary tickets.

     

    These are technical issues, but not overly so.  Any customer who is using your REST api will need to understand the potential technical issues that might arise in Azure, even if those issues aren't happening on-prem.  Ideally these sorts of technical discussions would take place on stackoverflow, but the community of onestream customers doesn't seem to see the benefit of that (yet).  From what I gather, there is no onestream presence on stackoverflow, so we are left with no choice but to discuss technical issues in this curated/moderated forum instead.

     

    As it turns out, two of my issues were, in fact, related and had the same root cause.  Someone had neglected to disable SSL inspection, and that was causing both the TCP disconnection issues and the truncated payload issues.

     

    The third issue (DNS remote name could not be resolved) is something that tech support says they have seen in the past, but it is rare, and they say it might be associated with their maintenance work within Azure last night.  I am going to monitor it for another week or so, to determine whether the occurrences are as rare as people say.  A DNS failure that happens once a week seems reasonable, but not one that happens hourly or daily.  From what I recall, the SLA for most DNS service providers is 100%.