Function

If your Azure Function depends on a resource that may fail transiently, it’s a good idea to implement a retry policy. The problem with functions is that you are billed for processing time, which makes it suboptimal to implement a “respectful” retry policy (i.e. one that waits before retrying) inside the function. Luckily, you can use a basic feature of Azure Storage Queues to do so: the message visibility timeout.

Our goal here is to implement something that is similar to a peek-lock mechanism.

The idea is that when a message arrives on the queue, we don’t remove it right away. Instead, we mark it as locked for some time so no one else processes it, and peek at it to see what’s in it. On successful processing we delete it; on failure we leave it alone, so it reappears once the lock expires and we try to process it again. After a number of failed attempts, the message is sent to the poison queue.

Conceptually, here is how it works:

sequenceDiagram
    client->>queue: publish(message)
    queue->>function: event(new message here!)
    function->>queue: peek-lock(message, 10 seconds)
    queue-->>function: message
    function->>service: doStuff(message)
    service-->>function: error: failed, retry later
    Note over queue: 10 seconds later
    queue->>function: event(new message here!)
    function->>queue: peek-lock(message, 10 seconds)
    queue-->>function: message
    function->>service: doStuff(message)
    service-->>function: success
    function-->>queue: delete(message)
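The sequence above can be mimicked locally with a tiny in-memory queue. This is only a sketch: `InMemoryQueue`, `publish`, `receive`, and `remove` are hypothetical names, not the Azure Storage SDK, and the clock is passed in explicitly (in seconds) so the behaviour is easy to follow.

```javascript
// Minimal in-memory model of a queue with visibility timeouts.
// Hypothetical names for illustration; this is NOT the Azure Storage SDK.
class InMemoryQueue {
  constructor() {
    this.messages = [];
    this.nextId = 1;
  }

  // Add a message that is visible immediately.
  publish(body, now) {
    const message = { id: this.nextId++, body, visibleAt: now, dequeueCount: 0 };
    this.messages.push(message);
    return message.id;
  }

  // "Peek-lock": return the first visible message and hide it for `timeout` seconds.
  receive(now, timeout) {
    const message = this.messages.find((m) => m.visibleAt <= now);
    if (!message) return null;
    message.visibleAt = now + timeout;
    message.dequeueCount += 1;
    return { id: message.id, body: message.body, dequeueCount: message.dequeueCount };
  }

  // Delete the message after successful processing.
  remove(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
}

const queue = new InMemoryQueue();
queue.publish('hello world', 0);

const first = queue.receive(0, 10);        // locked until t = 10; processing fails
const hidden = queue.receive(5, 10);       // null: the message is still locked
const second = queue.receive(10, 10);      // lock expired, message is visible again
queue.remove(second.id);                   // processing succeeded: delete
const afterDelete = queue.receive(30, 10); // null: nothing left in the queue
```

Leaving a failed message untouched is all it takes to schedule the retry: the lock expiring does the rest.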

Azure Functions already does this so that a message is reprocessed if an error occurs. It also adds a mechanism on top: when processing finishes, the message is deleted if processing was successful; otherwise its visibility is reset to VisibilityTimeout. The documentation doesn’t say much about it, though. Fortunately, it can be configured.

This is the visibility logic:

graph TD;
A[processing finished] -->B{success?}
B -- yes --> C[Delete message]
B -- no --> D{Retries > max}
D -- no --> E[Set visibility timeout to x seconds]
D -- yes --> F[Send to poison queue]
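The flowchart can be written as a small decision function. This is a sketch, not the Functions runtime's actual code: `nextAction` and its return shape are hypothetical. The comparison against the maximum is also an assumption: here a message is poisoned as soon as its dequeue count reaches the maximum.

```javascript
// Decide what happens to a message once processing has finished.
// Hypothetical helper mirroring the flowchart; not part of the Functions runtime API.
function nextAction(success, dequeueCount, maxDequeueCount, visibilityTimeoutSeconds) {
  if (success) {
    return { action: 'delete' };
  }
  // Assumption: give up once the message has been dequeued maxDequeueCount times.
  if (dequeueCount >= maxDequeueCount) {
    return { action: 'poison' };
  }
  // Otherwise hide the message again so it is retried later.
  return { action: 'retry', visibleInSeconds: visibilityTimeoutSeconds };
}

console.log(nextAction(false, 1, 3, 10).action); // retry
console.log(nextAction(false, 3, 3, 10).action); // poison
```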

Code

Jumping into code to forget the precariousness of our lives, I’m gonna build up a function that picks a message from the queue, doesn’t do anything with it [1], just fails, then waits for the message to reappear after some time, and repeats, until the message hits the dequeue count limit and disappears in the Limbos of the Internets [2].

The project is here on GitHub.

Prepare the resource group in Azure

You need:

  1. A function app
  2. A storage account

Create the function

Create a new Node.js function (the language doesn’t matter, but this example uses Node.js). Create a connection to the storage account you created, then set a trigger on a queue (e.g. js-queue-items).

As promised, it basically does nothing but fail [3] (although it waits 5 seconds to simulate that it’s doing something):

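module.exports = function (context, message) {
    // Log the message id and how many times it has been dequeued so far.
    context.log(context.bindingData.id, ' - dequeueCount =', context.bindingData.dequeueCount);
    // Pretend to work for 5 seconds, then report a failure to the runtime.
    setTimeout(() => context.done("failure"), 5000);
};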

Open the App Service Editor (in Function app settings → ‘Go to App Service Editor’), then edit host.json and add a configuration for queues (or edit the existing one if you have one):

{
  "queues": {
    "visibilityTimeout": "00:00:10",
    "maxDequeueCount": 3
  }
}

This means that if processing a message fails, the message stays hidden in the queue for 10 seconds before it becomes visible again, and after 3 failed attempts it is sent to a poison queue.

Send messages to the queue

Use a client to send messages to the queue and trigger your function:

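# CoffeeScript client using the azure-storage package
azure = require "azure-storage"

# Reads the storage credentials from the environment
queueService = azure.createQueueService()

queueService.createQueueIfNotExists 'js-queue-items', (err, result, response) ->
  if err
    console.log err
  else
    # The message body is "hello world", base64-encoded
    queueService.createMessage 'js-queue-items', 'aGVsbG8gd29ybGQ=', (error) ->
      console.log "Event inserted #{error ? "without error"}"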

Open the logs and observe:

[Screenshot: function logs]

Here is what happens:

  • Message is dequeued a first time
  • The function waits for 5s, then fails
  • Message visibility timeout is reset to 10s
  • Life flows by for 10 seconds
  • Message is dequeued a second time, 15s after the initial dequeue (5s processing + 10s waiting)
  • etc.
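The timing above is simple arithmetic, sketched below. `retryTimeline` is a hypothetical helper; it assumes every attempt fails after a fixed processing time and ignores any delay before the runtime picks the message up again.

```javascript
// Compute when each dequeue happens (in seconds after the first dequeue),
// assuming every attempt fails after a fixed processing time.
function retryTimeline(processingSeconds, visibilityTimeoutSeconds, maxDequeueCount) {
  const dequeueTimes = [];
  let now = 0;
  for (let attempt = 1; attempt <= maxDequeueCount; attempt++) {
    dequeueTimes.push(now);
    now += processingSeconds;          // the attempt fails at this point
    if (attempt < maxDequeueCount) {
      now += visibilityTimeoutSeconds; // the message stays hidden until the next attempt
    }
  }
  // givenUpAt is the moment the last attempt fails.
  return { dequeueTimes, givenUpAt: now };
}

const timeline = retryTimeline(5, 10, 3);
console.log(timeline.dequeueTimes); // [ 0, 15, 30 ]
console.log(timeline.givenUpAt);    // 35
```

With the values used in this post (5s processing, 10s visibility timeout, 3 attempts), the last attempt fails 35 seconds after the first dequeue.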

Considerations

  1. It’s a nice enough way of doing retries, that uses guaranteed delivery from Azure Storage Queues.
  2. The timeout starts when processing finishes. If your processing time is 5s, the next attempt starts 15s after the previous one began.
  3. If your trigger is not a queue, you can still just do a pass-thru function that leverages it. If you can’t, the retry should probably happen somewhere else.
  4. You need to return properly from the Azure Function (e.g. using context.done()).
  5. This doesn’t let you implement back-off policies. It shouldn’t be used when the downstream system is unavailable because of a “too busy” error, or you’ll just back up more work [4]; it is better suited to covering yourself against race conditions and the like.
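For the pass-through case in point 3, a minimal sketch could look like the following. It assumes an HTTP-triggered function with an output queue binding named `retryQueueItem` and an HTTP output binding named `res`, both declared in function.json. Those names are illustrative and not taken from the project.

```javascript
// Pass-through function: accept an HTTP request and forward its body to a queue,
// so that the queue-triggered function can handle processing and retries.
// Assumes bindings named "retryQueueItem" (queue output) and "res" (HTTP output)
// are declared in function.json; both names are hypothetical.
function passThrough(context, req) {
  context.bindings.retryQueueItem = req.body;    // enqueue the payload
  context.res = { status: 202, body: 'queued' }; // answer right away
  context.done();                                // return properly
}

module.exports = passThrough;
```

The function itself stays short-lived and cheap; all the waiting happens in the queue.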

Notes

  1. At this point you may be wondering: do I not care about my impact on the environment? Consuming a message like that and not doing anything with it seems wasteful. And I had an answer to that.
  2. Where it will stay in oblivion, along with thousands of AIM email addresses.
  3. Please don’t judge it, after all, aren’t we all the same?
  4. Instead, an exponential retry policy should be used.
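To illustrate what note 4 means by an exponential retry policy, here is a sketch of how the delays could be computed. `backoffDelaySeconds` and its default values are illustrative assumptions, not Azure settings, and the fixed visibility timeout described in this post does not do this for you.

```javascript
// Exponential back-off: the delay doubles with each attempt, up to a cap.
// baseSeconds and maxSeconds are illustrative values, not Azure defaults.
function backoffDelaySeconds(attempt, baseSeconds = 10, maxSeconds = 300) {
  const delay = baseSeconds * Math.pow(2, attempt - 1);
  return Math.min(delay, maxSeconds);
}

const delays = [1, 2, 3, 4, 5, 6].map((attempt) => backoffDelaySeconds(attempt));
console.log(delays); // [ 10, 20, 40, 80, 160, 300 ]
```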