Troubleshooting Azure SNAT Port Exhaustion for the Loome Agent

This article explains how to resolve intermittent connection timeouts that can occur when the Loome Agent is deployed within an Azure Virtual Network (VNet).

The Problem

When the Loome Agent runs in an Azure Container Instance (ACI) inside an Azure Virtual Network, it may experience intermittent timeouts or dropped connections when communicating with our platform. These can delay data processing and cause unexpected failures.

The Root Cause

By default, an Azure Container Instance deployed to a delegated subnet in a VNet, with no explicit outbound method configured (such as a NAT Gateway or Azure Firewall), uses a shared pool of public IP addresses and a limited number of SNAT (Source Network Address Translation) ports for all outbound connections.

This becomes a problem when an application, such as the Loome Agent, maintains persistent, long-lived connections. Each connection consumes a port from the shared pool and holds it for as long as the connection stays open, so ports return to the pool slowly. In a high-volume environment the shared pool can therefore run out of free ports, at which point new connections cannot be established and requests time out.

This is a networking limitation of the default Azure setup and not an issue with the Loome Agent itself.
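
The exhaustion mechanism described above can be illustrated with a toy model. This is only a sketch: the class and function names are hypothetical, the pool size is arbitrary, and the code does not reflect how Azure actually allocates SNAT ports. It shows only that connections which never release their ports eventually starve new connections.

```python
class SnatPortPool:
    """Toy model of a fixed-size SNAT port pool (not Azure's real allocator)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = set()

    def acquire(self) -> int:
        """Reserve a free port, or raise if every port is taken."""
        for port in range(self.size):
            if port not in self.in_use:
                self.in_use.add(port)
                return port
        raise ConnectionError("SNAT pool exhausted: no free ports")

    def release(self, port: int) -> None:
        """Return a port to the pool when its connection closes."""
        self.in_use.discard(port)


def simulate(pool_size: int, long_lived: int, short_lived: int) -> int:
    """Return how many short-lived connections fail to obtain a port."""
    pool = SnatPortPool(pool_size)

    # Long-lived connections take a port each and never give it back.
    held = []
    for _ in range(long_lived):
        try:
            held.append(pool.acquire())
        except ConnectionError:
            break  # the pool is already full

    failures = 0
    for _ in range(short_lived):
        try:
            port = pool.acquire()
        except ConnectionError:
            failures += 1
        else:
            pool.release(port)  # short-lived: the port goes straight back
    return failures
```

With these assumptions, simulate(pool_size=8, long_lived=7, short_lived=5) returns 0 because one port is still free to be reused, while simulate(pool_size=8, long_lived=8, short_lived=5) returns 5 because every new connection finds the pool empty.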

Solution Path 1: Update to the Latest Loome Agent

We have released an updated version of the Loome Agent that is more resilient to network failures. This version includes enhanced retry logic and backoff strategies to gracefully handle connection drops and timeouts.

Recommended Action: We strongly recommend updating the Loome Agent to the latest version. This is the simplest and most effective solution to stabilize communication without requiring any changes to your Azure network configuration.
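
For readers unfamiliar with the technique, the following is a generic sketch of retrying with exponential backoff and jitter. It is not the Loome Agent's actual implementation. The function name, parameters, and default values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying connection errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time up to the ceiling so that
            # many clients do not all retry at the same moment.
            sleep(random.uniform(0, ceiling))
```

The sleep parameter exists only so the wait can be replaced in tests. A caller would normally pass just the operation, for example call_with_backoff(send_heartbeat), where send_heartbeat is a hypothetical function that raises ConnectionError or TimeoutError on failure.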

Solution Path 2: Enhance Your Azure Network Route

While we strongly recommend updating the agent, you can also configure a network-level solution, for example if you need one for other services. An explicit outbound method gives the subnet a dedicated pool of IP addresses and SNAT ports, which makes exhaustion far less likely.

You have two primary options:

  1. Deploy an Azure NAT Gateway:
    • What it is: A highly scalable and resilient Network Address Translation service.
    • How it helps: A NAT Gateway provides one or more dedicated public IP addresses and a large pool of SNAT ports for your VNet’s subnet. All outbound traffic from the Loome Agent is routed through this gateway, so its connections no longer compete for the shared default pool. This is the recommended and simplest networking solution for this issue.
  2. Deploy an Azure Firewall:
    • What it is: A managed, cloud-based network security service that provides threat protection for your VNet resources.
    • How it helps: By routing all outbound traffic through an Azure Firewall, you gain a static public IP address and a dedicated SNAT port pool. Besides stabilizing connections, the firewall lets you centrally manage and filter all outbound traffic from your VNet.

Both options replace the shared IP address pool with a dedicated one. We recommend the NAT Gateway as the most straightforward and cost-effective way to address SNAT port exhaustion specifically.
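
To give a sense of scale for the NAT Gateway's "large pool", the sketch below computes the SNAT port inventory from the number of attached public IP addresses. The two constants are the figures Azure documents for NAT Gateway at the time of writing (64,512 SNAT ports per public IP and a maximum of 16 public IPs). Treat them as assumptions and confirm them against the current Azure documentation. The function name is hypothetical.

```python
# Figures documented by Azure for NAT Gateway; verify before relying on them.
SNAT_PORTS_PER_PUBLIC_IP = 64_512
MAX_PUBLIC_IPS = 16


def nat_gateway_snat_ports(public_ips: int) -> int:
    """Return the total SNAT ports available for a given number of public IPs."""
    if not 1 <= public_ips <= MAX_PUBLIC_IPS:
        raise ValueError(f"public_ips must be between 1 and {MAX_PUBLIC_IPS}")
    return public_ips * SNAT_PORTS_PER_PUBLIC_IP


print(nat_gateway_snat_ports(1))   # 64512
print(nat_gateway_snat_ports(16))  # 1032192
```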

Summary

The most direct solution is to upgrade to the latest Loome Agent. However, if you manage a large-scale or complex Azure environment, configuring an explicit outbound route with an Azure NAT Gateway or Azure Firewall is a best practice that gives every service in the VNet a dedicated, predictable outbound path.