September 6, 2023 | September 5, 2023 | Joe Carlyle

Azure Networking BCDR – Azure Back to School 2023

Introduction

We are now several days into another year of Azure Back to School. I hope you’ve enjoyed the content so far as much as I have. Thanks again to the team for organising and having me back for another year; I can’t wait for the rest of the month. Check it all out over at https://azurebacktoschool.github.io/

This year, I’m going to take a look at some of the challenges that BCDR (Business Continuity and Disaster Recovery) can pose to Azure Networking. This is something I have seen pop up quite a lot recently, as companies move to solidify their footprint, close gaps, and make use of all Azure has to offer to “keep the lights on”.

High-Level Architecture

As with many of my articles like this, it is important to call out the scope of the discussion. Azure is a vast platform, and I will be the first to say that every environment is unique. As such, this doesn’t aim to be an exhaustive, several-thousand-word piece covering every scenario. For the sake of discussion, it will focus on these core network services: Virtual Network, Virtual Network Gateway, Azure Firewall, Network Security Group, Route Table, and Public IP.

Core BCDR Components

In a similar introductory fashion, it is also important to highlight the Azure BCDR concepts that are included in the discussion. Essentially, an understanding of what an Azure region is and what Availability Zones are will cover you here.

Network Architecture and Outage Scenario

OK, so let’s look at a typical production network setup in Azure. The Azure Architecture Center has some excellent materials and guides, but we’re going to focus on one: Hub-Spoke.

As you can see from the diagram, this includes several of the services I have mentioned. Some secondary ones, like Public IP addresses, are not shown explicitly, but Bastion, Firewall and VPN Gateway each require one.

Network Services Alignment

So let’s look at where these services align to core Azure BCDR requirements. One thing to note here is that Azure divides its services up into different categories based on their regional availability by design:

Foundational – Available in all recommended and alternate regions when a region is generally available, or within 90 days of a new foundational service becoming generally available.
Mainstream – Available in all recommended regions within 90 days of a region’s general availability. Mainstream services are demand-driven in alternate regions, and many are already deployed into a large subset of alternate regions.
Strategic – Targeted service offerings, often industry-focused or backed by customized hardware. Strategic services are demand-driven for availability across regions, and many are already deployed into a large subset of recommended regions.

It also then divides how Azure services support Availability Zones:

Zonal – A resource can be deployed to a specific, self-selected availability zone to achieve more stringent latency or performance requirements. Resiliency is self-architected by replicating applications and data to one or more zones within the region. Resources are aligned to a selected zone.
Zone-Redundant – Resources are replicated or distributed across zones automatically. Think ZRS Storage Account as an example.
Always-Available – Always available across all Azure geographies and are resilient to zone-wide outages and region-wide outages.

Finally, before we get to our specific services, remember that not all Azure regions are equal. Some have all services, some don’t. Some support Availability Zones, some don’t. Make sure you are confirming your requirements against your proposed region – every time – as updates happen quickly!

Ok, so onto our specific network services and how they align:

Virtual Network – Foundational & Zone-Redundant
Virtual Network Gateway – Foundational & Zone-Redundant*
Azure Firewall – Mainstream & Zonal & Zone-Redundant*
Network Security Group – Foundational & Zone-Redundant
Route Table – Foundational & Zone-Redundant
Public IP – Foundational & Zonal & Zone-Redundant*

*SKU dependent, not all SKUs have the same feature set.

The first thing you should notice is that all of these networking services have really strong coverage for BCDR. However, not one of them is regionally resilient. That means that, regardless of your in-region zonal design, you may need additional regional configuration and deployment, depending on your requirements.

Let’s look within a single region first, using our same example deployment architecture. Within a fully supported region such as North Europe (remember, always check!), we can deploy the entire architecture to be zone-redundant. This means that should an entire zone be lost, our network services will stay active. This is the equivalent of a 99.99% SLA in Azure terms. Obviously this requires some small tweaks during deployment to achieve, and a slight uptick in cost due to SKU requirements, but this is honestly an excellent baseline to work from.
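As a rough illustration of those "small tweaks", here is a minimal Bicep sketch of a zone-redundant Public IP and Azure Firewall. The parameter names, resource names and the existing AzureFirewallSubnet are assumptions for the example and do not come from the reference architecture; the key part is the zones array on both resources.

```bicep
// Hypothetical parameter names; adjust to your own naming convention.
param location string = 'northeurope'
param firewallName string = 'afw-hub-neu'

// Assumed: resource ID of an existing subnet named AzureFirewallSubnet in the hub VNET.
param firewallSubnetId string

// Standard SKU Public IP spread across all three zones (zone-redundant).
resource firewallPip 'Microsoft.Network/publicIPAddresses@2023-04-01' = {
  name: '${firewallName}-pip'
  location: location
  sku: {
    name: 'Standard'
  }
  zones: [
    '1'
    '2'
    '3'
  ]
  properties: {
    publicIPAllocationMethod: 'Static'
  }
}

// Azure Firewall deployed across the same three zones.
resource firewall 'Microsoft.Network/azureFirewalls@2023-04-01' = {
  name: firewallName
  location: location
  zones: [
    '1'
    '2'
    '3'
  ]
  properties: {
    sku: {
      name: 'AZFW_VNet'
      tier: 'Standard'
    }
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          subnet: {
            id: firewallSubnetId
          }
          publicIPAddress: {
            id: firewallPip.id
          }
        }
      }
    ]
  }
}
```

Treat this as a starting point only. A real hub would normally attach a firewall policy as well, and the region has to support Availability Zones for the zones property to be accepted.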

Challenges

One challenge here: I am not aware of a service that allows you to modify its zonal configuration after deployment. You must set it at deployment. This means if you’re approaching an existing environment with this in mind, you might have quite a few maintenance windows and rebuilds ahead of you. Bicep is your friend here for testing and deployment.

Obviously we then have the regional challenge. By challenge, I mean: if you need your services should a region go down, how do you deal with that in advance? When it comes to networking in Azure, there is no replication service or tick box to make it multi-region. Why not, you ask? That’s for a different post, but let’s look at what is needed.

Generally, as per our example design, you would deploy several networking elements ahead of time. You could in fact deploy the whole thing, if you have the budget for Azure Firewall in both regions. The network would then be viewed as a hot secondary, allowing you to run individual workloads there permanently, or as part of testing. Deploying these elements ahead of time greatly reduces your RTO, and if you have VMs, you will definitely need at least the Virtual Network as a target for Azure Site Recovery. Again, Bicep can really help here, but ultimately I would recommend having everything within budget deployed ahead of time. Small items, like a Public IP that sits on a vendor allow-list, can catch you out with BCDR. Azure only allocates these addresses on deployment (of the Prefix, if you use one, and if you’re not using prefixes, why not!?), so get them deployed and added to your vendors’ lists ahead of time. Similarly, you can plan and runbook the changes required based on existing configuration.
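To make the allow-list point concrete, here is a small Bicep sketch that reserves a Public IP Prefix in a secondary region ahead of time. The names, the choice of West Europe as the secondary region and the /30 size are assumptions for illustration only.

```bicep
// Hypothetical names; West Europe is assumed as the secondary region for a North Europe primary.
param secondaryLocation string = 'westeurope'
param prefixName string = 'ippre-hub-weu'

// A /30 prefix reserves four contiguous public IPs up front,
// so the range is known and can be shared with vendors before any failover.
resource pipPrefix 'Microsoft.Network/publicIPPrefixes@2023-04-01' = {
  name: prefixName
  location: secondaryLocation
  sku: {
    name: 'Standard'
  }
  properties: {
    prefixLength: 30
    publicIPAddressVersion: 'IPv4'
  }
}

// Surface the allocated range so it can be passed to allow-list owners.
output allowListRange string = pipPrefix.properties.ipPrefix
```

Public IPs for the secondary Firewall or Gateway can later be created from this prefix, so the addresses your vendors have on file do not change when the rest of the footprint is built or rebuilt.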

Unavoidable Issues

With zone redundancy deployments, I would call out two issues, and both have already been highlighted in brief: zone configuration has to be actioned at time of deployment, and the required SKUs cost more. Configuration wise, for networking it’s fairly simple and shouldn’t pose challenges.

With regional redundancy, there are quite a few more. A lot of it is down to the complexity of running two regions and two footprints, and the fact that replication methods do not exist for all services. For example, you can replicate a Virtual Machine, but there is no way to replicate Azure Firewall natively. There is also cost, of course: having two footprints in theory means double your network costs. Unfortunately, as we all know, cost is only a challenge before an outage. During one, you would have unlimited budget to recover!

Closing Recommendation

To sum up Azure Networking BCDR: zone redundancy for a standard footprint is very achievable, and is definitely the way to go. If you need regional redundancy, try to build ahead everything you can to mirror the primary region.

March 15, 2023 | April 22, 2023 | Joe Carlyle

Exploring – Azure Firewall Basic

Anyone who follows this blog knows that Azure Firewall is a key resource for me in successful Azure deployments. Its combination of ease of deployment and functionality easily outpaces alternative vendor choices on Azure. Up until now, we have had a Standard and a Premium SKU, with Premium adding features on top of Standard. Now we have a Basic SKU, with several features removed. Let’s explore what the Basic SKU offers.

First up, deployment and infrastructure. At its core, Basic is the same resource, meaning it still has built-in HA. However, it is fixed scale, meaning two instances only. Availability Zones are still covered, so SLAs of up to 99.99% are achievable. Fixed scale does mean more limited bandwidth: Basic has up to 250 Mbps, in comparison to Standard at up to 30 Gbps. That’s not a typo!

Microsoft call out the fact they are targeting SMB customers with this SKU. But that doesn’t mean that the features of Basic wouldn’t suit an Enterprise spoke, or specific environment requirement where cost vs features work.

So let’s take a look at the features included. The basics are all the same: multiple Public IPs, inbound/outbound NAT etc. (there is a full list here). Some specifics worth calling out are:

Network Rules – As Basic does not support DNS Proxy, you can only use standard, non-FQDN rules in Network filtering. More on that for Standard here.
Threat Intelligence – While it can be enabled, it can only be used in alert mode. This means you would have to accept this, and/or monitor logs to adjust rules based on alerts.
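For illustration, the two limitations above might look like this in a Basic firewall policy. This is a minimal Bicep sketch with hypothetical names, not a complete deployment; the firewall itself and its subnets are left out.

```bicep
// Hypothetical names; adjust to your environment.
param location string = 'northeurope'
param policyName string = 'afwp-basic-neu'

resource basicPolicy 'Microsoft.Network/firewallPolicies@2023-04-01' = {
  name: policyName
  location: location
  properties: {
    sku: {
      tier: 'Basic'
    }
    // Basic supports Threat Intelligence in alert mode only.
    threatIntelMode: 'Alert'
    // No dnsSettings block here: DNS Proxy is not available on Basic,
    // so Network rules in this policy should not rely on FQDNs.
  }
}
```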

This means that once you are aware of the functionality and limitations, Basic may be a great choice for your environment. Especially when you consider one of its main benefits – cost. There are two costs associated directly with Azure Firewall:

Deployment
Data Processing

Deployment wise, Basic is considerably cheaper than Standard. Deploying to North Europe, Basic should be approximately €266/month in comparison to Standard at circa €843/month.

However, data processing is more expensive on Basic. 1 TB of data processing for Basic in North Europe will be approximately €62/month, which is quite a bit more than Standard at around €15/month. So this is definitely one to keep an eye on in your environment. There is no reservation or similar choice here; Standard and Premium simply have a lower processing price.
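As a rough worked example using only the approximate North Europe figures quoted here, and assuming data processing is billed linearly per TB, you can estimate the monthly volume x (in TB) at which Standard becomes the cheaper SKU:

```latex
\begin{align*}
C_{\text{Basic}}(x) &= 266 + 62x \\
C_{\text{Standard}}(x) &= 843 + 15x \\
266 + 62x = 843 + 15x \;&\Rightarrow\; 47x = 577 \;\Rightarrow\; x \approx 12.3 \text{ TB/month}
\end{align*}
```

So, on these numbers, Basic wins on cost below roughly 12 TB of processed data per month, and Standard wins above it. Prices change, so rerun the sum with current figures for your region.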

Thankfully, integration with Azure Monitor is unchanged across SKUs, so you can capture all of the data you need.

The experience within portal, or via shell for deployment and management is also unchanged. The portal dynamically calls out what is allowed/functional when using a Basic policy, so confusion is avoided.

In conclusion, I think Basic is a great addition to the AFW family. I would have liked to see DNS Proxy included in the feature set, as I see it deployed everywhere now and the Network rule functionality it adds is excellent. I am also interested to see how, or if, that throughput limit will come into play for specific scenarios.

As always, if there are any questions, please get in touch!

September 15, 2022 | September 12, 2022 | Joe Carlyle

We Need to Talk About VNETs – Azure Back to School 2022

Virtual Networks are arguably one of the most common resources in Azure. You will find them in the vast majority of environments facilitating some form of private or static network functionality. However, they haven’t always been around.

For those that remember the older Azure days, the ASM model didn’t have the same network concept. Virtual Networks as we know them today arrived with ARM, and they have changed drastically over the years.

Don’t get me wrong, at its core a Virtual Network is still just an address range: a private network of at least one subnet that sets your connectivity boundary. However, it has been a long time since I have seen a Virtual Network only operate in its basic capacity.

As a result of this ever growing list of services a Virtual Network offers, I think it is about time we talk about VNETs!

But aren’t VNETs straightforward? I deploy them all the time, etc. Yes, they can be, but they also offer an enormous range of network services. As an abstract piece of evidence, did you know that if you download the Virtual Network documentation page on Docs to a PDF, it is 848 pages!?

First off, what this article is not – an explanation of the basic elements of a VNET. For example subnets and address spacing. So presuming you have some familiarity, let’s discuss what I like to call secondary services.

So, what is a secondary service? For me, it’s a service that cannot exist, or serves no purpose, without a VNET. Think Bastion, Route Server, NSGs etc. They all serve specific purposes but commonly enhance the functionality of a VNET. Some of these, like Bastion, I feel would be better if included within the VNET resource, as Service Endpoints are. However, that is for another post!

They are also drastically different in their complexity. For example, a Service Endpoint can be deployed without much effort and barely any planning (just double check those routes). However, Route Server requires significant elements of both.

While this list of secondary services is ever growing, I do not necessarily think this is a bad thing. I am always all for extra functionality. However, understanding that you cannot simply deploy a VNET and have the majority of network features that most people use is something that should be clearer. There are new services released like Network Manager that will help with management, but none offer a single view of everything.

To both convey the complexity and help simplify things (weird, I know), I thought it best to pick two services, create a test environment so you can try them, and discuss some of the components. So I’ve chosen two of the newer services:

Azure Route Server
NAT Gateway

Azure Route Server

Route Server is an excellent addition to routing services within Azure. Previously, some routes could originate from Virtual Network Gateways via BGP, there were the system routes, and everything else had to go via Route Table. While this works, it can be cumbersome, and management of Route Tables at scale is almost non-existent. Route Server solves some of those problems by allowing BGP interaction between NVAs, Virtual Network Gateways and your VNET’s system routing table. It’s also nice that it is a managed service, and HA out-of-the-box.

The objective of Route Server is to simplify and centralise routing management. This is helped by using a default peering process, meaning if your NVA supports BGP, it should work with Route Server. It also natively supports peering of VNETs with the same switch as “use remote gateway”, meaning it slots very neatly into Hub-Spoke designs, including the use of Virtual Network Gateways as peers (note VPN VNGs have to be configured in active-active mode).

As with many secondary services, Route Server requires a dedicated subnet in your VNET, and each VNET can only have a single Route Server. The subnet does not allow the addition of an NSG or a UDR. Route Server now also requires a Public IP, which may flag as a concern. However, this is only to guarantee access to management services and does not open the VNET (according to Microsoft). Also, no data traffic is sent between Route Server and your NVAs.
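To show how those requirements fit together, here is a minimal Bicep sketch of the dedicated subnet, the Public IP and the Route Server itself. The parameter names and address prefix are assumptions for the example. The subnet does have to be named RouteServerSubnet, and in ARM terms Route Server is deployed as a virtualHubs resource with an ipConfigurations child.

```bicep
// Hypothetical parameters; the hub VNET is assumed to exist already.
param location string = resourceGroup().location
param hubVnetName string
param routeServerName string = 'rs-hub'
// Assumed free range in the hub VNET; Route Server needs at least a /27.
param routeServerSubnetPrefix string = '10.0.10.0/27'

resource hubVnet 'Microsoft.Network/virtualNetworks@2023-04-01' existing = {
  name: hubVnetName
}

// Dedicated subnet: the name is mandatory, and no NSG or UDR is attached.
resource routeServerSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-04-01' = {
  parent: hubVnet
  name: 'RouteServerSubnet'
  properties: {
    addressPrefix: routeServerSubnetPrefix
  }
}

// Public IP used for access to management services only.
resource routeServerPip 'Microsoft.Network/publicIPAddresses@2023-04-01' = {
  name: '${routeServerName}-pip'
  location: location
  sku: {
    name: 'Standard'
  }
  properties: {
    publicIPAllocationMethod: 'Static'
  }
}

resource routeServer 'Microsoft.Network/virtualHubs@2023-04-01' = {
  name: routeServerName
  location: location
  properties: {
    sku: 'Standard'
  }
}

// Binds the Route Server to its subnet and Public IP.
resource routeServerIpConfig 'Microsoft.Network/virtualHubs/ipConfigurations@2023-04-01' = {
  parent: routeServer
  name: 'ipconfig1'
  properties: {
    subnet: {
      id: routeServerSubnet.id
    }
    publicIPAddress: {
      id: routeServerPip.id
    }
  }
}
```

BGP peers for your NVAs would be added afterwards as separate child resources; they are left out here to keep the sketch short.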

However, it’s important to note some of the configurations where Route Server alone is not the answer. In some cases it raises the question: if I still have to use UDRs for that, why should I bother with Route Server? For example, ExpressRoute will advertise routes that are preferred over Route Server routes, meaning you would need to override them with a UDR. You cannot simply turn off the ER advertisement, as this runs over the same peering functionality. A nice fix here would be to split that choice into two switches: one for VNGs, one for Route Server.

Another element that may be important is price. VNETs are free, UDRs are free, Route Server is far from that. On many large environments, this may be a negligible cost. However, you should weigh up the benefits vs the cost of introducing Route Server.

So to help, as promised, here is a repo that will build a test footprint for you. I’ve taken the Route Server tutorial using Quagga and integrated it with the other services from this article. You can follow the steps to complete the configuration and confirm you have a functioning peer. You should see output similar to the below from Cloud Shell:

NAT Gateway

Implicit internet outbound is potentially one of the Azure network features that surprise most people. Deploy a VM into a VNET and you will be able to reach the internet with a random IP from the region deployed. Not exactly a dream scenario for many admins!

NAT Gateway addresses this by giving a subnet a static, known outbound IP. However, VMs are not where I see it used most often. That doesn’t mean it’s not a good solution for VMs; it works exactly the same and works well. I just more commonly see it used to facilitate static outbound IPs for PaaS resources, like an App Service that requires a static IP due to a vendor allow list.

One interesting piece here: NAT Gateway, when configured on a subnet, will take precedence over locally attached Public IPs and Standard Load Balancer outbound NAT rules. However, a UDR for 0.0.0.0/0 will still override this. Another item of note: no ICMP support, only TCP/UDP.
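As a minimal sketch of what "configured on a subnet" means in Bicep, the example below creates a NAT Gateway with one static Public IP and associates it with a subnet. The VNET, subnet name and address prefix are assumptions for illustration and are not taken from the repo.

```bicep
// Hypothetical parameters; the VNET is assumed to exist already.
param location string = resourceGroup().location
param vnetName string
param subnetName string = 'snet-workload'
param subnetPrefix string = '10.0.1.0/24'
param natGatewayName string = 'ng-demo'

// Static Standard Public IP that becomes the outbound address.
resource natPip 'Microsoft.Network/publicIPAddresses@2023-04-01' = {
  name: '${natGatewayName}-pip'
  location: location
  sku: {
    name: 'Standard'
  }
  properties: {
    publicIPAllocationMethod: 'Static'
  }
}

resource natGateway 'Microsoft.Network/natGateways@2023-04-01' = {
  name: natGatewayName
  location: location
  sku: {
    name: 'Standard'
  }
  properties: {
    idleTimeoutInMinutes: 4
    publicIpAddresses: [
      {
        id: natPip.id
      }
    ]
  }
}

resource vnet 'Microsoft.Network/virtualNetworks@2023-04-01' existing = {
  name: vnetName
}

// Associating the NAT Gateway here is what changes the subnet's outbound path.
resource workloadSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-04-01' = {
  parent: vnet
  name: subnetName
  properties: {
    addressPrefix: subnetPrefix
    natGateway: {
      id: natGateway.id
    }
  }
}
```

If the subnet also has a route table sending 0.0.0.0/0 to an NVA or firewall, that route still wins, as described above.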

To try out NAT Gateway, I have again included it within a repo. This will also deploy an Ubuntu VM, which you can connect and log in to using Bastion. This VM has a Public IP locally attached but is deployed to a subnet with a NAT Gateway. So, use Bastion to connect, then simply copy the ipcheck script and paste it into the command line. It will give you an output similar to the below, which you can then verify against your NAT Gateway resource, proving NAT Gateway is taking precedence over the locally attached IP.

NAT Gateway seen as the outbound IP publicly

Roundup

In closing, I think that more and more secondary services do two things:

Makes networking in Azure ever more complex
Solidifies VNETs as the most important core resource

Now, everyone should agree with number one. However, number two may cause some concern, but hear me out. Regardless of your resource deployment, your application architecture etc., 99 times out of 100 you will deploy a VNET, and 9 times out of 10 you will need at least one secondary service. This means that getting it right, and having it well designed for deployment, management etc., is crucial. Not everyone loves networking, but within Azure at the moment, you’ve gotta learn it!

Speaking of learning, if networking is your thing, check out the most appropriate Azure exam – Microsoft Certified: Azure Network Engineer Associate.

July 14, 2022 | July 15, 2022 | Joe Carlyle

Exploring – Azure Firewall Analytics

Azure Firewall is ever growing in popularity as a choice when it comes to perimeter protection for Azure networking. The introduction of additional SKUs (Premium and Basic) since its launch has made it more functional while also increasing its appeal to a broader environment footprint.

For anyone who has used Azure Firewall since the beginning, troubleshooting and analysis of your logs has always had a steep-ish learning curve. On one hand, the logs are stored in Log Analytics and you can query them using Kusto, so there is familiarity. However, without context, their formatting can be challenging. The good news is, this is being improved with the introduction of a new format.

Previously, logs were stored using Azure Diagnostics mode. With this update, we will now see the use of Resource-Specific mode. This is something that will become more common across many Azure resources, and you should see it appear for several in the Portal already.

What difference will this make for Azure Firewall? This will mean individual tables in the selected workspace are created for each category selected in the diagnostic setting. This offers the following improvements:

Makes it much easier to work with the data in log queries
Makes it easier to discover schemas and their structure
Improves performance across both ingestion latency and query times
Allows you to grant Azure RBAC rights on a specific table

For Azure Firewall, the new resource specific tables are below:

Network rule log – Contains all Network Rule log data. Each match between data plane and network rule creates a log entry with the data plane packet and the matched rule’s attributes.
NAT rule log – Contains all DNAT (Destination Network Address Translation) events log data. Each match between data plane and DNAT rule creates a log entry with the data plane packet and the matched rule’s attributes.
Application rule log – Contains all Application rule log data. Each match between data plane and Application rule creates a log entry with the data plane packet and the matched rule’s attributes.
Threat Intelligence log – Contains all Threat Intelligence events.
IDPS log – Contains all data plane packets that were matched with one or more IDPS signatures.
DNS proxy log – Contains all DNS Proxy events log data.
Internal FQDN resolve failure log – Contains all internal Firewall FQDN resolution requests that resulted in failure.
Application rule aggregation log – Contains aggregated Application rule log data for Policy Analytics.
Network rule aggregation log – Contains aggregated Network rule log data for Policy Analytics.
NAT rule aggregation log – Contains aggregated NAT rule log data for Policy Analytics.

So, let’s start with getting logs enabled on your Azure Firewall. You can’t query your logs if there are none, and Azure Firewall does not enable this by default. I’d generally recommend enabling logs as part of your build process, and I have an example of that using Bicep over on GitHub (note this is Diagnostics mode, I will update it for Resource mode soon!). However, if already built, let’s look at simply doing this via the Portal.

So on our Azure Firewall blade, head to the Monitoring section and choose “Diagnostic settings”

We’re then going to choose all our new resource specific log options

*New resource specific log categories in Portal*

Next, we choose to send to a workspace, and make sure to switch to Resource specific.

*Workspace option with Resource option chosen in Portal*

Finally, give your settings a name, I generally use my resource convention here, and click Save.

It takes a couple of minutes for logs to stream through, so while that happens, let’s look at what is available for analysis on Azure Firewall out-of-the-box – Metrics.

While there are not many entries available, what is there can be quite useful to see what sort of strain your Firewall is under.

Hit counts are straightforward; they can give you an insight into how busy the service is. Data Processed and Throughput are also somewhat interesting from an analytics perspective. However, it is Health State and SNAT that are most useful in my opinion. These are metrics you should enable alerts against.

For example, an alert rule for SNAT utilisation reaching an average of NN% can be very useful to ensure scale is working and within limits for your service and configuration of IPs.
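A minimal Bicep sketch of such an alert rule is below. The threshold is deliberately a parameter with no default, as the right percentage depends on your environment. The resource names, evaluation windows, severity and action group are assumptions for the example. SNATPortUtilization is the metric name as I understand Azure Monitor to expose it for Azure Firewall, so verify it against the metrics list on your own Firewall.

```bicep
// Hypothetical parameters; supply the IDs of your own Firewall and action group.
param firewallId string
param actionGroupId string
// Percentage at which to alert; intentionally no default.
param snatThresholdPercent int

resource snatAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'afw-snat-utilisation'
  location: 'global'
  properties: {
    description: 'Average SNAT port utilisation on Azure Firewall is above the agreed threshold'
    severity: 2
    enabled: true
    scopes: [
      firewallId
    ]
    // Assumed windows: evaluate every 5 minutes over the last 15 minutes.
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'SnatUtilisation'
          metricNamespace: 'Microsoft.Network/azureFirewalls'
          metricName: 'SNATPortUtilization'
          operator: 'GreaterThan'
          threshold: snatThresholdPercent
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}
```

A similar rule could be pointed at the Health State metric by swapping the metric name and operator.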

Ok, back to our newly enabled Resource logs. When you open the logs tab on your Firewall, if you haven’t disabled it, you should see a queries screen pop-up as below:

You can see there are now two sections, one specifically for Resource Specific tables. If I simply run the following query:

AZFWNetworkRule

I get a structured and clear output:

To get a comparative output using Diagnostics Table, I need to run a query similar to the below:

// Network rule log data 
// Parses the network rule log data. 
AzureDiagnostics
| where Category == "AzureFirewallNetworkRule"
| where OperationName == "AzureFirewallNatRuleLog" or OperationName == "AzureFirewallNetworkRuleLog"
//case 1: for records that look like this:
//PROTO request from IP:PORT to IP:PORT.
| parse msg_s with Protocol " request from " SourceIP ":" SourcePortInt:int " to " TargetIP ":" TargetPortInt:int *
//case 1a: for regular network rules
| parse kind=regex flags=U msg_s with * ". Action\\: " Action1a "\\."
//case 1b: for NAT rules
//TCP request from IP:PORT to IP:PORT was DNAT'ed to IP:PORT
| parse msg_s with * " was " Action1b:string " to " TranslatedDestination:string ":" TranslatedPort:int *
//Parse rule data if present
| parse msg_s with * ". Policy: " Policy ". Rule Collection Group: " RuleCollectionGroup "." *
| parse msg_s with * " Rule Collection: "  RuleCollection ". Rule: " Rule 
//case 2: for ICMP records
//ICMP request from 10.0.2.4 to 10.0.3.4. Action: Allow
| parse msg_s with Protocol2 " request from " SourceIP2 " to " TargetIP2 ". Action: " Action2
| extend
    SourcePort = tostring(SourcePortInt),
    TargetPort = tostring(TargetPortInt)
| extend 
    Action = case(Action1a == "", case(Action1b == "",Action2,Action1b), split(Action1a,".")[0]),
    Protocol = case(Protocol == "", Protocol2, Protocol),
    SourceIP = case(SourceIP == "", SourceIP2, SourceIP),
    TargetIP = case(TargetIP == "", TargetIP2, TargetIP),
    //ICMP records don't have port information
    SourcePort = case(SourcePort == "", "N/A", SourcePort),
    TargetPort = case(TargetPort == "", "N/A", TargetPort),
    //Regular network rules don't have a DNAT destination
    TranslatedDestination = case(TranslatedDestination == "", "N/A", TranslatedDestination), 
    TranslatedPort = case(isnull(TranslatedPort), "N/A", tostring(TranslatedPort)),
    //Rule information
    Policy = case(Policy == "", "N/A", Policy),
    RuleCollectionGroup = case(RuleCollectionGroup == "", "N/A", RuleCollectionGroup ),
    RuleCollection = case(RuleCollection == "", "N/A", RuleCollection ),
    Rule = case(Rule == "", "N/A", Rule)
| project TimeGenerated, msg_s, Protocol, SourceIP,SourcePort,TargetIP,TargetPort,Action, TranslatedDestination, TranslatedPort, Policy, RuleCollectionGroup, RuleCollection, Rule

Obviously, there is a large visual difference in complexity! But there are also all of the benefits as described earlier for Resource Specific. I really like the simplicity of the queries. I also like the more structured approach. For example, take a look at the set columns that are supplied on the Application Rule table. You can now predict, understand, and manipulate queries with more detail than ever before. You can check out all the new tables by searching “AZFW” on this page.

Finally, a nice sample query to get you started. One that I use quite often when checking on new services added, or if there are reports of access issues. The below gives you a quick glance into web traffic being blocked and can allow you to spot immediate issues.

AZFWApplicationRule
| where Action == "Deny"
| distinct Fqdn
| sort by Fqdn asc

As usual, if there are any questions, get in touch!

June 30, 2022 | July 15, 2022 | Joe Carlyle

How To – Enable Azure Firewall Resource Specific Diagnostics

There is a new format of logs coming to Azure resources. Currently most people are familiar with what is called Diagnostics Table logs. The resource log for each Azure service has a unique set of columns. The AzureDiagnostics table includes the most common columns used by Azure services. If a resource log includes a column that doesn’t already exist in the AzureDiagnostics table, that column is added the first time that data is collected. If the maximum number of 500 columns is reached, data for any additional columns is added to a dynamic column.

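Resource logs are platform logs that provide insight into operations that were performed within an Azure resource. The content of resource logs varies by the Azure service and resource type, and they aren’t collected by default. In Resource Specific mode, instead of sharing the AzureDiagnostics table, each log category selected in the diagnostic setting is written to its own table in the workspace.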

So onto enabling them. Via the Portal, this is straightforward in terms of choice and is well documented here. However, when I went to include this enablement in a Bicep build that I have, I noticed there wasn’t anything clearly documented. So, here is an example using Azure Firewall.

Normally, my diagnostics resource looks like the below and this enables Diagnostics table logs:

resource azfwDiags 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: '${afwName}-diags'
  scope: azFW
  properties: {
    logs: [
      {
        category: 'AzureFirewallApplicationRule'
        enabled: true
        retentionPolicy: {
          days: 90
          enabled: true
        }
      }
      {
        category: 'AzureFirewallNetworkRule'
        enabled: true
        retentionPolicy: {
          days: 90
          enabled: true
        }
      }
    ]
    workspaceId: log
  }
}

However, to enable Resource Specific, a few changes are required. Obviously the category names are different, but you also need to include the property logAnalyticsDestinationType, as you see below on line 5.

resource azfwDiags 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: '${afwName}-diags'
  scope: azFW
  properties: {
    logAnalyticsDestinationType: 'Dedicated'
    logs: [
      {
        category: 'AZFWApplicationRule'
        enabled: true
        retentionPolicy: {
          days: 90
          enabled: true
        }
      }
      {
        category: 'AZFWNetworkRule'
        enabled: true
        retentionPolicy: {
          days: 90
          enabled: true
        }
      }
    ]
    workspaceId: log
  }
}

Using the resource above within your Bicep code will allow you to deploy Resource Specific diagnostics settings as needed.
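If you want more than the two categories shown, one option is to loop over a list of category names. The sketch below assumes the category names match the resource specific table names (the AZFW prefix), which is my understanding but worth verifying against the categories offered in the Portal for your Firewall. The parameter names are hypothetical, and retentionPolicy is omitted for brevity.

```bicep
// Hypothetical parameters: an existing Firewall and a Log Analytics workspace resource ID.
param firewallName string
param workspaceId string

resource firewall 'Microsoft.Network/azureFirewalls@2023-04-01' existing = {
  name: firewallName
}

// Assumed category names; confirm against the diagnostic settings blade.
var resourceSpecificCategories = [
  'AZFWNetworkRule'
  'AZFWNatRule'
  'AZFWApplicationRule'
  'AZFWThreatIntel'
  'AZFWIdpsSignature'
  'AZFWDnsQuery'
  'AZFWFqdnResolveFailure'
  'AZFWApplicationRuleAggregation'
  'AZFWNetworkRuleAggregation'
  'AZFWNatRuleAggregation'
]

resource azfwDiagsAll 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: '${firewallName}-diags'
  scope: firewall
  properties: {
    // 'Dedicated' selects Resource Specific tables instead of AzureDiagnostics.
    logAnalyticsDestinationType: 'Dedicated'
    workspaceId: workspaceId
    logs: [for category in resourceSpecificCategories: {
      category: category
      enabled: true
    }]
  }
}
```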

As usual, if there are any questions, get in touch!

wedoAzure

A blog about Microsoft Azure
