How to – Enable Alerts for Azure ExpressRoute

Azure ExpressRoute is Microsoft’s recommended method of accessing resources in Azure over a private connection. It is unique to the platform and can offer unparalleled resilience and performance. It also has the capability to allow you connect to Azure Public Resources, such as Storage Accounts, Azure SQL, over the same circuit and therefore, private connection.

With the above being possible, it’s clear that ExpressRoute, once implemented, becomes the key resource in your environment. As such, you would be wise to implement some level of monitoring and alerting for it.

Here, I’m going to show you how to very quickly put in place alerts should either of the required BGP sessions drop for your ExpressRoute circuit.

To add some context to that, each ExpressRoute circuit requires a pair of BGP sessions to meet SLA requirements. They will function with just one, but that’s not fully supported and definitely not recommended! So, presuming you have both active, we’ll base our alert off of that metric. We’re going to look specifically at Private Peering as it’s the most common implementation, but the logic is identical for Microsoft Peering.

If both are active and configured correctly, you should see the same routes being advertised within the circuit if you check your route table from the Peering blade

Next, let’s look at the Metric we will be using for the alert. In your ExpressRoute Circuit blade, choose Metrics. Then we’ll choose the ‘BGP Availability’ Metric and ensure Aggregation is ‘Avg’

Next, to future proof the alert, or if you have a Microsoft Peering, we’re going to apply some filtering. So choose ‘Add filter’ (highlighted above). Then we’re going to select the ‘Peering Type’ Property, and choose ‘private’ from the Values drop down and click the tiny blue tick to the right (You won’t see Microsoft if you don’t have one, but do this anyway to ensure your alert is tightly scoped).

Now, we’re looking at the average BGP availability, for our Private Peering only.

To make life easier, you can now just click ‘New Alert Rule’ on the top right to use this scoped metric for your alert.

This applies the right resource scope and brings the metric across too, you now just have to configure some of the required choices on specifics.

First up, let’s confirm the condition and signal logic. Click on the blue text under Condition. You will see Private is already selected for you. I’m going to change Operator to ‘less than or equal to’ and add a value of 95.

The above settings mean that the BGP Availability metric for my private peering, over the last 5 minutes, will be assessed every minute. Should it drop below or equal to 95%, an alert will fire. You can adjust this as needed to suit your own environment.

Now, we need to define what happens when they alert fires. Click Done on the signal logic pop out, this will bring you back to the alert configuration blade.

I would recommend you create and use an Action Group, even something simple like a distribution group to email. More on Action Groups – https://docs.microsoft.com/en-us/azure/azure-monitor/platform/action-groups

Next, define the final details for the alert. Use a descriptive name and the appropriate severity for your environment/system.

Now, the alert itself can take some time to become fully active, so I would recommend waiting about thirty minutes before attempting a test. But, please ensure you test the alert!

And that’s it. Once tested, you have now setup a fully scoped alert rule for your ExpressRoute Private Peering. You can repeat the above, and change the scope to Microsoft Peering to cover that too if you have it.

As always, if there are any questions, get in touch!

How to – Troubleshoot Azure Firewall

Networking in Azure is one of my favourite topics. As my work has me focus primarily on Azure Virtual Datacenter builds, networking is key. When Microsoft introduced Azure Firewall (AFW), I was excited to see a platform based option as a hopeful alternative to the traditional NVAs. Feature wise in preview, AFW lacked some key functionality. Once it went GA a lot of the asks from the community were rectified however there are still some outstanding issues like cost, but all of that is for a blog post for another day!

AFW is used in a lot of environments. It’s simple to deploy, resilient and relatively straight forward to configure. However, once active in the environment, I noticed that finding out what is going wrong can be tricky. Hopefully this post helps with that and can save you some valuable time!

I don’t know about you, but the first thing I always check when trying to solve a problem is the most simple solution. For AFW that check is to make sure it’s not stopped. Yes that’s right you can “stop” AFW. It’s quick and easy to do via shell:

# Stop an existing firewall

$azfw = Get-AzFirewall -Name "FW Name" -ResourceGroupName "RG Name"
$azfw.Deallocate()
Set-AzFirewall -AzureFirewall $azfw

But how do you check if it has been stopped? Very simply, via the Azure Portal. On the overview blade for AFW it shows provisioning state. If this is anything but “Succeeded” you most likely have an issue.

So, how do you enable it again should you find your AFW deallocated? Again, quite simply via shell, however, it must be allocated to the original resource group and subscription. Also, while it deallocates almost instantly, it takes roughly the same amount of time to allocate AFW as it does to create one from scratch.

# Start a firewall

$azfw = Get-AzFirewall -Name "FW Name" -ResourceGroupName "RG Name"
$vnet = Get-AzVirtualNetwork -ResourceGroupName "RG Name" -Name "VNet Name"
$publicip = Get-AzPublicIpAddress -Name "Public IP Name" -ResourceGroupName " RG Name"
$azfw.Allocate($vnet,$publicip)
Set-AzFirewall -AzureFirewall $azfw

So, your AFW is active and receiving traffic via whatever method (NAT, Custom Route Tables etc.) and you have created rules to allow traffic as required. Don’t forget all traffic is blocked by default until you create rules.

By default the only detail you can get from AFW are metrics. These can show a small range of traffic with no granular detail, such as rules hit count.


To get detailed logs, like other Azure services, you need to enable them. I recommend doing this as part of your creation process. In terms of what to do, you have to add a diagnostic setting. There are two logs available, and I recommend choosing both.

  • AzureFirewallApplicationRule
  • AzureFirewallNetworkRule

In terms of where to send the logs, I like the integration offered by Azure Monitor Logs and there is a filtered shortcut right within the AFW blade too.

Once enabled, you should start seeing logs flowing into Azure Monitor Logs within five to ten minutes. One aspect that can be viewed as a slight negative is that logs are sent in JSON. As a result most of the interesting data you want is part of an object array:

{
  "category": "AzureFirewallNetworkRule",
  "time": "2018-06-14T23:44:11.0590400Z",
  "resourceId": "/SUBSCRIPTIONS/{subscriptionId}/RESOURCEGROUPS/{resourceGroupName}/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/{resourceName}",
  "operationName": "AzureFirewallNetworkRuleLog",
  "properties": {
      "msg": "TCP request from 111.35.136.173:12518 to 13.78.143.217:2323. Action: Deny"
  }
}

So, when running your queries, you need to parse that data. For those who have strong experience in Kusto, this will be no problem. For those who don’t, Microsoft thankfully provide guidance on how to parse both logs including explanatory comments

For ApplicationRule log

AzureDiagnostics
| where Category == "AzureFirewallApplicationRule"
//using :int makes it easier to pars but later we'll convert to string as we're not interested to do mathematical functions on these fields
//this first parse statement is valid for all entries as they all start with this format
| parse msg_s with Protocol " request from " SourceIP ":" SourcePortInt:int " " TempDetails
//case 1: for records that end with: "was denied. Reason: SNI TLS extension was missing."
| parse TempDetails with "was " Action1 ". Reason: " Rule1
//case 2: for records that end with
//"to ocsp.digicert.com:80. Action: Allow. Rule Collection: RC1. Rule: Rule1"
//"to v10.vortex-win.data.microsoft.com:443. Action: Deny. No rule matched. Proceeding with default action"
| parse TempDetails with "to " FQDN ":" TargetPortInt:int ". Action: " Action2 "." *
//case 2a: for records that end with:
//"to ocsp.digicert.com:80. Action: Allow. Rule Collection: RC1. Rule: Rule1"
| parse TempDetails with * ". Rule Collection: " RuleCollection2a ". Rule:" Rule2a
//case 2b: for records that end with:
//for records that end with: "to v10.vortex-win.data.microsoft.com:443. Action: Deny. No rule matched. Proceeding with default action"
| parse TempDetails with * "Deny." RuleCollection2b ". Proceeding with" Rule2b
| extend 
SourcePort = tostring(SourcePortInt)
|extend
TargetPort = tostring(TargetPortInt)
| extend
//make sure we only have Allowed / Deny in the Action Field
Action1 = case(Action1 == "Deny","Deny","Unknown Action")
| extend
    Action = case(Action2 == "",Action1,Action2),
    Rule = case(Rule2a == "",case(Rule1 == "",case(Rule2b == "","N/A", Rule2b),Rule1),Rule2a), 
    RuleCollection = case(RuleCollection2b == "",case(RuleCollection2a == "","No rule matched",RuleCollection2a),RuleCollection2b),
    FQDN = case(FQDN == "", "N/A", FQDN),
    TargetPort = case(TargetPort == "", "N/A", TargetPort)
| project TimeGenerated, msg_s, Protocol, SourceIP, SourcePort, FQDN, TargetPort, Action ,RuleCollection, Rule

For NetworkRule log

AzureDiagnostics
| where Category == "AzureFirewallNetworkRule"
//using :int makes it easier to pars but later we'll convert to string as we're not interested to do mathematical functions on these fields
//case 1: for records that look like this:
//TCP request from 10.0.2.4:51990 to 13.69.65.17:443. Action: Deny//Allow
//UDP request from 10.0.3.4:123 to 51.141.32.51:123. Action: Deny/Allow
//TCP request from 193.238.46.72:50522 to 40.119.154.83:3389 was DNAT'ed to 10.0.2.4:3389
| parse msg_s with Protocol " request from " SourceIP ":" SourcePortInt:int " to " TargetIP ":" TargetPortInt:int *
//case 1a: for regular network rules
//TCP request from 10.0.2.4:51990 to 13.69.65.17:443. Action: Deny//Allow
//UDP request from 10.0.3.4:123 to 51.141.32.51:123. Action: Deny/Allow
| parse msg_s with * ". Action: " Action1a
//case 1b: for NAT rules
//TCP request from 193.238.46.72:50522 to 40.119.154.83:3389 was DNAT'ed to 10.0.2.4:3389
| parse msg_s with * " was " Action1b " to " NatDestination
//case 2: for ICMP records
//ICMP request from 10.0.2.4 to 10.0.3.4. Action: Allow
| parse msg_s with Protocol2 " request from " SourceIP2 " to " TargetIP2 ". Action: " Action2
| extend
SourcePort = tostring(SourcePortInt),
TargetPort = tostring(TargetPortInt)
| extend 
    Action = case(Action1a == "", case(Action1b == "",Action2,Action1b), Action1a),
    Protocol = case(Protocol == "", Protocol2, Protocol),
    SourceIP = case(SourceIP == "", SourceIP2, SourceIP),
    TargetIP = case(TargetIP == "", TargetIP2, TargetIP),
    //ICMP records don't have port information
    SourcePort = case(SourcePort == "", "N/A", SourcePort),
    TargetPort = case(TargetPort == "", "N/A", TargetPort),
    //Regular network rules don't have a DNAT destination
    NatDestination = case(NatDestination == "", "N/A", NatDestination)
| project TimeGenerated, msg_s, Protocol, SourceIP,SourcePort,TargetIP,TargetPort,Action, NatDestination

Using either query gives you clear readable that you can filter. One tip however, is to add a sort command to the end of the queries, normally I use by TimeGenerated to show me the latest data. So to condense and add that for the NetworkRule query above, it would look like:

AzureDiagnostics
| where Category == "AzureFirewallNetworkRule"
| parse msg_s with Protocol " request from " SourceIP ":" SourcePortInt:int " to " TargetIP ":" TargetPortInt:int *
| parse msg_s with * ". Action: " Action1a
| parse msg_s with * " was " Action1b " to " NatDestination
| parse msg_s with Protocol2 " request from " SourceIP2 " to " TargetIP2 ". Action: " Action2
| extend SourcePort = tostring(SourcePortInt),TargetPort = tostring(TargetPortInt)
| extend Action = case(Action1a == "", case(Action1b == "",Action2,Action1b), Action1a),Protocol = case(Protocol == "", Protocol2, Protocol),SourceIP = case(SourceIP == "", SourceIP2, SourceIP),TargetIP = case(TargetIP == "", TargetIP2, TargetIP),SourcePort = case(SourcePort == "", "N/A", SourcePort),TargetPort = case(TargetPort == "", "N/A", TargetPort),NatDestination = case(NatDestination == "", "N/A", NatDestination)
| project TimeGenerated, msg_s, Protocol, SourceIP,SourcePort,TargetIP,TargetPort,Action, NatDestination
| sort by TimeGenerated desc

Using parsed data, you can immediately see all the traffic hitting AFW and, for example, filter on options such as Action to see only denied traffic.

Microsoft also provide pre-cooked visualisation should you prefer it, you can download from here – https://raw.githubusercontent.com/Azure/azure-docs-json-samples/master/azure-firewall/AzureFirewall.omsview – then import into Azure Monitor. The detail is great for quick glance work, I really like the ApplicationRule breakout

Application rule log data

That about sums it up. Hopefully you are now informed and equipped to troubleshoot traffic issues in your Azure Firewall instance. As always, if there are any questions, please get in touch!

If you need more info on how to enable logs – https://docs.microsoft.com/en-us/azure/firewall/tutorial-diagnostics

Log and metrics concepts – https://docs.microsoft.com/en-us/azure/firewall/logs-and-metrics

Achievement Unlocked: Azure Hero

Recently, Microsoft announced a new recognition program for those contributing to the Azure community named “Azure Heroes”. Which they describe as “A new and fun way to earn digital collectibles for meaningful impact in the technical community.”

Built via collaboration with Enjin, Azure Heroes is a blockchain based recognition program. There are several categories that have been created and are represented by digital “badgers”. These are tokenised into a digital asset on the Ethereum public blockchain.

The current categories are below:

  • Inclusive Leader (limited to 100)
  • Content Hero (limited to 250)
  • Community Hero (limited to 550)
  • Mentor (limited to 800)
  • Maker (limited to 10k)

You can read the full details of the badgers here – https://www.microsoft.com/skills/azureheroes

Last week, I was delighted to receive the Content Hero badger! You can view the asset here https://enjinx.io/eth/asset/63152039

If you feel someone you follow, or you yourself deserve a badger, get nominating! You can do it via a simple form here http://aka.ms/we.nominate

Ignite the Tour: London

Microsoft Ignite is the big tech conference every year for Azure fans. However, it’s always in the USA and as it’s a full week long it can be a challenge for many people to travel.

Thankfully, Microsoft introduced a touring version of the conference in 2018 and has continued it through this year.

Ignite the Tour kicks off in London today and runs until tomorrow evening. It should be an excellent event with many MVP and Microsoft FTE peeps presenting great sessions.

If you’re attending, I’m on a panel with some excellent and intelligent people discussing Cloud Adoption, please come and check it out! You can find it with code – THR10012

Converting Classic Azure Resources (ASM) to Azure Resource Manager (ARM) – Where to Start?

During a couple of recent client engagements, I realised there are still quite a few people using Azure Classic Resources. The simple reason for this is that they work and we all know in IT, if it ain’t broke…etc.

Now, if you only recently began working with Azure, you may never have even seen a Classic Resource. The change to Azure Resource Manager (ARM) started back in 2016. Classic Resources do have some similar objects now that they are fully in the “new” portal (yes there was an old portal, it was…odd) but the concepts are different. This is because the focus was on services, hence it being referred to as Azure Service Management (ASM). Now, Azure is focused on resources, therefore Azure Resource Manager (ARM).

Azure Cloud Services diagram
ASM Example

So if you discover, or own some classic resources and would like to move them to ARM, hopefully this blog post will help you make a start.

First, do you need to migrate away from ASM? So far, Microsoft has said no, they’re “not deprecating ASM in the near future”. However, if you note the use of “near” that could mean that as the Azure Roadmap progresses, ASM is eventually deemed redundant.

Updated 2020-02-28, Microsoft have announced deprecation of ASM VMs and dependent resources on March 1st, 2023.

The other reason for a migration may be that you want to avail of some of the features only supported in ARM or want to gain greater granular control of your individual resources. Whatever your reason, the approach is the same:

  • Plan
  • Validate
  • Prepare
  • Migrate
Plan

In this stage, you need to understand and most likely document all of the resources in scope for the migration. Once you have this, you can compare and contrast to options supported and changes required for migration to be successful. The following resources are supported for migration:

  • Virtual Machines
  • Availability Sets
  • Cloud Services with Virtual Machines
  • Storage Accounts
  • Virtual Networks
  • VPN Gateways
  • Express Route Gateways (in the same subscription as Virtual Network only)
  • Network Security Groups
  • Route Tables
  • Reserved IPs

The list for currently unsupported features is here.

While the migration operation for the most part takes place in the management plane, and there is therefore no downtime, you could proceed with a migration without impact to your users. However, I’d recommend either completing the migration out of business hours or planning a maintenance window to account for any unforeseen issues. I’d always rather have the window and not need it than the opposite!

It’s also important at this point to understand the order resources should be moved too. Microsoft have this flowchart, the lines are bit confusing but it’s accurate.

Screenshot that shows the migration steps

The most important point to take from the above is that if your VMs are all in the same vnet, well then you migrate the vnet, it will bring all dependent resources with it. Then your next step is any storage accounts.

Validate

Microsoft have provided a helpful validation operation within the portal and via Poweshell. This will give you a very quick look into your resources and their config to see if it’s compatible with a migration or not. However, it can not take into account all unsupported features, such as an issue on the ARM side, so ensure you complete your own checks!

Prepare

This is a key phase. The migration process occurs on the control plane, the data plane is unaffected at all times. The preparation operation allows you to create the ARM metadata while leaving your current ASM resources in place. This is represented in the graphic below:

Diagram of the prepare phase

Once the operation begins, the management plane is locked for the duration. This means you cannot make any changes to your resources. However, your users are unaffected. With one exception, if your VM is not in a vnet ( I know, crazy in ARM!) it will be stopped and started during Prepare.

If Prepare is successful, you can continue with the migration or, abort and fix any issues. The operation creates a non-editable Resource Group for each Cloud Service, copying the name and appending “-Migrated”. Nice and neat, relatively low impact and risk. Exactly what is useful during a migration! Remember, the management plane is locked, so if all is OK complete the migration to unlock. If not, abort to unlock.

Migrate

Once you’re good to go, the last step is to migrate. Again, this is only a control plane change. The operation is idempotent, so if it fails it can be retried. However, once complete you cannot rollback. Once complete, you will now be in the state represented below:

Diagram of commit step

The Classic Resources are removed and only the ARM versions remain. As the models are different certain services and aspects are renamed or incorporated in a different way, a full list of these mappings is here and is very useful during the Prepare phase and for future use once migrated.

All the steps required to migrate all VMs in the same vnet are documented and I’d recommend reviewing all steps before running any.

If you take your time and follow the recommended methods, you should be able to make your move from ASM to ARM pain free.

As always, if there are any questions comment or get in touch on Twitter!