The case of the Fiddler heisenbug

There is a presentation I give to our graduates during their first week with us; the second slide is:


This is taken from the multi-media overload that was U2’s Zoo TV tour. I use it to try to get our graduates to accept that they are really back at the start of their learning process. This is pretty much how I felt a week or two back when one of our consultants said that they were seeing lots of HTTP 401 authentication traffic while our application was running. I’d personally spent a lot of time over the years trying to make sure that we were as efficient as possible, so I was sceptical to say the least…


The services architecture for the product I work on follows the Command Query Responsibility Segregation (CQRS) approach, which I’ve talked about before. In summary, we fetch data from an OData service provided by WCF Data Services and then make updates via a suite of services implemented using regular SOAPy WCF. We closely monitor the message exchange between our applications and services to ensure that we aren’t too chatty, messages aren’t too big and so on. We do this using the excellent Fiddler. Many moons ago, I spent quite some time getting my head around how to correctly configure IIS and WCF to use Kerberos to allow the services to be scaled out over a web farm. By now I’ve run through this on numerous test environments and real world environments, so I was pretty confident I knew how it worked.

The Problem

Our software runs on-premise within the walled garden of the corporate network. We support some of the largest law firms in the world and so on occasion have to deal with some very wide area networks. The connection from desktop to server can take place over long distances with high latency and low bandwidth, so any messaging overhead can be painful. For years now we’ve used Fiddler to look at our services, as all the call activated services use HTTP. At one client, Fiddler was not working [which turned out to be a conflict with the McAfee software they used] and so they used Wireshark instead. When observing the HTTP traffic in Wireshark, our consultants and the client saw many HTTP 401 authentication responses, far more than we expected. Each 401 response adds latency and requires additional messages to be exchanged between the client and the server. In our testing to date, we believed we had tuned the services to require only a single 401 authentication response and then to cache and present the credentials on each subsequent request.



To stop a WCF Data Services request secured using Windows Authentication from requiring authentication on every call, you need to set the PreAuthenticate flag to true on the HttpWebRequest via the SendingRequest2 event on the generated context. Fiddler (and the Web Proxy filter in the Microsoft Message Analyzer) hides this from you because it implements a connection pool of Keep-Alive connections.


Reproducing the issue

The first task was to reproduce the behaviour inside one of our test environments. I’m fortunate to have a very well spec’d HP Z420 on my desk which is a great Hyper-V server. Inside Hyper-V I have a private domain set up which has a couple of load balanced application servers running our software. First off, I ran the client software on both Windows 7 and Windows 8.1 with Fiddler running in the background: no sign of the additional 401s. I then switched over to lower level network monitoring, but rather than using Wireshark I decided to try out the Microsoft Message Analyzer. This is Microsoft’s replacement for the Network Monitor tool. It provides a number of different filters, two of which were of interest:

  • web proxy – same deal as Fiddler, looking at HTTP
  • local link layer – all traffic on the NIC

Using the web proxy produced the same results as Fiddler; however, using the local link layer filter showed lots of additional 401 responses. When I ran the Message Analyzer with both the web proxy and the local link layer filters, there were no additional 401s. We had hit a heisenbug: when observing the HTTP traffic through a web proxy, the proxy was changing the behaviour of the traffic.

Confirm our current understanding

My faith in our current collective understanding of what was happening was pretty shaken so I ran through the various settings that I previously thought would avoid these 401s:

1. Is the URL of the service trusted? Windows must consider the service URL to be trusted to pass Kerberos tickets. An easy way to check the zone of any URL is the following code snippet:

var zone = System.Security.Policy.Zone.CreateFromUrl("");

If necessary, add the service host URL or a matching pattern to the Local Intranet Zone via IE:

In this example, * has been added to the local intranet zone.


2. Are the load balanced services running as a domain account? Does this account have an appropriate HTTP SPN registered against it?


3. Do the various IIS web applications have the useAppPoolCredentials flag set in configuration? This instructs IIS to expect the Kerberos SGT (service granting ticket) to be encrypted using the credentials of the account used by the mapped application pool, rather than the default machine account.


4. Is Kerberos configured to use a transport session rather than a connection per call for authentication? This is set in IIS against the web application using the authPersistNonNTLM setting.

This adds a Persistent-Auth header to the HTTP response (seen here using Message Analyzer):


These settings are available from within the IIS Manager using the Configuration Editor:


Navigate to the system.webServer/security/authentication/windowsAuthentication settings:


Set the properties as required. If you want to programmatically set these values via script, IIS will helpfully generate the scripts for you. Look over on the right hand side of the Configuration Editor and you’ll see a ‘Generate Script’ option.


Clicking on this will generate a change script for you in a number of technologies; I tend to favour PowerShell:
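For anyone who scripts in C# rather than PowerShell, the same change can be sketched with the Microsoft.Web.Administration API. This is a minimal sketch rather than the script IIS generated for me: it assumes a reference to Microsoft.Web.Administration.dll, an elevated process, and a location path of "Default Web Site", which is only a placeholder for wherever your services are hosted.

```csharp
using System;
using Microsoft.Web.Administration;

internal static class ConfigureWindowsAuth
{
    // Placeholder location path (an assumption); replace with your own site or application.
    private const string LocationPath = "Default Web Site";

    private static void Main()
    {
        using (var serverManager = new ServerManager())
        {
            // Open applicationHost.config and find the windowsAuthentication section
            // for the chosen site.
            Configuration config = serverManager.GetApplicationHostConfiguration();
            ConfigurationSection section = config.GetSection(
                "system.webServer/security/authentication/windowsAuthentication",
                LocationPath);

            // Point 3: decrypt Kerberos tickets with the application pool account.
            section["useAppPoolCredentials"] = true;

            // Point 4: keep the authenticated session on the connection for Kerberos.
            section["authPersistNonNTLM"] = true;

            serverManager.CommitChanges();
            Console.WriteLine("Updated windowsAuthentication for '{0}'.", LocationPath);
        }
    }
}
```

The change has to be applied on every server in the farm, so in practice something like this would be run per machine by the deployment tooling.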


All this checked out on my environment but I wanted to ensure that NTLM was not in play. To do this I enabled NTLM logging on the domain controller using group policy. Using gpedit.msc, I enabled the ‘Network Security: Restrict NTLM: Audit Incoming NTLM Traffic’ and ‘Network Security: Restrict NTLM: Audit NTLM authentication in this domain’ policies [under Windows Settings, Security Settings, Local Policies, Security Options]:


Interestingly, it showed that there was unexpected NTLM traffic, from the AppFabric services to the SQL Server. The MSSQLService was set up to run as a domain account, service.sql, but the appropriate SPN had not been mapped to that account:

> setspn -a MSSQLSvc/ service.sql

> setspn -a MSSQLSvc/SqlServer2012:1433 service.sql

I mapped both the FQDN and the NETBIOS name formats just to be sure. This resolved the issue and I no longer saw NTLM traffic.
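Going back to check 1, the one-line zone lookup can be wrapped into a small console tool so that a consultant can run it against any service URL. This is a sketch that assumes the .NET Framework on Windows; the class name and the use of a command line argument for the URL are my own choices for illustration.

```csharp
using System;
using System.Security;
using System.Security.Policy;

internal static class ZoneCheck
{
    // The checklist above expects the service URL to fall in the Local Intranet zone.
    public static bool IsIntranet(SecurityZone zone)
    {
        return zone == SecurityZone.Intranet;
    }

    private static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: ZoneCheck <service-url>");
            return 2;
        }

        // Ask Windows which security zone it maps the URL to.
        Zone zone = Zone.CreateFromUrl(args[0]);
        Console.WriteLine("{0} -> {1}", args[0], zone.SecurityZone);

        // Exit code 0 when the URL is in the Local Intranet zone, 1 otherwise.
        return IsIntranet(zone.SecurityZone) ? 0 : 1;
    }
}
```

Returning an exit code makes it easy to call from a deployment or health check script.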


What Next?

At this point I thought the environment was configured as it should be but I was still seeing the additional 401s. After a lot of searching and head scratching I came across this post from Fiddler author, Eric Lawrence. The rub being:


In some cases, the time required to open a new network connection to the server is greater than the time required to send the request and download the response. Therefore, if the client opens a new connection for every request, the application’s performance is greatly degraded. The practice of reusing a single TCP/IP connection for multiple requests is called “keep-alive” and it’s the default behaviour in HTTP/1.1. However, clients or servers may choose to disable keep-alive by either sending a Connection: close header or by abruptly closing the connection after each transaction.

Fiddler maintains a “connection pool” of idle keep-alive connections to the server. When the a client request comes in, this pool is first checked to determine if an existing connection is available on which the request can be sent. Even if the client specifies a Connection: close request header, that only causes Fiddler to close the client’s connection after the response is sent—the server connection is returned to the pool (unless it too disabled keep-alive).

What this means is that if your client isn’t using Keep-Alive connections, its performance can be severely impacted. However, when Fiddler is introduced, performance is improved because “expensive” server connections are reused. (Since Fiddler and the client are (typically) running on the same computer, establishing a new connection from the client to Fiddler is very fast.)

The fix for this problem is simple: Ensure that your client is using KeepAlive connections. That’s as simple as:

  1. Ensure that you’re using HTTP/1.1
  2. Ensure that you haven’t disabled Keep-Alive (e.g. set the KeepAlive property of the HTTPWebRequest object to true)
  3. Don’t send Connection: Close headers

Note that creating connections to servers can be even more expensive than the simple TCP/IP establishment cost. First, there’s TCP/IP Slow-Start, a congestion-management feature of the protocol that means that new connections have a slower transfer rate than longer-lived connections. Next, if you’re using HTTPS, there’s an expensive cryptographic handshake which must be performed on each new connection. Lastly, if your connections use either the NTLM or Negotiate authentication protocols, you may find that each new connection requires a 3-step handshake (e.g. the server sends a HTTP/401 challenge, the client resends the request, the server sends another HTTP/401 challenge, the client resends the request with a challenge-response, and the server finally sends a HTTP/200). Because these are “connection-oriented” authentication protocols, subsequent requests over an existing connection may be able to avoid these extra round-trips.

Here is the heisenbug: Fiddler is maintaining a Keep-Alive connection to the server even though my call may not be.
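A toy model makes the masking effect easier to see. The sketch below counts the 401 responses a server would send for a run of requests, first when every request has to authenticate from scratch and then when a proxy such as Fiddler funnels them over one pooled, already authenticated connection. The class name, method name and the one-challenge handshake are assumptions made for illustration; real traffic depends on the protocol negotiated and on the server settings.

```csharp
using System;

internal static class HandshakeModel
{
    // Counts the 401 responses sent by the server for a run of requests.
    // challengesPerHandshake: 401s needed to authenticate a fresh connection
    //   (a model parameter; the quoted post describes two for its NTLM example).
    // reuseConnection: true when a single authenticated connection serves every request.
    public static int Count401s(int requests, int challengesPerHandshake, bool reuseConnection)
    {
        int unauthorized = 0;
        bool authenticated = false;

        for (int i = 0; i < requests; i++)
        {
            if (!reuseConnection)
            {
                // A new connection always starts out unauthenticated.
                authenticated = false;
            }

            if (!authenticated)
            {
                unauthorized += challengesPerHandshake;
                authenticated = true;
            }
        }

        return unauthorized;
    }

    private static void Main()
    {
        const int requests = 10;
        Console.WriteLine("Authenticating on every call: {0} x 401", Count401s(requests, 1, false));
        Console.WriteLine("Through a pooling proxy:      {0} x 401", Count401s(requests, 1, true));
    }
}
```

In this model the proxy case only ever pays for one handshake, which is why the extra 401s vanish as soon as the traffic is observed through a web proxy.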

So how does this relate to the WCF service calls? For the basicHttpBinding, Keep-Alive is enabled by default; it can optionally be turned off via a custom binding.

Back to Basics

At this point I was still convinced I should not be seeing those additional 401s, so I decided to build a very simple secured WCF service and generate a proxy to the standard OData service we use.

Here is a WCF Service that simply says Hello to the calling Windows user.


WCF Configuration as follows:


Visual Studio created a service reference for me and I simply called the service a number of times, both reusing the proxy and closing and recreating it:


The link layer trace was as follows:


This was as expected, a single 401 but then 200s on subsequent calls. Kerberos was being used successfully and a transport level session was established! Just for completeness I could see the HTTP Keep-Alive header in the POST:



OK, on to the WCF Data Service. Again in Visual Studio I generated a service reference then:


This resulted in:


And the following trace:


At last here was the repeated 401/200 behaviour.

I checked for the Keep-Alive header in the request:


And looked for the Persistent-Auth header in the response:


Both present.

More head scratching.

More searching.

Then I posted this question to the Microsoft WCF Data Services forum.

While waiting for an answer, a colleague and I took a look at the System.Data.Services.Client.DataServiceContext base class for the generated context object. Working through that code, I came across the HttpWebRequest class, which has a PreAuthenticate property that looked like exactly what I wanted. A little more digging and I found I could do this:

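var context = new ExpertDbContext(…

context.Credentials = CredentialCache.DefaultNetworkCredentials;

// Hook every outgoing request so the flag can be set on the underlying HttpWebRequest.
context.SendingRequest2 += context_SendingRequest2;


static void context_SendingRequest2(object sender, SendingRequest2EventArgs e) {
    // Send the cached credentials up front rather than waiting for a 401 challenge.
    ((HttpWebRequestMessage)e.RequestMessage).HttpWebRequest.PreAuthenticate = true;
}
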

This was it!

With this small change in place, the 401s were gone from the WCF Data Service traffic. Just as I was grabbing a celebratory cup of coffee, a colleague asked if I had seen the response to my question on the forum. I had not; it validated the above approach. Thank you Fred Bao.
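Outside WCF Data Services, the same flag can be set directly on an HttpWebRequest. The following is a minimal sketch, assuming a Windows-authenticated endpoint whose URL is passed as the first command line argument; the class name and the three-call loop are placeholders of mine, not code from our product. Running it alongside a link layer trace should show whether the later calls still start with a 401.

```csharp
using System;
using System.Net;

internal static class PreAuthDemo
{
    // Creates a request that presents the caller's Windows credentials up front
    // once an earlier request to the same URI prefix has authenticated.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Credentials = CredentialCache.DefaultNetworkCredentials;
        request.PreAuthenticate = true;
        request.KeepAlive = true; // already the default, stated here for clarity
        return request;
    }

    private static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PreAuthDemo <service-url>");
            return 2;
        }

        for (int call = 1; call <= 3; call++)
        {
            using (var response = (HttpWebResponse)CreateRequest(args[0]).GetResponse())
            {
                Console.WriteLine(
                    "call {0}: HTTP {1}, Persistent-Auth: {2}",
                    call,
                    (int)response.StatusCode,
                    response.Headers["Persistent-Auth"] ?? "(not sent)");
            }
        }

        return 0;
    }
}
```

The 401 challenges themselves are handled inside HttpWebRequest and never surface in this code, which is exactly why a network level trace is needed to see them.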


Wrapping Up

This took about a week elapsed to work through. We’ve now updated our query service (OData) proxy to set the PreAuthenticate flag and can see improved system performance, particularly over constrained WAN connections. That Fiddler hid this really threw me; heisenbugs are really hard to deal with.


Securing WF & WCF Services using Windows Authentication

To finish off the DEV404 session Pete and I presented at TechEd NZ, I gave a brief run through of the steps required to get Windows Authentication working in a load balanced environment using kerberos. Given the number of camera phones that appeared for snaps I’m going to assume this is a common problem with a non-intuitive solution…

The product I work on is an on-premise enterprise solution that uses the Windows Identity to provide an authenticated credential against which to authorize user requests. We host our services in IIS/Windows Server AppFabric and take advantage of the Windows Authentication provided by IIS. This allows one of two protocols to be used: kerberos and NTLM, which have quite separate characteristics.

Why Use Kerberos?
There are two main reasons we want to use kerberos over NTLM:

1. Performance: NTLM uses a challenge response pattern for authentication which leads to high network utilization. During performance testing we saw a high volume of NTLM challenges which ultimately throttled our ability to serve requests. Kerberos uses tickets which can be cached, making it a better performing protocol.

2. Double hops: NTLM does not flow credentials. The canonical example is a user requesting serviceA on server1 to access a secured resource on server2. Server1 cannot flow the user’s identity to server2.

Kerberos and Load Balancing
We want to run our services within a load balanced cluster to avoid single points of failure and to be able to grow resources to meet demand as required, without having to adopt bigger tin. The default configuration of IIS does not encourage this… the Application Pools run as a local machine account. This is a significant issue for Kerberos because of the manner in which the protocol encrypts the tickets passed between client, TGS and target server. The password of the account running the service is used to encrypt tickets so that only a process running under that account can decrypt the message. The default use of a machine specific account prevents a ticket granting access to serviceX on server A also being used to access serviceX on server B.

The following steps are required to fix this:

1. Use a common domain account for the application pools.

We use a DOMAIN\ account to run our services. This domain account is granted log on as a service and log on as a batch job rights on each of the application servers.

2. Register an SPN mapping the service class to the account.

We run our services on HTTP and so register the load balancer address with the domain account used to run the services:

>setspn -a HTTP/clusteraddress serviceAccount

We are using the WCF BasicHttpBinding which does not require the client to ensure the service is running as a particular user (to prevent man in the middle attacks). If you are using any other type of binding then the client needs to state who it expects the service to be running as.

3. Configure IIS to use the application pool account rather than a machine account

system.webServer/security/authentication/windowsAuthentication useAppPoolCredentials must be set to true.

4. Configure IIS to allow kerberos authentication tokens to be cached

system.webServer/security/authentication/windowsAuthentication authPersistNonNTLM must be set to true.


5. Ensure the cluster address is considered to be in the Local Intranet zone

Kerberos tokens are not supported in the Internet zone, therefore the URL for your services must be considered to be trusted. The standard way to implement this is to roll out a group policy that adds your domain to the local intranet zone settings.

The slide deck for the talk is available from

A solution to WinRM in a NLB cluster…

I’ve written a couple of posts discussing the remoting options for PowerShell:
• fan-out model – Windows Remote Management service (WinRM)
• fan-in model – IIS hosted PowerShell endpoint (using the IIS WinRM extension)

When running load balanced WCF services in IIS that are secured using Windows Authentication, the web applications are mapped to app pools that use a domain account. This is required by kerberos to ensure that the encrypted messages can be decoded using a common set of credentials. By default, the HTTP SPN would be registered against the machine account, however this is changed to map to the domain account. This broke WinRM which is also an HTTP endpoint but runs as the network service, therefore the kerberos authentication failed because it is expected to be running under the domain account.

PowerShell supports two machine name formats when setting the Invoke-Command -ComputerName parameter: the NETBIOS name and the fully qualified domain name (FQDN). To be able to call the WinRM service and authenticate using Kerberos, you need to use the machine name format that is not used in the SPN. For example, if


is the SPN registered against the domain account used by the application pools, then

PS>icm -ComputerName -ScriptBlock { 'foo' }

will fail, however

PS>icm -ComputerName myserver -ScriptBlock { 'foo' }

will succeed. It works because the SPN must be an exact match for the machine name used (though case insensitive on Windows). If HTTP/myserver was registered, the command would fail. [I tried using the IP address too but PowerShell reports an error saying it does not support that scenario unless the IP address is in the TrustedHosts list]. This is still a little ‘magic’ and the better way to do this is to enable CredSSP in PowerShell.

This discovery removes the need to use the fan-in model, which we’ve found to be more problematic than the WinRM Windows Service:
• Cannot use the IIS:/AppPools/ path, returns no results
• Cannot use IIS:/Sites/, throws a COM exception
• AppPool identity must have ‘Generate Security Audit’ right on the machine
• Intermittent failures with the Windows Process Activation

Another recent discovery is around the effect of the NETBIOS name with IE zone security. If a resource is considered to be outside the local intranet or trusted sites zone, then Kerberos does not work: the ticket is not issued. Therefore using the FQDN requires the domain to be added to the local intranet zone sites. The NETBIOS name, however, is considered to be within the local intranet zone and therefore no amendment to the zones is required.

One last tangential gotcha… it is possible to extend the probe path that IIS uses when looking for assemblies beyond the standard bin directory.

    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
        <probing privatePath="bin;SharedBin" />
    </assemblyBinding>

However, files not in the normal bin directory are not shadow copied, so you can get file locking that you don’t expect when updating the files, in the case above those in SharedBin.

Configuration for Kerberos

This is a summary of the voodoo required to get WCF services hosted in IIS to work with a load balancer and kerberos. This took me way longer than I had hoped to figure out so I hope I can save someone else that pain.

We have recently been running some load and stress tests against our latest Golden Gate SP1 product which supports the horizontal scale out of workflow services. This scale out capability is one of the core features of Windows Server AppFabric. Our software is designed to run in an ‘on premise’ scenario and leverages Windows integrated security for authorization of users. A major performance improvement we discovered during our original Golden Gate testing was to ensure kerberos was used rather than NTLM when performing Windows Authentication. We wanted to ensure that our new services were using kerberos for Windows authentication since we had moved some of our services from being hosted as a Windows Service to being hosted in IIS, in particular the workflow services.

Note: in addition to performance advantages, you need to use Kerberos if you want to achieve multi-hop delegation of credentials, as NTLM does not support this. The resources at the end of this post discuss this further.

In this post I’m going to walk through a worked example and give a checklist to follow. In a later post I may drill down into a little more of the background, in the meantime I’ll include some additional resources at the end.

The scenario involves three application servers that are configured into a network load balanced (NLB) cluster using NLB in Windows Server 2008. The machine names are:

The virtual host name for the NLB is

The NLB is set up to load balance traffic on port 80 for our HTTP based services and on the port range 18180-18199 for our Windows Services. Each of the servers runs all of the services that we support horizontal scale out for, and one of the servers (310) runs the services that only support a single instance. In a typical installation we have around 15 services; rather than list out all of these I’ll concentrate on two types:
• services hosted in IIS that expose HTTP endpoints
• services hosted as Windows Services that expose net.tcp endpoints

Alongside the three application servers is a database server that hosts the ADERANT Expert database, the AppFabric monitoring database and the AppFabric workflow persistence database.

The basicHttpBinding configuration used to enable Windows authentication is as follows:

        <binding name="expertBasicHttpBinding" maxReceivedMessageSize="2147483647">
          <readerQuotas maxArrayLength="2147483647" maxStringContentLength="2147483647" />
          <security mode="TransportCredentialOnly">
            <transport clientCredentialType="Windows" proxyCredentialType="Windows">
              <extendedProtectionPolicy policyEnforcement="Never" />
            </transport>
          </security>
        </binding>

1. The servers must be in the local intranet zone of any calling machines.
As of Windows Server 2003, by default only the local intranet zone supports the passing of credentials for Windows Integrated authentication between machines. This makes sense as you rarely want to pass your Windows credentials beyond your own domain. At ADERANT we have a group policy set-up so that all machines have any machine with a name matching * registered in the local intranet zone.

You can explicitly name the servers for the zone; also ensure that the servers are not listed in the Trusted Sites zone.

2. Windows Services exposing WCF net.tcp endpoints must have SPNs registered for both the application server and the network load balancer addresses.

When a non-basicHttpBinding is used, such as net.tcp, the WCF infrastructure checks to ensure that the service is running under the identity that the client expects. This prevents ‘man-in-the-middle’ attacks where someone spoofs the service you want to call with their own for some nefarious purpose. When you generate a service proxy against a net.tcp endpoint you’ll see something similar to the following configuration snippet in the app.config:

      <servicePrincipalName value="CalculatorSvc/" />

There is an identity element that specifies the expected identity of the service host. Two different options are supported: userPrincipalName and servicePrincipalName. If your service is published on a domain and you always expect the client calling the service to be online, then the userPrincipalName is easiest to configure. The value attribute contains the identity that the service is running as, e.g. value=“ADERANT_AP\”.

Alternatively you can set a servicePrincipalName, as above. The service principal name (SPN) is broken down into three parts:

serviceClassName / address [: portNumber]

The service class name is a token that uniquely represents the service. Common service classes are HTTP and HOST; the example above uses CalculatorSvc to uniquely identify a calculation service. At ADERANT we use class names such as ExpertConfigurationSvc. After the service class name comes the machine name, e.g. SVEXPGG310. Note that the NetBIOS name and the fully qualified domain name are considered to be different; it is commonplace to register both. For example:


Once we have an SPN, it must be registered in Active Directory (AD) against the user account used to run the service. We recommend a service account along the lines of myDomain\ to run the ADERANT services. To register this account with an SPN there is a command line tool setspn:

setspn -A ExpertConfigurationSvc/

As part of our deployment tooling we automatically generate a batch file containing all the SPNs that need to be registered in AD for a given environment. An SPN must not be registered twice, as this will cause errors. To see the SPNs currently registered against a user, use the setspn tool with the -L option, passing the account name:

setspn -L

If we take our configuration service as an example, we need the following SPNs registered in AD for the scenario environment:


If you are running a development workstation, you will often see HOST/localhost as the SPN generated by the svcutil for locally hosted WCF services. This indicates that the service is expected to be running on the local machine.

If the service needs to support delegation then the AD account used to run the service must have this enabled:

The account must also be granted ‘Log on as a service’ rights on the application server hosting the service. This can be set-up using the local machine policies admin tool or pushed out via group policy.
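The expected identity from the configuration snippet above can also be stated in code when the client builds its endpoint address. This is a hedged sketch using EndpointIdentity.CreateSpnIdentity from System.ServiceModel; the net.tcp address, the service path and the SPN value are hypothetical and only reuse names from this scenario for illustration.

```csharp
using System;
using System.ServiceModel;

internal static class ExpectedIdentity
{
    // Builds a client endpoint address that states which SPN the service is
    // expected to be running under, the code equivalent of <servicePrincipalName>.
    public static EndpointAddress WithSpn(string serviceUrl, string spn)
    {
        return new EndpointAddress(
            new Uri(serviceUrl),
            EndpointIdentity.CreateSpnIdentity(spn));
    }

    private static void Main()
    {
        // Hypothetical address and SPN, for illustration only.
        EndpointAddress address = WithSpn(
            "net.tcp://svexpgg310:18180/CalculatorService",
            "CalculatorSvc/svexpgg310");

        Console.WriteLine(address.Uri);
        Console.WriteLine(address.Identity);
    }
}
```

EndpointIdentity.CreateUpnIdentity is the matching call for the userPrincipalName option.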

3. Load balanced WCF Services hosted in IIS, using HTTP bindings, must have HTTP SPNs added for the account of the application pool.

By default an SPN is created in AD for the machine account of a server running IIS, for example HTTP/SVEXPGG310. In a load balanced scenario the machine account SPN cannot be used to issue a kerberos ticket because it is different for each machine in the application farm. Instead the kerberos ticket needs to be issued using the identity of the application pool that the web service is running under. If you have multiple application pools, these must all be running under the same account. The application pool account must have SPNs registered for the HTTP service as follows:

setspn -A HTTP/
setspn -A HTTP/svnlb301
setspn -A HTTP/
setspn -A HTTP/svexpgg310
setspn -A HTTP/
setspn -A HTTP/svexpgg311
setspn -A HTTP/
setspn -A HTTP/svexpgg312

Here we have both the NetBIOS and FQDNs for the servers and the load balancer.

4. Load balanced WCF services hosted in IIS, using HTTP bindings, must use the Application Pool credentials to issue kerberos tickets.

In addition to adding the SPNs in 3, now change IIS so that it uses the app pool credentials for the kerberos ticket. This can be done either through the configuration manager in IIS or from the command line.

The obscured section path is system.webServer/security/authentication/windowsAuthentication.
From a command line:
appcmd set config /section:windowsAuthentication /useAppPoolCredentials:true

This has to be set on all of the application servers within the application farm.

While in IIS configuration, it is also worth setting authPersistNonNTLM to true.

5. Enable Windows Authentication on the required web applications in IIS.
There are two parts to this, the first of which is to ensure that the Windows Authentication provider for IIS is installed. This can be checked in the Windows features control panel.

The next step is to enable Windows Authentication on the website itself. From the dashboard for the site, open the Authentication manager and then ensure that Windows Authentication is enabled:

While you are here, it’s worth checking the advanced properties of the Windows Authentication (available from the context menu) to ensure that Kernel-mode authentication is set.

This can also be set programmatically:

appcmd set config "Default Web Site/MyWebService" -section:system.webServer/security/authentication/windowsAuthentication /enabled:true /commit:apphost

Wrap up & Testing
Those are the key steps required to get kerberos working in a load balanced environment:
1. ensure the servers are in the local intranet zone.
2. create and register SPNs for net.tcp services for all app servers and the load balancer.
3. create and register HTTP SPNs for all app servers and the load balancer.
4. take care to avoid duplicate SPNs.
5. understand that NetBIOS and FQDNs require separate SPNs.
6. set useAppPoolCredentials to true on all IIS servers in the app farm.
7. run all application pools using a common domain service account, give this account permission to delegate and log on as a service.
8. ensure the web applications for the services have Windows authentication enabled.

It’s mostly straightforward once you’ve been through the steps once.
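Steps 2 to 5 lend themselves to tooling. The sketch below shows one way a deployment tool might build the SPN list: it emits the FQDN and the NetBIOS form for each host and drops duplicates case-insensitively before printing setspn commands. It is not our actual tooling; the DNS suffix example.local and the account MYDOMAIN\service.account are placeholders, and only the host names come from the scenario above.

```csharp
using System;
using System.Collections.Generic;

internal static class SpnPlanner
{
    // Builds "serviceClass/host[:port]" for the FQDN and the NetBIOS name of each host.
    // Duplicates are dropped case-insensitively because an SPN must only be registered once.
    public static List<string> BuildSpns(
        string serviceClass, IEnumerable<string> netBiosNames, string dnsSuffix, int? port = null)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var spns = new List<string>();

        foreach (string name in netBiosNames)
        {
            // The FQDN and the NetBIOS name are treated as different SPNs, so emit both.
            foreach (string host in new[] { name + "." + dnsSuffix, name })
            {
                string spn = serviceClass + "/" + host + (port.HasValue ? ":" + port.Value : "");
                if (seen.Add(spn))
                {
                    spns.Add(spn);
                }
            }
        }

        return spns;
    }

    private static void Main()
    {
        // Host names are from the scenario; the DNS suffix and account are placeholders.
        var hosts = new[] { "svnlb301", "svexpgg310", "svexpgg311", "svexpgg312" };

        foreach (string spn in BuildSpns("HTTP", hosts, "example.local"))
        {
            Console.WriteLine("setspn -A {0} MYDOMAIN\\service.account", spn);
        }
    }
}
```

Passing a port, for example BuildSpns("MSSQLSvc", new[] { "SqlServer2012" }, "example.local", 1433), produces the host:port form used for SQL Server.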

The easiest tool to test with is a browser and Fiddler. From within Fiddler you can look at the authorization headers for the HTTP requests, which will show you if Kerberos or NTLM is used. We expose an OData service which requires Windows authentication, and it was very easy to trace the authentication negotiation going on for this site within Fiddler.

Security in WCF (MSDN Magazine):

Patterns & Practices Kerberos Overview:

Patterns & Practices WCF Security Guide: