Service Deployment

Deployment is one of those tasks that is often left until late in the development lifecycle, even though it is a non-trivial problem. The adoption of continuous integration as part of an agile approach encourages the deployment aspects to be undertaken alongside the development, so that at the end of each sprint the stakeholder has an installable piece of software delivered. When creating a service-oriented architecture the deployment problem increases in complexity. Gone are the days of a SQL script for the database server and an installer for the client machine. Now there are often tens of servers interacting in a medium scale solution, often in a web or application server farm to provide both resilience and scale out capabilities.

Almost two years ago I took a step back and looked at how we were deploying software. We were installing early versions of the Golden Gate software onto customer sites and experiencing a lot of teething problems getting the system running. Often the problems were due to the servers not having the required pre-requisites installed, such as the .NET framework, or not having the correct services running, and so on. In an attempt to document the installation process we ended up with an installation guide that was rapidly approaching 100 pages. There had to be a better way…

Environment and Role Manifests
I’m on occasion reminded that I’m primarily paid to think, so I took a deep breath and started to think about the problem. What would the ideal situation be? The first, and in many ways the biggest, realization was that we wanted to treat the deployment of the whole system as a unit of work. We wanted to allow an administrator to define where they wanted our software to be deployed into their site and then simply click ‘go’.

The definition of the system would include a list of the servers they wanted to use and the roles they wanted each server to perform. Windows Server has the concept of a role: when setting up a new installation you choose what you want the server to do. Is it the active directory controller, an application server, a file server or a web server? Depending upon which roles you allocate, different features are available. Some roles are incompatible on the same server; some roles are dependent upon other roles being satisfied by other servers. The role concept was something we also required as we had a number of different server components: configuration, security, workflow, messaging and application services. Each component was a unit of deployment. A server could be allocated the workflow role, for example, which contained a number of services such as instance management and task management. We did not want to have to walk or remote onto each server and perform an installation; we wanted a central process to co-ordinate and manage the installation across all of the servers.

We needed a collective term for the definition of a complete deployment and in the end I chose the term environment. This came from my days working for an internet bank where we had a strictly defined set of staging platforms (environments) that code had to work its way through on the way to production: integration test, system test, user acceptance test and pre-production. The environment is the root level object in a system deployment and contains information such as the environment name, the list of servers to install to, common file locations such as the install directory and others. A firm is expected to have multiple environments, as a minimum: development, test and production/live.

The concepts of the environment and the role are similar to the two manifests that ClickOnce uses to control client installations: the deployment manifest and the application manifest. The deployment manifest is owned by the company that is running the software and includes information specific to them, such as the installation URL. The application manifest is owned by the company that authored the software and includes all of the files required on the client to run the software (amongst other details). In fact I drew a lot of inspiration from ClickOnce; what we wanted was a ClickOnce mechanism for the server deployment. ClickOnce is driven from the two XML manifest files that declare what is required; these are given to the ClickOnce engine to action and the deployment takes place. I’m a big fan of both declarative programming and modeling, so I wanted a deployment model that could be actioned. This was 12 months before all the excitement around Oslo and DSLs flared up (and then died down again). We had seen that both WPF and WF worked well as XAML driven runtimes (in .NET 3.X) and so the basic concepts of a deployment model and runtime took shape.

In summary an environment contains a mapping of servers to roles. A role represents an installable server component. Both the environment and role details are captured as manifest files which can be described in XML.

Environment Manifest
The environment manifest is quite simple and most easily explained with an example:

<environment name="Local">
  <expertDatabaseServer serverName="" serverInstance="">
    <databaseConnection databaseName="Expert"
                        password="eo4G3S2KLO05EzgQb3Q==" />
  </expertDatabaseServer>
  <server name=""
          skipPrerequisitesCheck="false" servicesWebsite="Default Web Site">
    <role type="configuration"/>
    <role type="customworkflows"/>
    <role type="employeeIntake"/>
    <role type="fileopening"/>
    <role type="identity"/>
    <role type="messaging"/>
    <role type="queryservice"/>
    <role type="security"/>
    <role type="workflow">
      <roleParameter name="defaultSmtpHost" value="" />
      <roleParameter name="defaultSmtpPort" value="25" />
      <roleParameter name="defaultFromEmailAddress" value="" />
    </role>
  </server>
</environment>

This example manifest captures the environment details specific to the installing firm such as the server names, database details, installation source and so on. In this simple example only one application server is specified for brevity, which runs all of the roles. In reality there would be multiple servers listed each running the roles in a load balanced configuration.

Role Manifest
A role manifest defines the pre-requisites, the files and the services deployed as a unit.

Prerequisite Checking
As mentioned, the first problem we hit during a deployment was pre-requisites. How could we be sure that a server was capable of running our software? There were a number of aspects to this:
• was a supported OS installed
• were the correct operating system components installed
• were third party dependencies met
• were the correct supporting services running
• were the components correctly configured

The pre-requisites vary by component, so in the role definition we have a section of checks that must all pass before the deployment can proceed. One of the first examples we saw was that the Microsoft Distributed Transaction Co-ordinator (MSDTC) was not enabled on many of the servers. Where it was enabled, the configuration was incorrect and the machine would not accept remote transactions. For Windows Services, the service control manager (SCM) can be queried to find the state of a service, and the registry contains the configuration keys for the component settings.

The big problem here was the poor support for remote processes in Windows; coming from a UNIX background, it has always frustrated me. At the time Powershell v1 was full of promise but it did not support remote sessions; that was coming in v2, which was still a CTP and did not look like it would be ready in time. While a number of shell commands have built-in support for running against a remote machine, there were enough gaps, version incompatibilities between Windows Server 2003 and 2008, and performance issues that in the end I wrote a Windows service to perform the checking. Using an xcopy deployment and the SC command it is possible to remotely deploy, register and start a Windows service. This service accepts a list of pre-requisites to check and returns a list of results: pass or fail. The pre-requisites required by a role are defined within the role manifest; examples are:

 description="MSDTC configured to allow remote access." />

      description="Ensure Windows Remote Management (WS-Management) service is available" />
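The contract of the checking service (a list of pre-requisites in, a list of pass or fail results out) can be sketched in a few lines. This is only an illustration in Python, not the Windows service itself: the names Prerequisite, CheckResult, run_checks and can_deploy are hypothetical, and the two checks are stand-in lambdas rather than real SCM or registry queries.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prerequisite:
    description: str
    check: Callable[[], bool]  # returns True when the pre-requisite is met


@dataclass
class CheckResult:
    description: str
    passed: bool


def run_checks(prerequisites: List[Prerequisite]) -> List[CheckResult]:
    """Run every check and report pass or fail for each one."""
    results = []
    for prerequisite in prerequisites:
        try:
            passed = bool(prerequisite.check())
        except Exception:
            # A check that blows up is reported as a failure, not a crash.
            passed = False
        results.append(CheckResult(prerequisite.description, passed))
    return results


def can_deploy(results: List[CheckResult]) -> bool:
    """All checks must pass before the deployment can proceed."""
    return all(result.passed for result in results)


# Stand-in checks: real ones would query the SCM or the registry.
checks = [
    Prerequisite("MSDTC configured to allow remote access.", lambda: True),
    Prerequisite("Ensure Windows Remote Management (WS-Management) service is available",
                 lambda: False),
]
results = run_checks(checks)
for result in results:
    print("PASS" if result.passed else "FAIL", "-", result.description)
```

Running every check rather than stopping at the first failure means an administrator sees the full list of problems on a server in one pass.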

Required Files
A role contains a list of the files required to be installed on the server and where the files need to go. An installation of Expert has a root directory specified by the installing administrator and then the structure is fixed under that:

Each file to be copied is captured in a files section in the role manifest, an example is:

    <file   filename="Aderant.Framework.Notes.dll"
            targetRelativePath="LegacyServices" />
    <file   filename="Aderant.Framework.Notes.Presentation.dll"
            targetRelativePath="LegacyServices" />
    <file   filename="Aderant.Framework.Notes.Services.dll"
            targetRelativePath="LegacyServices" />

In order to be flexible, the file specification allows the source and target paths to be specified as well as the source and target filenames. This allows us to perform any manipulation of the file structure that we need to.
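To make the path handling concrete, here is a small Python sketch of how a file entry might be resolved into a source and target location. The attributes filename and targetRelativePath come from the manifest above; source_relative_path and target_filename, the function name resolve_copy and the example roots C:\Drop and C:\Expert are assumptions made for illustration only.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath
from typing import Optional, Tuple


@dataclass
class FileSpec:
    filename: str
    target_relative_path: str = ""
    source_relative_path: str = ""          # hypothetical attribute
    target_filename: Optional[str] = None   # hypothetical attribute (rename on copy)


def resolve_copy(spec: FileSpec, source_root: str,
                 install_root: str) -> Tuple[PureWindowsPath, PureWindowsPath]:
    """Work out where a file comes from and where it should be copied to."""
    source = PureWindowsPath(source_root, spec.source_relative_path, spec.filename)
    target = PureWindowsPath(install_root, spec.target_relative_path,
                             spec.target_filename or spec.filename)
    return source, target


spec = FileSpec("Aderant.Framework.Notes.dll", target_relative_path="LegacyServices")
source, target = resolve_copy(spec, r"C:\Drop", r"C:\Expert")
print(source, "->", target)
```

PureWindowsPath is used so the sketch behaves the same on any operating system; it only computes paths and never touches the disk.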

In Golden Gate SP1 we support hosting services either as Windows Services under the SCM or in IIS under AppFabric. We are in the process of moving all of our services to AppFabric/IIS; however, this is not yet complete. Therefore a role manifest may contain a section for Windows Services:

<serviceHost exeName="Expert.Notes.Service"
             displayName="ADERANT Notes Services ({{Name}} instance)"
             description="Host for Notes Services for the {{Name}} environment.">
    <service name="Notes"
             serviceName="ADERANT Notes Service"
             port="[[notesServicePort]]" />
</serviceHost>

and AppFabric hosted services:

      <applicationPool name="[[workflowApplicationPool]]"
                       netVersion="V4.0" />
        allowWindowsAuthentication="true" />

In both cases, the information required to create and host a service is provided. For Windows based services we have a reusable service host exe; AppFabric extends IIS and WAS to provide the hosting.

Deployment Engine
Up to this point we have really been looking at the deployment model and how it is captured in the two manifests. These manifests are just an XML serialization of a deployment model. When we load an environment we just map from the XML into an in-memory object graph of the environment. We now need something to action the model, and this is the deployment engine.
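Mapping the XML into an object graph is straightforward. As a rough sketch in Python (the real loader is C#, and its classes are not shown here), the following reads a cut-down environment manifest into plain objects. The element and attribute names come from the example manifest; the class names, the load_environment function and the server name APP01 are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Cut-down manifest; APP01 is a made-up server name.
SAMPLE = """
<environment name="Local">
  <server name="APP01" skipPrerequisitesCheck="false">
    <role type="configuration"/>
    <role type="workflow">
      <roleParameter name="defaultSmtpPort" value="25"/>
    </role>
  </server>
</environment>
"""


@dataclass
class Role:
    type: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Server:
    name: str
    roles: List[Role] = field(default_factory=list)


@dataclass
class Environment:
    name: str
    servers: List[Server] = field(default_factory=list)


def load_environment(xml_text: str) -> Environment:
    """Map the manifest XML into an in-memory Environment object."""
    root = ET.fromstring(xml_text.strip())
    environment = Environment(name=root.get("name", ""))
    for server_element in root.iter("server"):
        server = Server(name=server_element.get("name", ""))
        for role_element in server_element.findall("role"):
            parameters = {
                p.get("name", ""): p.get("value", "")
                for p in role_element.findall("roleParameter")
            }
            server.roles.append(Role(role_element.get("type", ""), parameters))
        environment.servers.append(server)
    return environment


environment = load_environment(SAMPLE)
for server in environment.servers:
    print(server.name, [role.type for role in server.roles])
```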

The deployment engine itself is the coordinator that executes a number of deployment actions. A deployment action performs a piece of work required in a deployment, its interface is as follows:

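namespace Aderant.Framework.Deployment.Actions {
    public interface IDeploymentAction: IDeploymentMessage {
        // Install this action's part of the environment.
        void Deploy(Environment environment);
        // Remove this action's part of the environment.
        void Clean(Environment environment);
        // Check that this action's part of the environment is still in place.
        void Validate(Environment environment);
    }
}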

The deployment engine supports a set of actions that can be performed on an environment. The three key actions are: deploy, remove (clean in the interface) and validate. When the deployment engine is asked to perform a ‘deploy’, it asks each of the deployment actions in turn to ‘deploy’. We have a library of around 30 deployment actions, examples are:

• AppFabricHostingAction
• FileDeploymentAction
• LoadBalancingConfigurationAction
• ServiceHostBuilderAction
• SQLScriptRunnerAction

Each action in turn knows how to deploy, remove and validate its role in a deployment. The validate action is very important: it allows an administrator to check whether a pre-installed environment still meets the pre-requisites, still has the required files in place and has the required services up and running. For example, it allows an administrator to easily see that a registry setting is no longer correctly set. The deployment actions in turn rely on a set of controller classes that interact with external components such as AppFabric, the file system, the Windows service manager, MSMQ and others. The separation of the controllers from the deployment actions also allows a high degree of code re-use as well as better unit testing.
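The coordinator pattern described above is small enough to sketch. The version below is Python rather than the C# of the real engine, and it is only an outline of the idea: DeploymentAction mirrors the three verbs on IDeploymentAction, while RecordingAction, DeploymentEngine.run and the simple in-order execution are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class DeploymentAction(ABC):
    """Python analogue of the three verbs on IDeploymentAction."""

    @abstractmethod
    def deploy(self, environment): ...

    @abstractmethod
    def clean(self, environment): ...

    @abstractmethod
    def validate(self, environment): ...


class RecordingAction(DeploymentAction):
    """Hypothetical stand-in that records calls instead of touching a server."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def deploy(self, environment):
        self.log.append(("deploy", self.name, environment))

    def clean(self, environment):
        self.log.append(("clean", self.name, environment))

    def validate(self, environment):
        self.log.append(("validate", self.name, environment))


class DeploymentEngine:
    """Coordinator: asks each action in turn to perform the requested verb."""

    VERBS = ("deploy", "clean", "validate")

    def __init__(self, actions):
        self.actions = list(actions)

    def run(self, verb, environment):
        if verb not in self.VERBS:
            raise ValueError(f"unknown verb: {verb}")
        for action in self.actions:
            getattr(action, verb)(environment)


log = []
engine = DeploymentEngine([
    RecordingAction("FileDeploymentAction", log),
    RecordingAction("ServiceHostBuilderAction", log),
])
engine.run("deploy", "Local")
engine.run("validate", "Local")
for entry in log:
    print(entry)
```

Because the engine only knows about the three verbs, a new action can be added to the library without changing the coordinator, and a recording stand-in like the one here is enough to unit test the control flow.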

While the deployment engine is currently C# code, it would be relatively easy to move it to a workflow. The deployment engine is a coordinator and therefore the control flow would be quite naturally captured as a workflow. The deployment actions would become an activity library.

As it stands the deployment engine is a command line utility, however it does have a WPF UI that calls through to it (in a very similar model to AppFabric calling the Powershell API from the IIS Manager add-in).

The environment manifest in the screenshot above shows a small load balanced environment being used to host multiple instances of our services.

The declarative deployment model and runtime is a good candidate for a DSL. In fact we prototyped a visual DSL using the Visual DSL toolkit for Visual Studio. This allowed an administrator to literally draw out the deployment diagram for an environment, which was then transformed via a T4 template into an environment XML file. This could then be executed via the deployment engine and used to deploy a full system.

Hunting Zombies (orphaned IIS Web Applications)

Following on from the previous post, it’s time to look at one of the more sensitive areas of AppFabric… the IIS configuration.

When you run many of the AppFabric configuration commands via Powershell or the IIS Manager, the result is a change to a web.config file. IIS configuration is hierarchical with settings being inherited from parent nodes as we saw with connection strings. The implication of this is that when determining the correct settings for a web application, a series of configuration files are parsed. An error in any one of these configuration files can lead to a broken system. The event logs mentioned in the previous post are a good place to look for these errors, the offending configuration files will often be named in the log entry.

[Update: AppFabric has a one time inheritance model for its configuration, if you choose to provide a configuration setting at a node then this overrides the configuration set at a parent node. The scope / granularity of this is all AppFabric config. Microsoft tried to provide a merged inheritance model but it is a non-trivial problem and did not make v1.]

A common issue on a development workstation is the configuration getting left behind due to poor housekeeping. For example, you map a folder into IIS as a web application, this folder contains other subfolders which in turn are also mapped as web applications. If you remove the parent web application without first removing the child applications then the child configuration remains. It cannot be seen via IIS Manager as there is no way to reach it, however you can easily see it through Powershell. One of the many awesome features in Powershell is the provider model which allows any hierarchical system to be navigated in a consistent way. The canonical example is the file system, we are all used to: cd, dir, etc to navigate around. Well, these same commands (which are actually aliases in Powershell to standard verb-noun commands) can be used to navigate other hierarchies, for example IIS.

From a Powershell console running with elevated status (run as Admin), you can do the following:

First you need to add the IIS Management module to the session:

PS> import-module WebAdministration

You can then navigate the IIS structure by changing the ‘drive’ to be IIS:

> IIS:
> ls

Both the dir and ls commands are mapped to the get-childitem Powershell command via an alias, providing a standard Windows console or UNIX console experience. Listing the children at the root level gives us access to the application pools, web sites and SSL bindings. Following through the example above, we navigate to the default web site and then list all of its children. In my case this maps exactly to what is shown in IIS:

Hunting Zombies
So, let’s make some zombies…

I created a new folder C:\ZombieParent and added two sub folders, ZombieChild1 and ZombieChild2. I then mapped the parent folder to a web application called Zombies and converted the two sub folders also to web applications. Re-running the get-childitem commands now shows:

You can see the three web applications at the end of the list, in IIS Manager we have:

Let’s now remove the parent Zombies web application:

In IIS Manager we no longer see the ZombieChild1 or ZombieChild2 web applications, though we can still see them via Powershell.

This can be the source of many weird and wonderful errors when working with AppFabric as it tries to parse configuration for zombie web applications. If you are getting strange behavior it is well worth launching a Powershell console and going on a zombie hunt. The web applications left behind can be removed via the console:

Powershell can be a sensitive soul…

I’ll mention another gotcha that tripped me up… case sensitivity. IIS allows you to promote a physical path, to a virtual directory, to a web application. E.g.

> cd \inetpub\wwwroot\
> mkdir test
> IIS:
> cd '\Sites\Default Web Site'
> dir

directory test C:\inetpub\wwwroot\test

> new-webvirtualdirectory test -physicalpath 'c:\inetpub\wwwroot\test'
> dir

virtualDirectory test C:\inetpub\wwwroot\test

> remove-webvirtualdirectory test
> dir

directory test C:\inetpub\wwwroot\test

However if the case of the directory/virtual directory/web application does not match exactly then you get the following behavior:

> import-module WebAdministration
> cd \inetpub\wwwroot\
> mkdir test
> IIS:
> cd '\Sites\Default Web Site'
> dir

directory test C:\inetpub\wwwroot\test

> new-webvirtualdirectory Test -physicalpath 'c:\inetpub\wwwroot\test'
> dir

directory test C:\inetpub\wwwroot\test
virtualDirectory Test C:\inetpub\wwwroot\test

> remove-webvirtualdirectory Test
> dir

directory test C:\inetpub\wwwroot\test
virtualDirectory Test C:\inetpub\wwwroot\test

Here we created a new physical directory under the wwwroot folder and then mapped a virtual directory to this location but used a name of Test rather than test. When we run get-childitem we see two entries: ‘test’ for the physical path and ‘Test’ for the virtual directory. Then we remove the virtual directory, but it is not deleted and no error is reported.

This caused a heap of confusion for me when automating our deployments so beware of case! This has been raised with Microsoft as an issue. I found that the ConvertTo-WebApplication cmdlet worked for my needs without the case issues.
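One defensive option in an automated deployment is to look for names that differ only by case before creating anything. The helper below is a generic Python sketch of that check, not something taken from our deployment code; the function name case_mismatches is hypothetical and the list of existing names would have to come from whatever lists the site's children.

```python
def case_mismatches(existing_names, requested_name):
    """Return existing names that match requested_name in everything but case."""
    folded = requested_name.casefold()
    return [
        name for name in existing_names
        if name.casefold() == folded and name != requested_name
    ]


existing = ["aspnet_client", "test"]
conflicts = case_mismatches(existing, "Test")
if conflicts:
    print("Name differs only by case from:", conflicts)
```

A deployment script could then reuse the existing spelling, or stop with an error, rather than create a second entry that cannot be cleanly removed.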

How to diagnose errors in AppFabric monitoring configuration

It wasn’t the best Friday, my external hard drive died taking my work iTunes library with it and I wasn’t having much fun with AppFabric either. The dashboard showed no data and the Windows application event log kept filling up with login errors. Looking back, the afternoon was useful since I learned that little bit more about AppFabric though I didn’t get any ‘real’ work done.

I started off reading this: before getting stuck in.

AppFabric has two data stores: a monitoring store and a workflow persistence store. These stores are paired with two Windows services, an event collection service paired with the monitoring store and a workflow management service paired with the workflow persistence store.

Let’s start with the event collection service and monitoring store. This service is responsible for capturing the WF and WCF events emitted by services hosted in IIS/WAS and storing them in the monitoring store. These events are used to populate the dashboard that is integrated into IIS Manager. To enable capture of events you can use the ‘Manage WF and WCF Services | Configure…’ option in the web application context menu or the Powershell commands Set-ASAppMonitoring and Start-ASAppMonitoring. For help on these commands call get-help, e.g. ‘get-help Set-ASAppMonitoring’, from a Powershell command line.

When you set up monitoring you need to provide a connection string name and set the monitoring level. As a minimum, the level needs to be set to Health Monitoring to populate the AppFabric dashboard. Below this are the levels Off and Errors Only, which are self-explanatory. Above this level are End-to-End Monitoring and Troubleshooting, both of which capture additional information. End-to-End Monitoring adds a header into WCF traffic to allow a logical call sequence to be followed. When a WCF service calls another WCF service the header is flowed across the call, providing a correlation token to query by. Note that the capture levels are cumulative: a higher level setting includes all of the events from the settings below. The higher the setting, the greater the impact on the performance of the system, as more resources are required to capture and log the monitored events. For day to day operations Health Monitoring is recommended, with the more verbose options used when required to aid troubleshooting.

The connection string is a named connection string value, set as a property of the web application (or one of its ancestors). The Connection Strings page is available from the ASP.NET section of the Features View for the web application.
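The cumulative nature of the monitoring levels is easy to picture as an ordered enumeration. The Python sketch below uses the level names from the text above; the numeric values and the helper names are arbitrary choices for the example and are not AppFabric's own representation.

```python
from enum import IntEnum


class MonitoringLevel(IntEnum):
    # Ordered from least to most verbose; the numbers are illustrative only.
    OFF = 0
    ERRORS_ONLY = 1
    HEALTH_MONITORING = 2
    END_TO_END_MONITORING = 3
    TROUBLESHOOTING = 4


def includes(configured: MonitoringLevel, required: MonitoringLevel) -> bool:
    """Levels are cumulative: a higher level captures everything a lower one does."""
    return configured >= required


def populates_dashboard(level: MonitoringLevel) -> bool:
    """The dashboard needs at least Health Monitoring."""
    return includes(level, MonitoringLevel.HEALTH_MONITORING)


for level in MonitoringLevel:
    print(level.name, "populates dashboard:", populates_dashboard(level))
```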

Clicking on the Connection Strings option brings up the following:

Note that IIS configuration is hierarchical, the connection strings available to the Magic8Ball web application are both inherited which means they are defined at a higher node in the tree. In this case the strings are defined in the machine web.config found at %SystemDrive%\Windows\Microsoft.NET\Framework64\v4.0.30128\Config (I’m using 64-bit Windows and .NET 4.0 RC). When installing AppFabric the default connection strings are written into the machine level web.config. In my case, both connection strings are set-up to use integrated security.

The event collection service is a Windows Service and so is managed through the services administration snap-in, services.msc. To help set up integrated security from Windows through to SQL Server, I run the services under a domain account. Note that if you plan to use a machine that is not always on a domain, you need to use a local machine account.

This account needs to have login rights to the SQL Server and should be mapped to the ASMonitoringDbWriter role. In my case I’ve mapped the user to all three roles set up in the monitoring store.

There are four Jobs managed by the SQL Agent that are used to populate and manage the tables in the monitoring database. These are:

The SQL Server Agent must be running for the tables to be populated. The Import*Events jobs run every 10 seconds by default; if they are not correctly set up your application event log soon fills up with errors and warnings (as I found). These jobs call stored procedures defined in the monitoring database (ASImportTransferEvents, ASImportWcfEvents and ASImportWFEvents) and run as the AS_MonitoringDbJobsAdmin. The AutoPurge job is scheduled to run once every minute and calls the ASAutoPurge stored procedure. These stored procedures in turn call ASInternal_* versions of themselves and you can drill into the SQL to see exactly what they do. To housekeep the monitoring database you can use the Clear-ASMonitoringSqlDatabase command. Another option is to move the events to an archive database so that the queries feeding the dashboard remain responsive, see Set-ASMonitoringSqlDatabaseArchiveConfiguration. The archive database can then be managed as per any audit requirements you may have.

To monitor the SQL Agent jobs, you can use the Job Activity Monitor:

The Windows Event Viewer is a great help in tracking down the cause of issues, and AppFabric sets up a couple of custom logs.

To see the Debug and Analytic logs you need to set the following:

Right click on a debug or analytic log and enable it. Make sure you disable it when you are finished to prevent performance degradation due to high volume event capture.

From these logs I could determine that my IIS configuration had invalid entries, the SQL Server login was failing for the Event Collector and so on. I’ll talk more about diagnosing IIS configuration issues and the workflow persistence store in the next post…