One large federal government agency uses a system to assess environmental incidents and disasters. The system is a heterogeneous collection of GIS services supporting response operations for major natural disasters. The environment, originally constructed by Esri as a minimally viable proof of concept, consisted of a single EC2 instance running Esri’s ArcGIS Server, supported by one Microsoft SQL Server EC2 instance. In the wake of Hurricane Sandy, the system was pressed into service to support emergency disaster operations, and usage immediately exceeded capacity.
eGlobalTech (eGT) was contracted to rapidly improve the existing solution to create a resilient, production-ready platform. Since engaging with this customer, eGT has continued to provide enhancements and operational support for the system.
Specific Business, Organizational, and Technical Challenges
The existing single-node solution had predictable elasticity issues and was not reliable under any significant user load; it also did not align with the AWS Well-Architected Framework. Specifically, eGT noted the system:
- Was unable to handle necessary load during disaster support activities
- Had no fault tolerance for any component service in the environment
- Had significant scaling and resource locking issues
- Did not meet mandatory federal security compliance requirements
- Did not implement a sufficiently granular access control model for permissions
- Offered users no ability to request help from a service desk, nor was there a ticketing system to track issues
Detailed Technical Design
The most urgent requirement was to provide a more elastic and resilient solution for the ArcGIS Server and supporting services. ArcGIS Server can be configured as a cluster, with multiple nodes accessing a shared configuration store. This became the basis of eGT’s approach.
As the proof of concept environment did not fully address federal security requirements, we revised the architecture to incorporate tools and configurations that provided the necessary compliance. To address the specific deficiencies we identified during our initial assessment, we introduced the following technical features:
- Auto Scaling groups and Application Load Balancers to increase the durability of non-fault-tolerant software
- Route53 failover for high availability services that cannot use a load balancer
- Reserved Instances to minimize the cost of high-memory EC2 resources
- End-to-end automated provisioning and configuration management of resources using the AWS Ruby SDK and Chef (sketched below)
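eGT’s actual provisioning tooling uses the AWS Ruby SDK and Chef; the following is only a minimal Python (boto3) sketch of the Auto Scaling and load balancing pattern described above. The region, launch template, subnets, and target group ARN are placeholder assumptions, not values from the real environment.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Register an Auto Scaling group behind an existing ALB target group so that
# unhealthy ArcGIS nodes are replaced automatically. All names and ARNs below
# are placeholders.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="arcgis-cluster-a",
    LaunchTemplate={"LaunchTemplateName": "arcgis-node", "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/arcgis/abc123"
    ],
    HealthCheckType="ELB",        # replace instances the load balancer marks unhealthy
    HealthCheckGracePeriod=600,   # allow time for ArcGIS Server to start
)
```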
Notable Technical Challenges
In testing with early versions of ArcGIS Server, we determined that horizontal scaling was limited, due to lock contention and performance issues with the software. After a detailed analysis, we determined the optimal configuration was two pairs of server nodes sharing a config store, each pair comprising a different “sub-cluster” (an ArcGIS artifact) within the overall cluster.
Outbound traffic to ports such as TCP 1433 is denied by default for users inside this federal agency’s network, requiring the agency’s networking group to whitelist specific destination host/port combinations. The change control process for managing these whitelist entries is cumbersome, effectively requiring that this and similar services remain on fixed, known IP addresses.
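One common way to satisfy this kind of fixed-IP requirement on EC2, shown here only as an illustrative sketch, is to pin public-facing hosts to Elastic IPs so that whitelist entries never need to change. The instance ID and region are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allocate an Elastic IP and bind it to the instance so the host keeps a fixed,
# known address that the agency's networking group can whitelist once.
allocation = ec2.allocate_address(Domain="vpc")
ec2.associate_address(
    InstanceId="i-0123456789abcdef0",          # placeholder instance ID
    AllocationId=allocation["AllocationId"],
)
```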
The environment contains multiple VPCs: one holds production resources and another development resources, while a third, the management VPC, houses eGT’s Mu automation platform, security compliance tools, Active Directory infrastructure, and source control services. There is no direct communication between the development and production environments; management services reach into each environment via VPC peering connections.
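As a rough sketch of that peering arrangement, assuming placeholder VPC IDs, route table IDs, and CIDR blocks, the management VPC can be peered to the production VPC and routed as follows:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Peer the management VPC with the production VPC (IDs are placeholders).
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0aaaa00000000001",      # management VPC
    PeerVpcId="vpc-0bbbb00000000002",  # production VPC
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Route management-VPC traffic destined for the production CIDR over the peering link.
ec2.create_route(
    RouteTableId="rtb-0cccc00000000003",
    DestinationCidrBlock="10.10.0.0/16",       # placeholder production CIDR
    VpcPeeringConnectionId=pcx_id,
)
```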
eGT’s internal cloud deployment and management toolkit, Mu, was developed in part in response to this federal agency’s system requirements. Mu supports repeatable provisioning of new AWS resources and host-level configuration with Chef, handling integration with the rest of the environment, security compliance, and related setup. Mu also automatically enables Nagios monitoring of hosts and of the services running on them, supplementing CloudWatch, and dynamically assigns Route53 DNS to all addressable resources in the environment. Some public-facing resources that require high availability (MSSQL, SFTP), but which could not leverage an ELB or ALB, use Route53 health checks and failover records to maximize resilience.
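The failover-record pattern applied to services such as MSSQL and SFTP can be sketched as follows; the hostname, hosted zone ID, IP addresses, and health check IDs are illustrative placeholders rather than the environment’s actual values.

```python
import boto3

route53 = boto3.client("route53")

def failover_record(ip, role, health_check_id):
    """Build a PRIMARY or SECONDARY failover A record for a placeholder endpoint."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "mssql.example.internal.",   # placeholder hostname
            "Type": "A",
            "SetIdentifier": f"mssql-{role.lower()}",
            "Failover": role,                    # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,
        },
    }

# When the primary's health check fails, Route53 answers with the secondary's record.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            failover_record("10.10.1.10", "PRIMARY", "11111111-1111-1111-1111-111111111111"),
            failover_record("10.10.2.10", "SECONDARY", "22222222-2222-2222-2222-222222222222"),
        ]
    },
)
```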
Most services are implemented with EC2 resources. ArcGIS Server is a heavyweight Windows-based application unsuitable for containerized or serverless models. Most secondary applications were developed with a traditional data center model in mind, and thus were also not candidates for a microservices approach. The vendor-recommended memory configuration for ArcGIS Server maps to r4.2xlarge instance types. MSSQL database nodes in production must be similarly sized. These and other resource-heavy instance types dominate the environment, so eGT has leveraged 1-year reservations for most instances, for significant cost savings compared to on-demand instances.
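As an illustrative sketch, not a record of the actual purchases, reserving capacity for the r4.2xlarge fleet might look like the following; the platform, offering type, and instance count are assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find 1-year standard Reserved Instance offerings matching the ArcGIS node size.
offerings = ec2.describe_reserved_instances_offerings(
    InstanceType="r4.2xlarge",
    ProductDescription="Windows",     # assumed platform
    OfferingClass="standard",
    OfferingType="Partial Upfront",   # assumed payment option
    MinDuration=31536000,             # 1 year, in seconds
    MaxDuration=31536000,
)["ReservedInstancesOfferings"]

# Purchase coverage for the steady-state fleet (count is illustrative).
if offerings:
    ec2.purchase_reserved_instances_offering(
        ReservedInstancesOfferingId=offerings[0]["ReservedInstancesOfferingId"],
        InstanceCount=4,
    )
```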
ArcGIS clusters are each fronted by an Application Load Balancer (as of early 2017; the original implementation used Classic ELBs), using ArcGIS-native health check URLs. Because these health checks are often an inaccurate reflection of overall cluster health, particularly under high load, we altered them to verify the functionality of a specific application service endpoint instead. ArcGIS Configuration Store nodes act purely as fileservers and, in this design, remain a single point of failure for each ArcGIS cluster. This design predates the availability of FSx for Windows File Server, which would likely provide a far more durable, less costly solution. Newer iterations of ArcGIS are also fully functional on Linux, which would introduce the option of using EFS or GlusterFS to provide a more durable managed storage solution. Other, non-ArcGIS application services were deployed onto a general-purpose application stack using a standard three-tier (Autoscaled proxies => Autoscaled application nodes => RDS database) architecture. Each layer of this hosting stack is fronted by an Application Load Balancer.
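The health check adjustment can be sketched as a target group update along these lines; the target group ARN and the specific service path are placeholders, not the real endpoints.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Point the target group health check at a representative application service
# endpoint rather than the ArcGIS-native health check URL.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/arcgis/abc123",
    HealthCheckProtocol="HTTPS",
    HealthCheckPath="/arcgis/rest/services/SampleMap/MapServer",  # placeholder service
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3,
    Matcher={"HttpCode": "200"},
)
```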
ArcGIS Config Store nodes use EBS storage for their core data directories. EBS volumes containing this and other critical application components are snapshotted daily by a scheduled Python task on each instance. Additional backups of each ArcGIS config store’s data are pushed to S3 nightly, making them accessible to GIS administrators who may not have the AWS expertise needed to work with EBS snapshots.
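A minimal sketch of what such a scheduled backup task might look like follows; the volume ID, local archive path, and S3 bucket name are placeholder assumptions rather than the actual script.

```python
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

today = datetime.date.today().isoformat()

# Snapshot the EBS volume holding the ArcGIS config store data directory.
ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description=f"arcgis-config-store daily snapshot {today}",
)

# Push a nightly archive of the config store data to S3 so GIS administrators
# can retrieve it without working with EBS snapshots.
s3.upload_file(
    Filename=fr"D:\backups\config-store-{today}.zip",   # placeholder archive path
    Bucket="example-gis-backups",                        # placeholder bucket
    Key=f"config-store/{today}.zip",
)
```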
The original environment included a large, publicly-readable data tree located on a single Windows node running IIS. This service cohabited with the main MSSQL server backing many ArcGIS services, as well as some .NET applications, and the three often contended for resources in the environment’s early stages. To keep these services from interfering with one another, we moved the .NET applications to a dedicated application hosting stack. These applications are consumed by in-house ArcGIS services, which could be reconfigured appropriately, and thus did not present a major logistical hurdle to migration.
Next, we migrated the public tree data to more resilient storage. Prior to the general availability of EFS, we constructed a GlusterFS cluster, with storage replicated across four Availability Zones, and migrated the data tree there. The tree was later migrated to EFS, which served as a drop-in replacement for GlusterFS, at a significant savings in management overhead and storage cost. In front of this storage, we built a dedicated group of servers to expose the data in three ways: 1) a publicly browsable HTTP repository, 2) a Samba-shared network service available to internal ArcGIS and application hosting nodes, and 3) an authenticated SFTP service for users with privileges to manage the file repository.
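Standing up the EFS replacement can be sketched as follows, assuming placeholder subnet and security group IDs; one mount target per Availability Zone lets the HTTP, Samba, and SFTP front-end nodes all reach the same tree.

```python
import boto3

efs = boto3.client("efs", region_name="us-east-1")

# Create the shared file system that replaces the GlusterFS cluster.
fs = efs.create_file_system(
    CreationToken="public-data-tree",
    PerformanceMode="generalPurpose",
)

# One mount target per Availability Zone (subnet and security group IDs are placeholders).
for subnet_id in ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]:
    efs.create_mount_target(
        FileSystemId=fs["FileSystemId"],
        SubnetId=subnet_id,
        SecurityGroups=["sg-0123456789abcdef0"],
    )
```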
The system’s core database services are hosted on a SQL Server 2012 AlwaysOn cluster implemented on EC2. This cluster supports most of the GIS applications in the environment, as well as a small number of specialized .NET applications. The AlwaysOn cluster consists of two nodes, a primary and a secondary, using MSSQL’s internal failover mechanism in tandem with Route53 health check failover. A third node, the watcher, serves as a tiebreaker in case of split-brain failures, as well as a file repository for synchronization between the two database servers. The databases themselves reside on EBS volumes local to each server.
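The Route53 side of this failover can be sketched with a TCP health check against the primary MSSQL listener, which then drives the failover records shown earlier; the IP address is a placeholder.

```python
import uuid
import boto3

route53 = boto3.client("route53")

# TCP health check against the primary MSSQL listener; when it fails, the
# associated Route53 failover record flips traffic to the secondary node.
route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "IPAddress": "10.10.1.10",   # placeholder primary node IP
        "Port": 1433,
        "Type": "TCP",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)
```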
Metrics for Success
The original solution could host a maximum of ~200 live ArcGIS services. The current environment has hosted between 1,100 and 1,600 active services at various points.