CIAD SV 04 Physical Plant Network
Added by Michael Meisinger, last edited by Michael Meisinger on Feb 18, 2010

4.4.3 Network Strategy

4.4.3.1 Secure Scalable Service Platform Deployment Pattern

This section specifies a logical deployment pattern for a secure, high-availability, scalable service installation. This pattern is applied in the deployment of each terrestrial CI CyberPoP, as defined above. The pattern is described independently of the OOI CI: it was first developed in the context of business information and exchange systems and was then adapted to fit the needs of the OOI CI deployment.

The service installation focuses on satisfying the following requirements in priority order: Security, Performance, High Availability, Scalability and Offsite Management. Figure 4.4.3.1-1 illustrates the logical deployment pattern for the CI CyberPoP in relationship to the different users that need to interact with the Installation and the Internet as the intervening communication infrastructure between them.

Figure 4.4.3.1-1 Logical Deployment Model (SV-2)

Security is achieved by the isolation of access and the separation of functionality. Figure 4.4.3.1-1 shows the complete separation of the production environment from the management environment. It also shows the isolation of the external, "end user" logic from the internal service components. Isolation follows the principle of layered defense, as commonly employed by corporations with a DMZ model. The principle is that nothing of intrinsic value is deployed on the External Production servers, which are placed on the Public network. The Internal Production servers are placed on the Service network, which is an isolated and secure network. The Service network has no direct connectivity in or out to the Internet. All production assets of value are stored on servers attached to this network or its sister network, the Data network. The Management network is completely inaccessible from the production environment. All connections between the Management and the Production environments must be established from the Management network. Services from one network are deployed to other networks (e.g., DNS or ODBC) by presenting them as Virtual IP addresses. One of the fundamental design principles, from both a security and a performance point of view, is not to use IP packet routing ("Layer 3 Routers") in the network infrastructure. It is too easy to open up holes by accident and, given the flow of production traffic, routing adds no production value.
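
The isolation rules above can be read as a small connection policy. The following is a minimal sketch, not part of the original design: the `Net` names and the `may_connect` helper are hypothetical, and the Public-to-Service and Service-to-Data flows are an assumption about which services are presented across networks as Virtual IP addresses.

```python
from enum import Enum


class Net(Enum):
    INTERNET = "internet"
    PUBLIC = "public"
    SERVICE = "service"
    DATA = "data"
    MANAGEMENT = "management"


# Networks that make up the production environment.
PRODUCTION = {Net.PUBLIC, Net.SERVICE, Net.DATA}

# Assumed (initiator, target) flows inside production. Only the Public
# network faces the Internet; inner services are assumed to be reached
# through Virtual IP addresses presented to the adjacent network.
ALLOWED_FLOWS = {
    (Net.INTERNET, Net.PUBLIC),
    (Net.PUBLIC, Net.SERVICE),
    (Net.SERVICE, Net.DATA),
}


def may_connect(initiator: Net, target: Net) -> bool:
    """Return True if `initiator` may open a connection to `target`."""
    if initiator is Net.MANAGEMENT:
        # Connections between Management and Production are always
        # established from the Management side.
        return target in PRODUCTION
    return (initiator, target) in ALLOWED_FLOWS
```

Under these assumptions, `may_connect(Net.MANAGEMENT, Net.SERVICE)` is allowed, while `may_connect(Net.SERVICE, Net.MANAGEMENT)` and `may_connect(Net.INTERNET, Net.SERVICE)` are both refused.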

Performance is obtained by removing as much in-line packet inspection on the outbound traffic as possible and through scalable concurrent execution. The first aspect is implemented by using only a switched network infrastructure. The second is achieved by decomposing the logic of the system into independent functional concerns that can be linearly scaled through the addition of incremental resources. The lines of decomposition for the Service are ordered: first by user session independent processing, then by shared read-only processing, and finally by shared read/write (transactional) processing.

High Availability (HA) is the result of implementing a fully redundant system, such that no single failure will bring down the whole system, and of regularly testing failure scenarios. There are multiple scales of concern when addressing this aspect of the system. The scales range from ensuring that every hardware system receives power from two independent sources and pathways to those sources, through the duplication of every hard and soft system component in the installation, to duplication of the installation in geographically diverse locations. The implementation of an HA system is not an all-or-nothing strategy. With regard to the scope of redundancy to be addressed, there are risk vs. cost trade-offs. Lower-level redundancy, such as power and Internet connectivity, can be delegated to the Hosting (a.k.a. Collocation, "Colo") facility. Geographical diversity can be addressed at a later phase in the project's maturity. The recommendation for this system is to address only system redundancy within the Installation and to operate on the premise that the Colo will provide uninterrupted power and continuous bandwidth within an environmentally resilient facility.

The network infrastructure (i.e., Firewalls, Application Routers and Ethernet Switches) is deployed entirely in pairs. Based on final cost constraints, a decision has to be made on whether to run these pairs in Active/Active or Active/Passive mode. Active/Passive mode is typically less expensive because the passive system in a pair is considered to be running a backup copy of the Active system's operating system. However, an Active/Passive pair delivers approximately half the performance of an Active/Active configuration. The Ethernet Switches are deployed in an Active/Active configuration with VLANs trunked across them. All network infrastructure components should be run with redundant power supplies, the appropriate cross-connects between HA pairs, and multi-homed uplink connections between the HA network infrastructure layers (i.e., App Routers, Ethernet Switches, and the Firewalls; see Figure 4.4.3.2-2).

Different options for Server redundancy can also be considered. Both the power and network connectivity can be duplicated for a full HA solution. Duplicating the power is only a matter of cost. Duplicating network connections to both switches (multi-homing) and configuring the network interface cards (NICs) to fail over the connection when one of the switches fails is quite doable, but complex to implement and test. The recommendation is that servers be homed to only one switch and one power source. If a switch fails, the portion of the computing capacity connected to that switch will be lost. At the smallest scale of deployment, this can represent half of the capacity of the installation. If a power supply or the NIC fails, the server is lost. As the installation scales, the loss of an individual server becomes less significant.
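
The capacity trade-off of single-homed servers can be made concrete with a little arithmetic. This is a minimal sketch under simplifying assumptions that are not in the original text: every server contributes equal capacity, and the function names are hypothetical.

```python
def capacity_lost_on_switch_failure(servers_per_switch, failed_switch):
    """Fraction of capacity lost when one switch fails.

    `servers_per_switch` lists how many single-homed servers hang off
    each switch; every server is assumed to contribute equal capacity.
    """
    total = sum(servers_per_switch)
    return servers_per_switch[failed_switch] / total


def capacity_lost_on_server_failure(total_servers):
    """Fraction of capacity lost when one server (PSU or NIC) fails."""
    return 1 / total_servers
```

With one server on each of two switches, `capacity_lost_on_switch_failure([1, 1], 0)` is 0.5, the "half of the capacity" case at the smallest scale. The loss from a single server failure shrinks as servers are added: 0.5 with two servers, 0.125 with eight.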

The recommendation is to run the initial production installation with separate physical server pairs for the Service, Data and Management VLANs. This places six servers in operation within the installation. If any of the four production servers fails, a portion of one of the Management servers can be configured into the production network until the failed server can be replaced. The Management servers will have separate duties within the Management VLAN, but should be kept synchronized so that the full capabilities of both servers remain available if one of the physical Management servers fails.

Scalability of the computing and storage infrastructure is achieved by adding more units. In the case of the Application and MySQL servers, this can be done dynamically without requiring any portion of the Service to shut down. This is implemented by the Application Router through its mapping of VIPs to pools of servers. In the case of the MySQL Data Cluster, scaling is achieved by partitioning the database tables across more Data Node servers. This can only be achieved through a reconfiguration of the Data Cluster, which requires a re-initialization that will take the tables associated with the Data Cluster offline for a short period of time (~1-2 min). At a minimum, Subscriptions and Push Key will not function during this outage.

Management covers a number of independent concerns and roles involved with operations and maintenance of the Installation. The main decomposition of concerns is between content, application and system level management responsibilities. The Management VLAN is designed with an independent access mechanism to ensure the separation between users and staff. The VPN ensures the confidentiality of the communication between the Installation and any Management Point, and provides the convenience of being on the Management VLAN at the remote Management Point once the connection is established. The VPN system gives the Installation administrator the ability to govern from where (i.e., IP address) the Installation can be managed, independent of who (i.e., username) can manage it. The recommendation is that a VPN be used with a hardware token generator (SecurID from RSA) for two-factor authenticated access.

4.4.3.2 Virtual Technologies

Figure 4.4.3.2-1 illustrates the recommended deployment of service components within the virtual network and server environment. These components (IP addresses, LANs, Servers) all have physical and virtual representations.

Figure 4.4.3.2-1 Component Deployment Model (SV-2)

Figure 4.4.3.2-2 illustrates the actual physical assets and their interconnectivity. To understand how the Installation can use a small number of physical assets to implement the more complex Network Deployment Model, it is necessary to have a basic understanding of the mapping of virtual to real LANs, IP addresses, Private Networks and Servers.

Figure 4.4.3.2-2 CyberPoP Hardware Deployment Model (SV-2)

The Virtual LAN (VLAN) is used to segment an Ethernet Switch into multiple completely isolated networks. This is accomplished by assigning physical ports on the switch to one or more VLANs. An important addition to the VLAN concept is the "tagged" VLAN, which allows a single connection between two devices (a physical network segment) to carry traffic for multiple VLANs. The implication of these two capabilities is that one connection between the Application Router and the Switch can carry the traffic from all the isolated VLANs that need to present their services to another VLAN. This means one Application Router can support the remote introduction of VLANs and changes to the placement of servers on multiple VLANs without physical wiring modifications to the Installation.

The Virtual Private Network (VPN) allows a device in one network segment to join another network segment by establishing an IPsec tunnel over the intervening inter-network. Typically, once a device is incorporated into a VPN, it may not communicate directly with devices on its network of origin. This prevents the incorporated device from becoming a router between the two networks and thus operating outside the control of the network environment of the VPN. The capability of tunneling is a significant productivity enhancement for remote management of the production installation. VPNs are usually combined with strong authentication such as certificates and/or single-use token mechanisms. It is recommended that the VPN be protected with single-use tokens (i.e., SecurID token generators in combination with RADIUS and LDAP servers). This access mechanism applies only to the content, application and system management personnel.

The use of a Virtual IP address (VIP) allows a cluster of IP addresses to handle the traffic directed at a single IP address. The class of networking equipment originally called "Load Balancers" and more recently "Application Routers" manages VIPs and their associated pools of IPs. The names arise from their capability of providing a wide range of rules for how the IP traffic arriving at the VIP is directed to its pool of IP addresses. If all the connections to the VIP are stateless, such as HTTP requests, then a simple round-robin or server-load-based routing rule works. If the connections are stateful, such as an ODBC connection, then the traffic can be partitioned based on the source IP address of the connection. In this case, all traffic for an established connection is routed to the same IP address in the IP pool. With newer technologies, routing can also be based on content within the packet, commonly referred to as Content-based Routing. It is recommended that this feature be employed at the point when it becomes advantageous to support the notion of a stateful user session cached on a server.
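
The two routing rules described above can be sketched as follows. This is an illustrative model only, not the configuration of any particular Application Router: the `VirtualIP` class, its method names, and the use of a CRC32 hash for source-based partitioning are all assumptions.

```python
import zlib


class VirtualIP:
    """Toy model of a VIP fronting a pool of server IP addresses."""

    def __init__(self, address, pool):
        self.address = address
        self.pool = list(pool)
        self._next = 0  # index of the next server for round robin

    def route_round_robin(self):
        """Stateless traffic (e.g., HTTP): rotate through the pool."""
        target = self.pool[self._next % len(self.pool)]
        self._next += 1
        return target

    def route_by_source(self, source_ip):
        """Stateful traffic (e.g., ODBC): pin each source IP to one server."""
        index = zlib.crc32(source_ip.encode("ascii")) % len(self.pool)
        return self.pool[index]

    def add_server(self, ip):
        """Grow the pool without taking the VIP out of service."""
        self.pool.append(ip)
```

For a pool of `["10.0.0.1", "10.0.0.2"]`, successive `route_round_robin()` calls alternate between the two servers, while `route_by_source("192.0.2.7")` returns the same server on every call. Note that in this simple hash scheme, adding a server can change which server a given source maps to, so a real deployment would need to preserve established connections when the pool grows.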

The most recent addition to the family of virtualized components ready for production use is the Virtual Machine (VM). Advances in the past five years in VM efficiencies, the continual improvements in CPU performance, and the emergence of multi-core processors in the presence of tens of Gigabytes of memory make this technology an extremely effective deployment choice. A single physical server with two quad-core processors and 16 Gigabytes of RAM can be segmented into 1 real and 7 Virtual Machines, each securely isolated with a single processor and 2 Gigabytes of RAM. Each virtual machine can provide an independent set of functionality, based on a separate Operating System and communicating on a different VLAN.

This capability could be used for a low-cost initial deployment of all the Production and Management Services on a single redundant pair of physical servers (not our recommendation). One open issue that needs testing is whether tagged VLANs and VMs are compatible and secure. If not, it will only mean that each physical server needs a physical network interface for each VLAN of which it is a member.
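
The segmentation example above (two quad-core processors, 16 Gigabytes of RAM, 1 real and 7 virtual machines) follows from simple division. A minimal sketch of that arithmetic, assuming one core per machine and an even RAM split; the function name and the returned keys are hypothetical:

```python
def partition_server(sockets, cores_per_socket, ram_gb, cores_per_machine=1):
    """Split a physical server evenly into one real machine plus VMs."""
    total_cores = sockets * cores_per_socket
    machines = total_cores // cores_per_machine
    return {
        "machines": machines,              # real + virtual
        "virtual_machines": machines - 1,  # one slice stays with the host
        "ram_gb_per_machine": ram_gb / machines,
    }
```

Calling `partition_server(2, 4, 16)` yields 8 machines in total, 7 of them virtual, with 2.0 Gigabytes of RAM each, matching the figures in the text.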

4.4.3.3 Physical Site Deployment Strategy

The decomposition of the CI Services into deployment packages evolves with the understanding of the nature of user demand and the behavior of the code base. That said, the lines of decomposition chosen for a distributed system at the outset of a development effort tend to persist for a long time, given the difference between local and remote execution methodologies for invoking functionality. It is important to identify the major lines of decomposition and validate them early in a product's lifecycle. The recommendation is:

  • Separate the business logic from the data management,
  • Separate the End User and the Content Management business logic,
  • Separate the Content Files (i.e., images, binaries, etc.) from the Database,
  • Using the End User perspective, separate the Read-only Database tables from the transactional tables.

As per Figure 4.4.3.2-1, this results in five functional groups: two business logic, two database and one file management. The recommendation is that the two business logic and two database groups be incorporated into four separately deployable VM packages:

  • User Session package containing WWW and Rights Management logic,
  • Content Management package containing Content and Catalog update logic,
  • Database package containing the top level database logic and all the catalog specific logic,
  • Data Cluster package containing the transactional logic of the database.

The File Management functionality amounts to a distributed file replication mechanism relying on "rsync" and driven by the Content Management business logic, so it does not need a separate VM package.

In addition to the VM packages directly associated with the Service, the recommendation is to assemble a set of Installation Support packages:

  • A Base Server package, from which all other packages are derived, containing the standard security and management functionality expected of systems in the Installation,
  • The System Services package containing the system level services that support the business service (i.e., HTTP proxy, DNS, NTP, Send Mail),
  • The Management Services package containing all of the monitoring and service validation functionality,
  • The AAA Services package containing the management environment authentication, authorization, and auditing (logging) functionality.

It is recommended that all VM packages be maintained in a central repository and scripts developed to parameterize their deployment into the different Installations and specific host environments on which they will be run.
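
One way to parameterize deployment from a central repository is to render a per-host descriptor from a shared template. The sketch below is purely illustrative: the descriptor fields, the `.example` host naming scheme, and the `render_deployment` helper are hypothetical and not part of the original design.

```python
from string import Template

# Hypothetical descriptor for one deployed VM package instance.
DEPLOYMENT_TEMPLATE = Template(
    "package=${package}\n"
    "installation=${installation}\n"
    "vlan=${vlan}\n"
    "hostname=${package}-${index}.${installation}.example\n"
)


def render_deployment(package, installation, vlan, index):
    """Fill in the template for one package on one host.

    Template.substitute raises KeyError if a placeholder is left
    without a value, so an incomplete parameter set fails loudly.
    """
    return DEPLOYMENT_TEMPLATE.substitute(
        package=package,
        installation=installation,
        vlan=vlan,
        index=index,
    )
```

For example, `render_deployment("user-session", "staging", "service", 1)` produces a descriptor whose last line is `hostname=user-session-1.staging.example`. The same package definition can then be rendered for the Development, QA, Staging and Production Installations by changing only the parameters.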

Based on the deployment model of supporting multiple independent distribution Channels with the same Service platform, it is recommended that these packages be made available to the distribution Channels as a shared resource and that the Channels be given the responsibility to localize, deploy and manage their own Installations. This may cause some differences in quality of delivery across the Channels, but these can be mitigated by:

  • Developing local packaging mechanisms that can incorporate localized functionality on top of the base platform without altering the base, and
  • Providing a validation/certification program for Channel installations.

To ensure efficient and reliable evolution of the Service, it is recommended that a formal environment and set of procedures be established for the promotion of new versions from development to production through the controlled exposure of the new release to different testing regimes. The recommended pipeline of testing regimes is:

  • Development Installation for development-driven unit and regression testing as well as performance analysis,
  • QA Installation for the formal in-house regression, performance and user testing,
  • Staging Installation for formal external user and integration testing as well as the product promotion of the next release.

The installation of the development test environment needs to reflect the service decomposition only. It can reside on a different hardware configuration. This can be as minimal as a single server supporting multiple VMs. The QA installation should represent the service and system level configuration. It need not address the High Availability aspect of the Production environment. The Staging installation should mimic the production environment exactly to ensure all installation scripts and network configurations have been thoroughly vetted prior to deployment in production. With some care and concentration using the deployment package repository and the VM configurations, the server and networking infrastructure for development, QA and Staging can be shared. This will limit the total cost of the complete testing environment to 1x of the minimum Production installation.

The Service version upgrade strategy in Production is a non-trivial concern, especially if it is to be accomplished while the system remains running. It is important to institutionalize the fact that the QA process continues into the initial days of a Release into Production. To support this premise and mitigate any serious disruption to the service, it is strongly recommended that a minimum strategy for immediate fallback to the previous Service Release be put in place during the initial period of any upgrade. More sophisticated schemes can be devised that are based on concurrent support for two versions in a phased exposure of a new Release to the user base. Both models can be supported with the Installation architecture proposed, and the recommendation is to start with the simpler switch-over model until demand (scale) and experience in the Service are gained. The fundamental tool for managing the Release in both cases is the use of Application Router VIPs to direct traffic to a particular set of VMs that represent the two releases.
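
The simpler switch-over model with immediate fallback can be sketched as a VIP that points at one of several release pools. This is a minimal illustration under assumed names (`ReleaseVIP`, `switch_to`, `fall_back`), not a description of any Application Router's actual interface.

```python
class ReleaseVIP:
    """Toy model of a VIP that directs traffic to one release's VM pool."""

    def __init__(self, pools, active):
        self.pools = dict(pools)  # release name -> list of VM IP addresses
        self.active = active      # release currently receiving traffic
        self.previous = None      # release to fall back to, if any

    def targets(self):
        """VM addresses currently receiving traffic."""
        return self.pools[self.active]

    def switch_to(self, release):
        """Point the VIP at a new release, remembering the old one."""
        if release not in self.pools:
            raise KeyError(f"unknown release: {release}")
        self.previous, self.active = self.active, release

    def fall_back(self):
        """Immediately return traffic to the previous release."""
        if self.previous is None:
            raise RuntimeError("no previous release to fall back to")
        self.active, self.previous = self.previous, None
```

Because the VMs of the previous release stay deployed during the initial period of an upgrade, falling back is only a change of the VIP mapping rather than a redeployment.
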
Scaling most of the Service Installation can be accomplished while the service is in operation. That said, all alterations to the service should be made during quiet periods in the Service's weekly usage cycle until a thorough understanding of the upgrade procedures has been developed and practiced. The network infrastructure can, for the most part, be incrementally scaled through equipment upgrades and the incremental addition of duplicate equipment. Scaling the services and servers is also only a matter of adding duplicate equipment. With one exception, these can be added while the service is running. The exception is scaling the Data Cluster, as mentioned above: it will require a restart of the Data Cluster to incorporate additional data nodes. To do this while the Service is running will require a Service software strategy for suspending interactions with transactional tables (i.e., User, Phone, Subscription, Transaction) for a period of time on the order of less than 2 minutes.
