Reliability, Availability, & Serviceability

Organizations run mission-critical projects and operations on the VGS platform. That's why we are committed to delivering stable and secure products, applications, and networks at scale.

At VGS, we treat the security and reliability of our cloud platform and the data it hosts with the utmost importance. Building trust and assurance with our customers is a crucial commitment for us.

An Overview of VGS

Section 1: Cloud Infrastructure

Data and Databases
Physical Location
Distributed Services Architecture
Compute Infrastructure for VGS Services

Section 2: Service Design

Cell-Based Architecture

Section 3: Quality and Release Practices

Quality Control Strategy
Release Schedules
Release Strategy

Section 4: Business Continuity

RPO and RTO

An Overview of VGS

VGS products are designed to deliver enterprise-leading compliance and performance by descoping customers from systems that exchange electronic transactions initiated by cardholders using payment cards.

VGS offers low-latency and high-throughput transaction processing and is additionally connected with highly redundant networking to maintain a strict performance envelope.

All components described here are designed with redundancy and high availability in mind to ensure that we continue to meet your enterprise processing and availability requirements.

Cloud Infrastructure

VGS uses infrastructure provided by Amazon Web Services (AWS), the world's leading Cloud Service Provider, with which we maintain a strategic partnership. AWS is responsible for protecting the infrastructure that runs AWS cloud services, including the hardware, software, networking, and data centers.

VGS uses several AWS cloud services, such as EC2 and RDS, for its applications. VGS has designed its security infrastructure and configuration using AWS-recommended best practices for security and cloud architecture.

Data and Databases

VGS services and databases use a multi-AZ (Availability Zone) deployment strategy, which improves availability and durability by placing database replicas in multiple Availability Zones within a region. An AWS whitepaper explains the fault isolation benefits that AWS Availability Zones and regions offer. VGS is constantly working on improving its High Availability (HA) posture by extending its fault-handling boundaries across regions and geographies. Talk to our team about our future direction.

High availability

Our core services, such as the Tokenization Vault Service, use this multi-AZ strategy: a hot standby database instance, replicated synchronously from the active instance, is kept ready in a secondary AZ within the same region. This provides high availability for our application services through AWS-managed automatic database failover, which can complete in as little as 60 seconds with zero data loss and no manual intervention.
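A caller that wants to ride through a failover window of this length could retry with backoff rather than fail on the first error. The following is a minimal sketch under stated assumptions, not VGS client code: the function name `call_with_retry`, the 90-second wait budget, and the backoff values are all hypothetical choices made only so that the budget exceeds a failover of roughly 60 seconds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_wait_seconds: float = 90.0,   # hypothetical budget, not a VGS value
    initial_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ConnectionError until it succeeds or the budget is spent."""
    waited = 0.0
    delay = initial_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            if waited + delay > max_wait_seconds:
                raise  # budget exhausted: surface the last error to the caller
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

The `sleep` parameter is injectable so the behavior can be checked without real waiting. Only `ConnectionError` is retried here; which errors are safe to retry depends on whether the operation is idempotent.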

Backups

VGS performs regular backups of our databases. The backups are stored in our primary region and replicated to another region. In the rare event of data loss, VGS can restore from one of the most recent snapshots. Database snapshots are encrypted, retained for multiple days, and support point-in-time recovery.
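To illustrate how snapshot retention bounds what can be restored, here is a small sketch that picks the most recent snapshot taken at or before a requested restore time. It is an illustration only: `pick_snapshot` is a hypothetical helper, and the 7-day retention default is an assumed placeholder because the text says only "multiple days".

```python
from datetime import datetime, timedelta
from typing import List, Optional


def pick_snapshot(
    snapshots: List[datetime],
    target: datetime,
    now: datetime,
    retention: timedelta = timedelta(days=7),  # assumed value, not from VGS
) -> Optional[datetime]:
    """Return the newest retained snapshot taken at or before `target`, if any."""
    oldest_allowed = now - retention
    candidates = [s for s in snapshots if oldest_allowed <= s <= target]
    return max(candidates, default=None)
```

With point-in-time recovery, the chosen snapshot would serve as the base and later changes would be replayed up to the target time; that replay step is outside this sketch.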

Physical Location

In the US geography, VGS platform services are provisioned in the US East (Northern Virginia) region. In the EU geography, they are provisioned in the EU Central (Frankfurt) region. Within those regions, VGS uses multiple Availability Zones interconnected by low-latency, high-throughput, highly redundant networking. See the AWS documentation on its global infrastructure for more detail.

VGS has purposefully built geo-isolation between its production and pre-production (e.g., dev/test) environments. Our pre-production environments are provisioned in the US West (Oregon) region. VGS is constantly working on expanding its regional presence in other world geographies. Talk to our team about our future direction.

Compute Infrastructure for VGS Services

Most of our application services are deployed as containers and orchestrated by Kubernetes, an industry-standard container orchestration framework. More specifically, we use EKS, an AWS-managed Kubernetes service designed and built by AWS with resiliency in mind. AWS fully manages the EKS control plane and deploys replicas of the control plane services across multiple AZs. Our services run on “worker nodes,” for which VGS uses EKS managed node groups to automate the provisioning and lifecycle management of the underlying EC2 nodes. Our worker nodes span multiple AZs, so AZ-level failures are handled for our application services without manual intervention.

Service Design

A proxy-based architecture is fundamentally more conducive to high-throughput, low-latency transaction processing than API calls to additional services, which incur extra round-trip time to the service provider. The proxy acts as one more hop in your regular transaction flow, much as ubiquitous technologies such as load balancers, CDNs, or firewalls do for other use cases.

Our algorithms for secure vaulting are highly efficient and do not add significant compute / processing time to the transaction. Finally, using a modern and cloud-native tech stack (Java, Kubernetes, AWS) allows VGS to effectively balance the pace of innovation with reliability.

Cell-Based Architecture

VGS has implemented a cell-based architecture to enhance the resilience of our services. Cells are clusters of machines separated by robust bulkheads, which create hardened fault-isolation boundaries that substantially limit the impact of failures. This design keeps any issue within a specific workload contained, minimizing overall impact.

We have engineered these cells for both redundancy and independent deployment. Additionally, services within these cells are containerized, enhancing our ability to maintain and manage system integrity. Our deployment processes have been refined to evenly distribute services across multiple cells, enabling us to undertake scaling and repairs, up to and including rapid re-provisioning of cells without disrupting the entire system. We can also provide additional segmentation (“isolation stacks”) based on volume or traffic management concerns.
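The fault-isolation idea can be sketched in a few lines: if each tenant is pinned to exactly one cell, a failure in that cell affects only the tenants assigned to it. This is an illustrative sketch, not a description of how VGS routes traffic; the hash-based assignment, the function names, and the cell and tenant identifiers are all assumptions.

```python
import hashlib
from typing import List


def assign_cell(tenant_id: str, cells: List[str]) -> str:
    """Map a tenant to one cell using a stable hash of its identifier."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_tenants(tenants: List[str], cells: List[str], failed_cell: str) -> List[str]:
    """Return only the tenants whose assigned cell is the failed one."""
    return [t for t in tenants if assign_cell(t, cells) == failed_cell]
```

Simple modulo hashing reassigns many tenants whenever the number of cells changes, so a real system would more likely use an explicit mapping table or consistent hashing. The sketch shows only the containment property.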

Quality and Release Practices

VGS Engineering follows modern SaaS practices steeped in DevOps and SRE culture. Our SDLC is tuned for two-week Sprints. Our engineers leverage modern tooling all the way from building / releasing (CI / CD) to monitoring / operating (observability, on-call paging, status page, synthetic testing).

Security practices are woven into every stage: design, coding, testing, and post-deployment monitoring. In addition, we follow a code promotion strategy in which we run appropriate tests at each pre-production environment step as we propagate a change all the way to production.

Quality Control Strategy

VGS follows best practices for testing our services and APIs to ensure all new functionality is delivered against a written test strategy. The strategy aligns test objectives with the KPIs and requirements set by the business and with the acceptance criteria defined by the product engineering teams. These artifacts are delivered as a set of unit, integration, performance, and acceptance tests that run against VGS services and systems during the development and deployment of each release through our continuous integration (CI) practice.

Adjacent to the VGS CI practice, synthetic transactions are continuously run against VGS services to simulate user activity. They are monitored alongside actual user transactions to ensure our systems perform within expected, acceptable limits.
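One way to check that synthetic transactions stay "within expected, acceptable limits" is to evaluate a window of probe results against a latency percentile and a success-rate threshold. The sketch below assumes a nearest-rank 95th percentile and hypothetical thresholds (250 ms, 99.9% success); neither the names nor the numbers come from VGS.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProbeResult:
    latency_ms: float
    succeeded: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def within_limits(
    results: List[ProbeResult],
    max_p95_ms: float = 250.0,        # hypothetical latency limit
    min_success_rate: float = 0.999,  # hypothetical success threshold
) -> bool:
    """True when the probe window meets both the latency and success limits."""
    if not results:
        return False  # an empty window is treated as unhealthy
    success_rate = sum(r.succeeded for r in results) / len(results)
    p95 = percentile([r.latency_ms for r in results], 95)
    return success_rate >= min_success_rate and p95 <= max_p95_ms
```

Treating an empty window as unhealthy is a deliberate choice here: a probe that stops reporting is itself a signal worth alerting on.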

Release Schedules

VGS platform updates (for hardware, software, performance, or scale) are hassle-free and transparent to our customers. We offer a high level of predictability while also providing a continuous stream of new features and fixes.

VGS typically updates its applications during off-peak hours. The only time we make an exception is to deliver “hot fixes” for critical service issues. Regardless of the hour, our maintenance activities are generally performed without causing any downtime.

Release Strategy

VGS releases are not monolithic in nature: we only deploy the set of services that need to change and can roll them back individually if required. This allows us to isolate potential issues to a specific component of one application and prevent it from affecting the update of other applications.
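The non-monolithic approach can be pictured as a diff between what is deployed and what is desired: only services whose version changes are touched, and each keeps its own rollback target. This is a minimal sketch with hypothetical service names and version strings, not VGS release tooling.

```python
from typing import Dict, Optional, Tuple

# Maps service name -> (previous version or None if new, new version)
ReleasePlan = Dict[str, Tuple[Optional[str], str]]


def plan_release(deployed: Dict[str, str], desired: Dict[str, str]) -> ReleasePlan:
    """Include only the services whose desired version differs from the deployed one."""
    return {
        name: (deployed.get(name), version)
        for name, version in desired.items()
        if deployed.get(name) != version
    }


def rollback_target(plan: ReleasePlan, service: str) -> str:
    """Return the version to restore for a single service in the plan."""
    previous, _new = plan[service]
    if previous is None:
        raise ValueError(f"{service} has no previous version to roll back to")
    return previous
```

For example, if only a hypothetical `proxy` service moves from `2.1.0` to `2.2.0`, the plan contains that one entry and every other service is left alone.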

Our releases are performed by expert service owners who effectively function as “release managers.” The service owners are specifically trained to ensure a high level of discipline in change management and risk mitigation. In addition, we have managerial governance & oversight for production releases.

Business Continuity

RPO and RTO

Because VGS databases are continuously replicated across multiple AZs, and the compute infrastructure for our application services spans multiple AZs, as described earlier in this document, it is possible for VGS to achieve an in-region RPO (Recovery Point Objective) that is close to zero and an RTO (Recovery Time Objective) that is among the industry's best.

RTO scenarios are based on automatically re-establishing the VGS applications when a primary database replica becomes unavailable and/or multiple application service compute nodes become unavailable.
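The two objectives can be made concrete with timestamps from an incident: the recovery point is the gap between the last replicated write and the failure, and the recovery time is the gap between the failure and restored service. The helper names below are hypothetical, and the 60-second recovery in the usage note simply reuses the failover figure quoted earlier in this document.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_write: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last replicated write and the failure."""
    return max(failure_time - last_replicated_write, timedelta(0))


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime window: time between the failure and restored service."""
    return service_restored - failure_time
```

With synchronous replication, the last replicated write coincides with the last acknowledged write, so `achieved_rpo` is zero. A failure at 12:00:00 with service restored at 12:01:00 gives an `achieved_rto` of 60 seconds.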