Reliability, Availability, and Serviceability (RAS) at VGS

Organizations run mission-critical projects and operations on the VGS platform. That's why we are committed to delivering a stable, secure, and scalable infrastructure that ensures seamless performance and uninterrupted service.

icon

At VGS, we treat the security and reliability of our cloud platform and the data it hosts with the utmost importance. Building trust and assurance with our customers is a crucial commitment for us.

An Overview of VGS

Section 1: Cloud Infrastructure

  • Data and Databases
  • Physical Location

Section 2: Compute Infrastructure for VGS Services

  • Distributed Services Architecture
  • Optimized Global Performance for Mission-Critical Operations

Section 3: Scalable, Resilient, and Reliable Platform

  • Service Design
  • Cell-based Architecture
  • Resilience Through Periodic Testing

Section 4: Quality and Release Practices

  • Quality Control Strategy
  • Release Schedules
  • Release Strategy

Section 5: Business Continuity

  • RPO and RTO

An Overview of VGS

VGS products are designed to deliver enterprise-leading compliance and performance by descoping customers from systems that exchange electronic transactions initiated by cardholders using payment cards.

VGS offers low-latency and high-throughput transaction processing and is additionally connected with highly redundant networking to maintain a strict performance envelope.

All components described here are designed with redundancy and high availability in mind to ensure that we continue to meet your enterprise processing and availability requirements.

Cloud Infrastructure

VGS uses infrastructure provided by Amazon Web Services (AWS), the world's leading Cloud Service Provider with whom we maintain a strategic partnership. AWS is responsible for protecting the infrastructure, which includes the hardware, software, networking, and data centers that run AWS cloud services.

VGS uses several AWS cloud services, such as EC2 and RDS, for its applications. We have designed our security infrastructure and configuration using AWS-recommended best practices for security and cloud architecture.

Data and Databases

Data and Databases

VGS databases are designed with multi-zone and multi-region failover redundancy, ensuring your data maintains integrity and is always accessible.

Key Features

  • Multi-AZ Deployment: At VGS, databases are deployed across multiple Availability Zones (AZs) within a region, providing enhanced availability and durability.
  • Automatic Failover: In the event of a failure within a single AZ, our systems automatically switch to a standby database instance, with failover completing in a few seconds and zero data loss.
  • Regular Backups: Database snapshots are encrypted, stored in multiple regions, and retained for point-in-time recovery. This ensures data can be restored quickly in the rare event of data loss and is practiced periodically to validate.
  • Continuous Replication: Data is continuously replicated between live database replicas across zones and regions, achieving a Recovery Point Objective (RPO) close to zero and minimizing data loss during disruptions.
Physical Location

Physical Location

In the US geography, VGS services are provisioned in multiple AWS regions within North America. In the EU geography, VGS services are provisioned in AWS region within Central Europe. In each region, VGS uses multiple Availability Zones that are interconnected using low latency, high throughput, and highly redundant networking. Read more about AWS global infrastructure here.

VGS provides isolation between its production and pre-production (e.g., sandbox/live) environments. Customers may use these isolated sandbox environments to test their implementation without customer or business impact. VGS is constantly working on expanding our regional presence in other world geographies and you can check out services available in each zone/region here.

Compute Infrastructure for VGS Services

Compute Infrastructure for VGS Services

Distributed Services Architecture

VGS leverages a variety of services to build a secure, distributed, and resilient infrastructure, ensuring high availability, scalability, and fault tolerance. Our architecture is designed to handle failures gracefully while maintaining optimal performance. We build and maintain a multi-tenant application architecture that has been designed to enable our services to scale to billions of cards.

Key Features

  • High Availability: AWS EKS replicates control plane services across multiple Availability Zones (AZs), minimizing downtime and ensuring operational resilience.
  • Horizontally scalable architecture: VGS automates provisioning, scaling, and lifecycle management of our compute infrastructure, growing and condensing with our customer demand.
  • Multi-AZ Deployment: Our worker nodes are distributed across multiple Availability Zones (AZ), providing built-in redundancy and seamless failover capabilities in case of AZ-level disruptions.
  • Resilient Networking: Services within our architecture are interconnected using low-latency, high-throughput, and highly redundant networking, ensuring secure and efficient data exchange across environments.

VGS continues to evolve its infrastructure to enhance scalability and reliability while expanding into new regions.

Distributed Services Architecture

Optimized Global Performance for Mission-Critical Operations

VGS utilizes a globally distributed edge network to ensure the highest levels of reliability, availability, and serviceability for our customers. This network, powered by AWS Global Accelerator, intelligently routes traffic across strategically positioned edge regions, optimizing performance and minimizing latency for mission-critical operations. Read more about the benefits of edge regions and secure / private networking benefits of AWS Global Accelerator here.

How Our Edge Network Enhances Services:

  • Optimized Performance: Our edge network significantly accelerates API workloads and enhances network performance, ensuring rapid and efficient data processing.
  • Seamless Failover: In the event of regional disruptions, traffic is rerouted to the nearest healthy edge region, minimizing downtime and ensuring uninterrupted service continuity.

This strategic investment in a distributed edge infrastructure underscores our commitment to providing a robust, dependable, and high-performance platform that empowers businesses to operate with confidence.

Scalable, Resilient, and Reliable Platform

Service Design

Service Design

At VGS, our proxy-based architecture which is used by our Secure Data suite seamlessly integrates into your transaction flow, optimizing for high-throughput and low-latency processing without introducing additional round-trip time to external service providers. Acting as a transparent hop, our proxy secures communication with essential technologies like load balancers, CDNs, and firewalls—facilitating efficient and secure data handling. Our other APIs, such as those used by our Credential Management suite, are designed with strict latency requirements and are continuously monitored to ensure that performance of these endpoints matches or exceeds best in class performance.

Our proprietary algorithms for secure vaulting are engineered for minimal impact on compute and processing time, ensuring rapid transaction flows. Leveraging a modern, cloud-native tech stack (Java, Python, Kubernetes, AWS), VGS strikes an ideal balance between rapid innovation and unwavering reliability.

Cell-based Architecture

Cell-based Architecture

VGS has further innovated by adopting Cell-based Architecture to enhance the resilience and scalability of our services. This advanced architectural pattern is central to our strategy for fault isolation and service continuity, particularly within a multi-tenant public cloud environment hosted on AWS.

Key Features

  • Fault Isolation & Resilience: Each cell represents a cluster of machines equipped with robust bulkheads, creating hardened fault-isolation boundaries. This setup limits the impact of any service disruption, containing issues within specific workloads to ensure minimal effect on overall system operations.
  • Multi-Tenancy & Logical Separation: Our architecture supports logical separation of client workloads, protecting against noisy neighbor issues and minimizing blast radius in case of unforeseen changes. Each client's environment is logically isolated, offering a tailored and secure experience.
  • Progressive Deployment Strategies: We utilize automated, progressive canary deployments to introduce updates gradually both within and across cells, reducing the risk of widespread disruptions while continuously delivering innovative capabilities.
Resilience Through Periodic Testing

Resilience Through Periodic Testing

VGS proactively tests its multi-zone and multi-region failover mechanisms to ensure continuous reliability. A small percentage of traffic is regularly routed through secondary regions to validate redundancy, while automated failover drills and disaster recovery tests confirm our ability to meet strict RTO and RPO targets. Additionally, infrastructure audits and continuous health checks help identify potential risks before they impact service. This ongoing testing ensures our platform remains resilient, delivering best-in-class availability and performance for our customers.

Quality and Release Practices

VGS Engineering follows modern SaaS practices steeped in DevOps and SRE culture. Our SDLC is tuned for two-week Sprints. Our engineers leverage modern tooling all the way from building / releasing (CI / CD) to monitoring / operating (observability, on-call paging, status page, synthetic testing).

This is woven with security practices at all stages - designing, coding, testing, and post-deployment monitoring. In addition, we follow a code promotion strategy where we run appropriate tests at each pre-production environment step as we propagate the change all the way to production.

Quality Control Strategy

Quality Control Strategy

VGS follows best practices for testing our services and APIs to ensure all new functionality is delivered against a written test strategy.  This aligns test objectives against KPIs and requirements set forth by the business and acceptance criteria defined by the product engineering teams. These artifacts are delivered as a set of unit, integration, performance, and acceptance tests run against VGS services and systems during the development and deployment of each release through our continuous integration (CI) practice.

Adjacent to the VGS CI practice, synthetic transactions are continuously run against VGS services to simulate user activity. They are monitored alongside actual user transactions to ensure our systems perform within expected, acceptable limits.

Release Schedules

Release Schedules

VGS platform updates (for hardware, software, performance, or scale) are hassle-free and transparent to our customers. We offer a high level of predictability while also providing a continuous stream of new features and fixes.

VGS typically updates its applications during off-peak hours. The only time we make an exception is to deliver “hot fixes” for critical service issues. Regardless of the hour, our maintenance activities are generally performed without causing any downtime.

Release Strategy

Release Strategy

VGS releases are not monolithic in nature: we only deploy the set of services that need to change and can roll them back individually if required. This allows us to isolate potential issues to a specific component of one application and prevent it from affecting the update of other applications.

Our releases are performed by expert service owners who effectively function as “release managers.” The service owners are specifically trained to ensure a high level of discipline in change management and risk mitigation. In addition, we have managerial governance & oversight for production releases.

Business Continuity

At VGS, our infrastructure is designed for high availability and rapid recovery, ensuring minimal service disruption in the event of failures. Our Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are among the best in the industry, thanks to:

  1. Continuous Database Replication: VGS databases are replicated across multiple AWS Availability Zones (AZs) and regions in real time, ensuring near-zero data loss in the event of a failure.
  2. Multi-AZ & Multi-Region Compute Infrastructure: Our application services are deployed across multiple AZs and regions, enabling automatic failover, seamless continuity, and geographic redundancy. Read more about our comprehensive CRDR efforts here.
Scenario Chart

Conclusion

VGS is dedicated to building and maintaining a platform that delivers *exceptional reliability, availability, and serviceability*. Our advanced architectural patterns, multi-region redundancy, and commitment to continuous improvement ensure that your mission-critical operations are always secure, accessible, and performant.