AWSARE.com
Architecturally Resilient Environment Discovery Interview
Security | Reliability | Performance | Cost Optimization | Operational Excellence

Reliability

REL 1 How are you managing AWS service limits for your accounts?
  AWS accounts are provisioned with default service limits to prevent new users from accidentally provisioning more resources than they need. AWS customers should evaluate their AWS service needs and request appropriate changes to their limits for each region used.  
Best practices:
  * Monitor and Manage Limits. Evaluate your potential usage on AWS, increase your regional limits appropriately, and allow for planned growth in usage.

  * Set Up Automated Monitoring. Implement tools, e.g., SDKs, to alert you when thresholds are being approached.
 
  * Be Aware of Fixed Service Limits. Be aware of unchangeable service limits and architect around these.
 
  * Ensure There Is a Sufficient Gap Between Your Service Limit and Your Max Usage to Accommodate for Failover.

  * Service Limits are Considered Across All Relevant Accounts and Regions.
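
As a rough illustration of the threshold and failover-gap practices above, the following Python sketch checks a limit against current usage. The `LimitCheck` type, the function names, and the 80% default threshold are assumptions made for this example and are not part of any AWS SDK; in practice the limit and usage figures would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class LimitCheck:
    """Hypothetical record of one service limit in one account and region."""
    name: str    # label for the limit, chosen by you
    limit: int   # current limit for this account and region
    usage: int   # current usage counted against that limit


def utilization(check: LimitCheck) -> float:
    """Return usage as a fraction of the limit."""
    if check.limit <= 0:
        raise ValueError("limit must be positive")
    return check.usage / check.limit


def approaching_limit(check: LimitCheck, threshold: float = 0.8) -> bool:
    """True when usage has reached the alert threshold (assumed 80% by default)."""
    return utilization(check) >= threshold


def has_failover_headroom(check: LimitCheck, failover_usage: int) -> bool:
    """True if this region could absorb failover_usage on top of its own usage."""
    return check.usage + failover_usage <= check.limit
```

For example, a region running 40 of 100 permitted instances could absorb a failover of 60 instances from another region, but not 61.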

REL 2 How are you planning your network topology on AWS?
  Applications can exist in one or more environments: EC2 Classic, VPC, or VPC by Default. Network considerations such as system connectivity, Elastic IP/public IP address management, VPC/private address management, and name resolution are fundamental to leveraging resources in the cloud. Well-planned and documented deployments are essential to reduce the risk of overlap and contention.  
Best practices:
  * Connectivity Back to Data Center not Needed.

  * Highly Available Connectivity Between AWS and On-Premises Environment (as Applicable). Use multiple DX circuits, multiple VPN tunnels, or AWS Marketplace appliances, as applicable.

  * Highly Available Network Connectivity for the Users of the Workload. Use highly available load balancing and/or proxies, a DNS-based solution, AWS Marketplace appliances, etc.

  * Non-Overlapping Private IP Address Ranges. The IP address ranges and subnets in your virtual private cloud should not overlap each other, other cloud environments, or your on-premises environments.

  * IP Subnet Allocation. Individual Amazon VPC IP address ranges should be large enough to accommodate an application’s requirements, including factoring in future expansion and allocation of IP addresses to subnets across Availability Zones.
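
The overlap and subnet allocation practices above can be checked before anything is deployed. The sketch below uses Python's standard `ipaddress` module; the function names, network labels, and CIDR values are hypothetical and chosen only for illustration.

```python
import ipaddress
from itertools import combinations, islice


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR ranges that overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def split_across_zones(vpc_cidr: str, new_prefix: int, zone_count: int):
    """Carve one equally sized subnet per Availability Zone out of a VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = list(islice(vpc.subnets(new_prefix=new_prefix), zone_count))
    if len(subnets) < zone_count:
        raise ValueError("VPC range is too small for the requested subnets")
    return subnets


if __name__ == "__main__":
    # Hypothetical address plan: the on-premises range collides with vpc-a.
    plan = {"vpc-a": "10.0.0.0/16", "vpc-b": "10.1.0.0/16", "on-prem": "10.0.128.0/17"}
    print(find_overlaps(plan))
    print(split_across_zones("10.0.0.0/16", 20, 3))
```

Allocating only three /20 subnets from a /16 leaves most of the range free, which is one way to keep room for future expansion.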

REL 3 How does your system adapt to changes in demand?
  A scalable system can provide elasticity to add and remove resources automatically so that they closely match the current demand at any given point in time.  
Best practices:
  * Automated Scaling. Use automatically scalable services, e.g., Amazon S3, Amazon CloudFront, Auto Scaling, Amazon DynamoDB, AWS Elastic Beanstalk, etc.

  * Load Tested. Adopt a load testing methodology to measure if scaling activity will meet application requirements.
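
One simple way to reason about matching resources to demand is a proportional rule: scale the instance count by the ratio of the observed metric to its target, then clamp the result to a minimum and maximum size. The sketch below is an illustrative simplification with hypothetical names and values; it is not the algorithm used by Auto Scaling or any other AWS service.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Proportional scaling rule (assumed for illustration).

    current: number of instances now serving traffic
    metric:  observed average value, e.g. CPU utilization in percent
    target:  value the metric should settle at
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    raw = math.ceil(current * metric / target)
    # Clamp so scaling never leaves the configured bounds.
    return max(min_size, min(max_size, raw))
```

With 4 instances at 90% utilization and a 60% target, the rule asks for 6 instances. A load test can then confirm whether that scaling activity actually meets application requirements.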

REL 4 How are you monitoring AWS resources?
  Logs and metrics are a powerful tool for gaining insight into the health of your applications. You can configure your system to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Ideally, when low-performance thresholds are crossed or failures occur, the system will have been architected to automatically self-heal or scale in response.  
Best practices:
  * Monitoring. Monitor your applications with Amazon CloudWatch or third-party tools.

  * Notification. Plan to receive notifications when significant events occur.

  * Automated Response. Use automation to take action when failure is detected, e.g., to replace failed components.
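
The monitor, notify, and respond pattern above can be sketched as a small threshold alarm that fires only after several consecutive breaching datapoints. The `ThresholdAlarm` class and its parameters are hypothetical names for this example; a real deployment would more likely rely on Amazon CloudWatch or a third-party tool than on hand-written code.

```python
from collections import deque
from typing import Callable


class ThresholdAlarm:
    """Fires on_alarm once when `periods` consecutive values exceed `threshold`."""

    def __init__(self, threshold: float, periods: int,
                 on_alarm: Callable[[list], None]):
        self.threshold = threshold
        self.periods = periods
        self.on_alarm = on_alarm           # notification or automated response hook
        self.in_alarm = False
        self._recent = deque(maxlen=periods)

    def record(self, value: float) -> bool:
        """Add a datapoint and return True while the alarm condition holds."""
        self._recent.append(value)
        breaching = (len(self._recent) == self.periods
                     and all(v > self.threshold for v in self._recent))
        if breaching and not self.in_alarm:
            # Notify only on the transition into the alarm state.
            self.on_alarm(list(self._recent))
        self.in_alarm = breaching
        return breaching
```

The `on_alarm` callback is where a notification would be sent or an automated action, such as replacing a failed component, would be triggered.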

REL 5 How are you executing change?
  Uncontrolled changes to your environment make it difficult to predict the effect of a change. Controlled changes to provisioned AWS resources and applications are necessary to ensure that the applications and the operating environment are running known software and can be patched or replaced in a predictable manner.  
Best practices:
  * Automated. Automate deployments and patching.

REL 6 How are you backing up your data?
  Back up data, applications, and operating environments (defined as operating systems configured with applications) to meet requirements for mean time to recovery (MTTR) and recovery point objectives (RPO).  
Best practices:
  * Automated Backups. Use AWS features, AWS Marketplace solutions, or third-party software to automate backups.

  * Periodic Recovery Testing. Validate that the backup process implementation meets RTO and RPO through a recovery test.
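
Whether a backup schedule can meet an RPO comes down to the longest interval during which new data was not yet backed up. The sketch below computes that worst-case gap from a list of backup completion times; the function names and sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def worst_gap(backup_times: list[datetime], now: datetime) -> timedelta:
    """Longest interval between consecutive backups, including the time since the last one."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """True if the worst-case data-loss window stays within the RPO."""
    return worst_gap(backup_times, now) <= rpo
```

Backups taken every six hours satisfy a six-hour RPO but not a four-hour one. A periodic recovery test is still needed to confirm that the backups can actually be restored within the RTO.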

REL 7 How does your system withstand component failures?
  Do your applications have a requirement, implicit or explicit, for high availability and low mean time to recovery (MTTR)? If so, architect your applications for resiliency and distribute them to withstand outages. To achieve higher levels of availability, this distribution should span different physical locations. Architect individual layers (e.g., web server, database) for resiliency, which includes monitoring, self-healing, and notification of significant event disruption and failure.  
Best practices:
  * Multi-AZ/Region. Distribute application load across multiple Availability Zones/Regions (e.g., DNS, ELB, Application Load Balancer, API Gateway).

  * Loosely Coupled Dependencies. For example, use queuing systems, streaming systems, workflows, load balancers, etc.

  * Graceful Degradation. When a component’s dependencies are unhealthy, the component itself does not report as unhealthy. It is capable of continuing to serve requests in a degraded manner.

  * Auto Healing. Use automated capabilities to detect failures and perform an action to remediate. Continuously monitor the health of your system and plan to receive notifications of any significant events.
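
Graceful degradation, as described above, can be sketched as a request handler that falls back to a static response when its dependency fails, together with a health check that still reports the component as healthy. The recommendation scenario and all names below are hypothetical and serve only to illustrate the pattern.

```python
from typing import Callable, Sequence


def get_recommendations(user_id: str,
                        fetch: Callable[[str], list],
                        fallback: Sequence[str]) -> dict:
    """Serve personalized items, or a static fallback if the dependency fails."""
    try:
        return {"items": fetch(user_id), "degraded": False}
    except Exception:
        # The dependency is unhealthy; keep serving in a degraded manner.
        return {"items": list(fallback), "degraded": True}


def health(dependency_ok: bool) -> dict:
    """The component stays healthy even when its dependency is not."""
    return {"status": "healthy", "degraded": not dependency_ok}
```

Reporting `degraded` separately from `status` lets monitoring raise a notification without a load balancer removing an instance that can still serve requests.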

 
REL 8 How are you testing your resiliency?
  Testing your resiliency can expose latent bugs that would otherwise surface only in production. Regularly exercising your procedures through game days will help your organization execute them smoothly.  
Best practices:
  * Playbook. Have a playbook for failure scenarios.

  * Failure Injection. Regularly test failures (e.g., using Chaos Monkey), ensuring coverage of failure pathways.

  * Schedule Game Days.

  * Root Cause Analysis (RCA). Perform reviews of system failures based on significant events to evaluate the architecture.
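
At the level of a single function call, failure injection can be as simple as a wrapper that raises an error for a configurable fraction of calls. The sketch below is a minimal, hypothetical example and is unrelated to how Chaos Monkey itself works; the `with_failure_injection` name and `InjectedFailure` exception are invented for illustration.

```python
import random
from typing import Callable, Optional


class InjectedFailure(RuntimeError):
    """Raised by the wrapper to simulate a failing dependency."""


def with_failure_injection(func: Callable, failure_rate: float,
                           rng: Optional[random.Random] = None) -> Callable:
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a value in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency call this way in a test environment shows whether callers follow the failure pathway the playbook expects.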

REL 9 How are you planning for disaster recovery?
  Disaster recovery (DR) is critical should restoration of data be required from backup methods. Your definition of, and execution on, the objectives, resources, locations, and functions of this data must align with your RTO and RPO.  
Best practices:
  * Objectives Defined. Define RTO and RPO.

  * Disaster Recovery. Establish a DR strategy.

  * Configuration Drift. Ensure that Amazon Machine Images (AMIs) and the system configuration state are up-to-date at the DR site/region.

  * DR Tested and Validated. Regularly test failover to DR to ensure RTO and RPO are met.

  * Automated Recovery Implemented. Use AWS and/or third-party tools to automate system recovery.
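
The outcome of a DR test can be compared against the defined objectives with a few lines of arithmetic: recovery time is measured from the start of the outage to the restoration of service, and data loss from the newest recoverable data to the start of the outage. The `DrTestResult` type and `evaluate` function below are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    """Hypothetical timestamps captured during one DR failover test."""
    outage_start: datetime
    service_restored: datetime
    last_recoverable_data: datetime  # newest data present after recovery


def evaluate(result: DrTestResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data loss against RTO and RPO."""
    recovery_time = result.service_restored - result.outage_start
    data_loss = result.outage_start - result.last_recoverable_data
    return {
        "recovery_time": recovery_time,
        "data_loss": data_loss,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss <= rpo,
    }
```

A test that restores service in 45 minutes but loses 10 minutes of data would meet a one-hour RTO and miss a five-minute RPO.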


Source: Information provided on this page is from the AWS Well-Architected Framework document.