Architecturally Resilient Environment Discovery Interview

Operational Excellence

OPS 1 What best practices for cloud operations are you using?
  Effective preparation is required to drive operational excellence. Using operations checklists ensures that your workloads are ready for production operation. The use of checklists prevents unintentional promotion to production without effective preparation.  
Best practices:
  * Operational Checklist. Create an operational checklist that you use to evaluate if you are ready to operate the workload.

  * Proactive Plan. Have a proactive plan for events (e.g., marketing campaigns, flash sales) that prepares you for both opportunities and risks that could have a material impact on your business (e.g., reputation, finances).

  * Security Checklist. Create a security checklist that you can use to evaluate if you are ready to securely operate the workload (e.g., zero day, DDoS, compromised keys).
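
As a rough illustration of how a checklist can act as a gate before promotion to production, the sketch below blocks promotion while any item is unverified. The `ChecklistItem` type, the `ready_for_production` helper, and the item names are all hypothetical, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str   # hypothetical item label
    done: bool  # True once the item has been verified


def ready_for_production(items):
    """Return (ready, missing); ready is True only if every item is done."""
    missing = [item.name for item in items if not item.done]
    return (len(missing) == 0, missing)


# Hypothetical checklist entries, for illustration only.
checklist = [
    ChecklistItem("runbook reviewed", True),
    ChecklistItem("alarms configured", True),
    ChecklistItem("access keys rotated", False),
]

ready, missing = ready_for_production(checklist)
if not ready:
    print("Not ready for production; open items:", ", ".join(missing))
```

A deployment script could call such a check and refuse to continue while `missing` is non-empty.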

OPS 2 How are you doing configuration management for your workload?
  Environments, architecture, and the configuration parameters for resources within them, should be documented in a way that allows components to be easily identified for tracking and troubleshooting. Changes to configuration should also be trackable and automated.  
Best practices:
  * Resource Tracking. Plan for ways to identify your resources and their function within the workload (e.g., use metadata, tagging).

  * Documentation. Document your architecture (e.g., infrastructure as code, CMDB, diagrams, release notes).

  * Capture Operational Learnings. Capture operational learnings over time (e.g., wiki, knowledge base, tickets).

  * Immutable Infrastructure. Establish an immutable infrastructure so that you redeploy rather than patch.

  * Automated Change Procedures. Automate your change procedures.

  * Configuration Management Database (CMDB). Track all changes in a CMDB.
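
One way to make resource tracking checkable is to audit an inventory against a required set of tag keys. The sketch below assumes the inventory has already been exported as a plain dictionary; the tag keys in `REQUIRED_TAGS`, the resource names, and both helper functions are hypothetical.

```python
# Hypothetical tagging policy: every resource must carry these tag keys.
REQUIRED_TAGS = {"owner", "environment", "workload"}


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the sorted list of required tag keys absent from a resource."""
    return sorted(required - set(resource_tags))


def untagged_resources(inventory):
    """Map resource id -> missing tag keys, for non-compliant resources only."""
    report = {}
    for resource_id, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


# Hypothetical inventory, for illustration only.
inventory = {
    "web-1": {"owner": "team-a", "environment": "prod", "workload": "shop"},
    "db-1": {"owner": "team-b"},
}
report = untagged_resources(inventory)
print(report)
```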

OPS 3 How are you evolving your workload while minimizing the impact of change?
  Your focus should be on automation, small frequent changes, regular quality assurance testing, and defined mechanisms to track, audit, roll back, and review changes.  
Best practices:
  * Deployment Pipeline. Put a CI/CD pipeline in place (e.g., source code repository, build systems, deployment and testing automation).

  * Release Management Process. Establish a release management process (e.g., manual or automated).

  * Small Incremental Changes. Ensure that you can release small incremental versions of system components.

  * Revertible Changes. Be prepared to revert changes that introduce operational issues (e.g., roll back, feature toggles).

  * Risk Mitigation Strategies. Use risk mitigation strategies such as Blue/Green, Canary, and A/B testing.
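
The canary and revertible-change practices above can be sketched as two small decisions: which share of users sees the new version, and when to revert. This is a minimal sketch under stated assumptions: the function names, the hash-bucket routing scheme, and the `tolerance` default are illustrative choices, not a prescribed method.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically assign a user to the canary using a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < canary_percent


def should_roll_back(canary_error_rate, baseline_error_rate, tolerance=0.01):
    """Revert if the canary error rate exceeds the baseline by more than tolerance."""
    return canary_error_rate > baseline_error_rate + tolerance


# Send roughly 5% of users to the new version, then compare error rates.
serve_new_version = routes_to_canary("user-1234", canary_percent=5)
if should_roll_back(canary_error_rate=0.05, baseline_error_rate=0.01):
    print("Canary is unhealthy; reverting to the previous version")
```

Hashing the user ID keeps each user on the same version across requests, which makes error rates easier to compare between the two groups.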

OPS 4 How do you monitor your workload to ensure it is operating as expected?
  Your system can degrade over time due to internal and/or external factors. By monitoring the behavior of your systems, you can identify these factors of degradation and remediate them.  
Best practices:
  * Monitoring. Use Amazon CloudWatch, third-party, or custom monitoring tools to monitor performance.

  * Aggregate Logs. Aggregate logs from multiple sources (e.g., application logs, AWS service-specific logs, VPC flow logs, CloudTrail).

  * Alarm-Based Notifications. Receive an automatic alert from your monitoring systems if metrics are out of safe bounds.

  * Trigger-Based Actions. Alarms cause automated actions to remediate or escalate issues.
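
To illustrate alarm-based notifications and trigger-based actions, the sketch below evaluates a metric against a threshold over several consecutive datapoints and invokes a callback on alarm. It does not call Amazon CloudWatch; the state names are borrowed from CloudWatch alarm states, but the evaluation logic is a simplification, and `alarm_state`, `evaluate`, and the sample values are hypothetical.

```python
def alarm_state(datapoints, threshold, periods):
    """Return "ALARM" if the last `periods` datapoints all exceed threshold.

    Assumes periods >= 1.
    """
    if len(datapoints) < periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def evaluate(datapoints, threshold, periods, on_alarm):
    """Compute the alarm state and run the on_alarm action if it fires."""
    state = alarm_state(datapoints, threshold, periods)
    if state == "ALARM":
        on_alarm(datapoints[-1])  # e.g., notify on-call or start remediation
    return state


# Hypothetical CPU utilization samples (percent), one per period.
notifications = []
state = evaluate([50, 91, 95, 97], threshold=90, periods=3,
                 on_alarm=notifications.append)
print(state, notifications)
```

Requiring several consecutive breaching datapoints is a common way to avoid alerting on a single transient spike.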

OPS 5 How do you respond to unplanned operational events?
  Be prepared to automate responses to unexpected operational events. This includes not just alerting, but also mitigation, remediation, rollback, and recovery.  
Best practices:
  * Playbook. Have a playbook that you follow (e.g., on call process, workflow chain, escalation process) and update regularly.

  * RCA Process. Have a Root Cause Analysis (RCA) process to ensure that you can resolve, document, and fix issues so they do not happen in the future.

  * Automated Response. Handle unplanned operational events gracefully through automated responses (e.g., Auto Scaling, Support API).
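
A playbook with automated responses can be thought of as a mapping from event type to remediation, with escalation to a person as the fallback. The sketch below is only an illustration of that idea: the event types, the handlers, and the `respond` function are hypothetical, and the handlers return strings instead of calling any real service.

```python
def restart_service(event):
    """Hypothetical remediation: restart the affected resource."""
    return f"restarted {event['resource']}"


def scale_out(event):
    """Hypothetical remediation: add capacity for the affected resource."""
    return f"scaled out {event['resource']}"


# Hypothetical playbook: event type -> automated remediation.
PLAYBOOK = {
    "health_check_failed": restart_service,
    "high_load": scale_out,
}


def respond(event, playbook=PLAYBOOK):
    """Run the automated response if one exists; otherwise escalate."""
    handler = playbook.get(event["type"])
    if handler is None:
        return ("escalated", f"no automated response for {event['type']}")
    return ("remediated", handler(event))


print(respond({"type": "high_load", "resource": "web-tier"}))
print(respond({"type": "disk_corruption", "resource": "db-1"}))
```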

OPS 6 How do you manage escalation when you respond to unplanned operational events?
  Responses to unplanned operational events should follow a pre-defined playbook that includes stakeholders and the escalation process and procedures. Define escalation paths and include both functional and hierarchical escalation capabilities. Hierarchical escalation should be automated, and escalated priority should result in stakeholder notifications.  
Best practices:
  * Appropriately Document and Provision. Put necessary stakeholders and systems in place for receiving alerts when escalations occur.

  * Functional Escalation with Queue-based Approach. Escalate between appropriate functional team queues based on priority, impact, and intake mechanisms.

  * Hierarchical Escalation. Use a demand- or time-based approach. As the impact, scale, or time to resolution/recovery of an incident increases, priority is escalated.

  * External Escalation Path. Include external support, AWS support, AWS Partners, and third-party support engagement in escalation paths.

  * Hierarchical Priority Escalation is Automated. When demand or time thresholds are passed, priority automatically escalates.
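
Time-based hierarchical escalation can be sketched as a table of thresholds that raise an incident's priority the longer it stays open. The thresholds, the numeric priority scheme (1 is most urgent), and the `escalated_priority` function below are assumptions made for illustration; real values would come from your own escalation policy.

```python
from datetime import timedelta

# Hypothetical policy: once an incident has been open this long,
# its priority becomes at least the listed level (1 = most urgent).
ESCALATION_THRESHOLDS = [
    (timedelta(minutes=30), 3),
    (timedelta(hours=2), 2),
    (timedelta(hours=4), 1),
]


def escalated_priority(initial, open_for, thresholds=ESCALATION_THRESHOLDS):
    """Return the priority after time-based escalation; never de-escalates."""
    priority = initial
    for limit, level in thresholds:
        if open_for >= limit and level < priority:
            priority = level
    return priority


# An incident opened at priority 4 and unresolved after 45 minutes.
new_priority = escalated_priority(4, timedelta(minutes=45))
if new_priority < 4:
    print(f"Priority escalated to P{new_priority}; notify stakeholders")
```

Running a check like this on a schedule, and notifying stakeholders whenever the returned priority changes, is one way to automate the escalation described above.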

Source: Information provided on this page is from the AWS Well-Architected Framework document.