AWS re:Invent Notes: GPSTEC302 — GPS: Anti-Patterns: Learning from Failure

I must be getting tired. Honestly, I thought posting my session notes would be a pretty straightforward task, but because I love documentation so much, I’ve made it harder than it needs to be. My note-taking problem is that I try to jot down everything as opposed to what may ultimately matter. Anyway, in this post I’m going to give you my “raw” notes…unedited, unfiltered, and unspellchecked (is that even a word?). I thought this was one of the best sessions I attended, and I hope you will gain at least a little insight into AWS best practices by reading these notes.

  • an anti-pattern can be successful in the short term but turn into a fault later; it can also lead to best practices
    • best practices come from the investigation of anti-patterns
  • best practices are learned and often earned
    • explain the value of a particular best practice
    • we can learn from the behavior of others
    • we don’t invent best practices sitting around and thinking

Anti-Pattern #1: Loss of Control / Poor IAM Access Key Controls

  • APs (anti-patterns) lead to real outages
  • loss of control of an AWS account
    • AWS reference architecture (https://github.com/awslabs/aws-refarch-wordpress)
    • using API you can create multiple well-architected infrastructures / it’s easy
    • can operate AWS API from anywhere, authenticated with IAM
    • we create accounts for humans to administer the account and give them out (broad permission sets, use root account, etc.)
    • security credentials can end up leaving our control because they are given to users
      • can lose control because you don’t know what all of the various accounts do
      • many accounts become persistent when they don’t need to be
      • temporary credentials can be intercepted by a user, who can then use those credentials
      • an intruder can shut down the entire infrastructure; a single IAM user can end up holding every permission
      • with backups, CloudTrail, etc., we can understand the scope of the event but can still be locked out; an intruder can remove access to my backups
    • ways to mitigate this AP
      • create multiple AWS accounts / credentials are scoped to an account
      • don’t run your prod and your backups under the same user
      • account A can write to a backup AWS account but can’t delete from it
      • if account A is deleted, use B to back up to C
      • establish separate administrative domains
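The "can write but can’t delete" idea above can be sketched as an S3 bucket policy on a bucket owned by a separate backup account. This is only an illustration, not what the speaker showed: the bucket name, the account ID, and the helper names `backup_bucket_policy` and `allowed_actions` are all hypothetical. The policy grants the production account `s3:PutObject` and nothing else, so delete requests from that account are not allowed by this policy.

```python
import json


def backup_bucket_policy(bucket_name, writer_account_id):
    """Build a bucket policy that lets another account write objects only.

    Both arguments are hypothetical placeholders supplied by the caller.
    No delete action is granted, so the writer account cannot remove backups
    through this policy.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowProdAccountToWriteBackups",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{writer_account_id}:root"},
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


def allowed_actions(policy):
    """Collect every action that the policy explicitly allows."""
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow":
            action = statement["Action"]
            actions.update([action] if isinstance(action, str) else action)
    return actions


if __name__ == "__main__":
    # Example values only; substitute your own bucket and account ID.
    policy = backup_bucket_policy("example-backup-bucket", "111111111111")
    print(json.dumps(policy, indent=2))
    print(sorted(allowed_actions(policy)))
```

Note that `s3:PutObject` alone still lets the writer overwrite an existing key, so a setup like this would normally be paired with bucket versioning in the backup account.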

Anti-Pattern #2: Control Gaps

  • AWS CloudTrail is awesome
    • get user, IP, etc.
  • AWS Config
    • point-in-time snapshot of AWS inventory / a data center (DC) inventory can take weeks and still give only a mostly accurate view of what’s in the DC
    • What’s wrong with 167.55.180.10/0? (a /0 prefix ignores the address bits, so it gives access to the entire IPv4 internet, not one host)
      • can look for /0 – detect and automate
      • easy to see with compliance automation
    • making S3 buckets public temporarily to get around security, then forgetting to change them back
      • Is it meant to be public or should it not be?
      • can be hard to tell
      • try to find gaps in the automation
  • things on the right-hand side of the slide (Config, CloudTrail, S3, etc.) happen after changes have happened
  • augment AWS Config with AWS managed rules / “managed rules for Config” has a rule to check if an S3 bucket is public
  • consider sending events to SQS and Lambda to investigate changes; if a change is not compliant, Lambda reverts it to the original config
  • Amazon Macie: looks in an S3 bucket and uses machine learning to determine what’s there / helps you understand how data is being accessed and whether it’s being accessed the way you want
  • consider change control (right-side stuff happens after a change)
    • another pair of eyeballs looking at changes is essential
    • use CloudFormation to automate stack
    • can actually automate Change Control / intercept changes on the front end
    • don’t use the AWS Management Console to manage AWS resources / fantastic way to break automation / limit console use to read-only access
  • Utility of auditors / what is the next gap? there will always be a next gap
    • continual audits key to identifying the next gap
    • partner applications to probe resources, make sure rules are enforced, active penetration testing
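The "look for /0" check from the AWS Config notes above is easy to automate. Here is a minimal sketch using Python’s standard `ipaddress` module; the rule records and their keys (`group`, `port`, `cidr`) are made up for illustration and are not an AWS data format. With a /0 prefix the address part is ignored, so `167.55.180.10/0` covers the same range as `0.0.0.0/0`.

```python
import ipaddress


def is_open_to_world(cidr):
    """Return True if the CIDR block has a /0 prefix and so matches every address."""
    # strict=False accepts blocks such as 167.55.180.10/0 that have host bits set.
    network = ipaddress.ip_network(cidr, strict=False)
    return network.prefixlen == 0


def find_open_rules(rules):
    """Return the rules whose CIDR is open to the world.

    Each rule is a dict with hypothetical keys 'group', 'port' and 'cidr'.
    """
    return [rule for rule in rules if is_open_to_world(rule["cidr"])]


# Illustrative data only; real rules would come from your own inventory.
RULES = [
    {"group": "sg-web", "port": 443, "cidr": "0.0.0.0/0"},
    {"group": "sg-admin", "port": 22, "cidr": "167.55.180.10/0"},
    {"group": "sg-db", "port": 5432, "cidr": "10.0.0.0/16"},
]

if __name__ == "__main__":
    for rule in find_open_rules(RULES):
        print(f"{rule['group']} port {rule['port']} is open to {rule['cidr']}")
```

A check like this only finds candidates. Whether a given rule is meant to be public (a web server on 443, say) is still a judgment call, which is the same question the notes raise for S3 buckets.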

Anti-Pattern #3: Automating Outages

  • can easily automate a deployment using CloudFormation/Chef, etc
  • Terraform to build automation
  • start thinking about blue/green deployments to push new code into production
    • but if using CloudFormation, you could delete your database
    • easy to get things out of step in AWS
  • need to start using CloudFormation
    • decouple infrastructure components according to how they are used in production
      • decouple web server from DBs
    • carve out functional components, especially the stateful components
  • AWS Management Console
    • limit interactive access to infrastructure
    • don’t make a change one may forget to undo
    • tags: version number, like an app version number
    • is the change associated with a valid version of my app?
  • make multiple CloudFormation environments
    • test a change to make sure it’ll work in prod
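One way to guard against the "you could delete your database" problem above is to check templates before they are deployed. The sketch below is my own illustration, not something from the session: it walks a CloudFormation template (as a Python dict) and lists stateful resources that have no explicit `DeletionPolicy` of `Retain` or `Snapshot`. `DeletionPolicy` is a real CloudFormation resource attribute, but the list of "stateful" types, the function name, and the sample template are assumptions.

```python
# Resource types treated as stateful in this sketch (an assumption, not a full list).
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance",
    "AWS::DynamoDB::Table",
    "AWS::S3::Bucket",
    "AWS::EC2::Volume",
}

# DeletionPolicy values that keep the data when a stack or resource is deleted.
PROTECTIVE_POLICIES = {"Retain", "Snapshot"}


def stateful_without_explicit_policy(template):
    """Return logical names of stateful resources lacking an explicit protective DeletionPolicy.

    Some resource types have their own defaults; this sketch simply insists
    that the template states the policy explicitly.
    """
    findings = []
    for name, resource in template.get("Resources", {}).items():
        is_stateful = resource.get("Type") in STATEFUL_TYPES
        is_protected = resource.get("DeletionPolicy") in PROTECTIVE_POLICIES
        if is_stateful and not is_protected:
            findings.append(name)
    return sorted(findings)


# Hypothetical template: a stateless web server plus two stateful resources.
TEMPLATE = {
    "Resources": {
        "WebServer": {"Type": "AWS::EC2::Instance"},
        "AppDatabase": {"Type": "AWS::RDS::DBInstance", "DeletionPolicy": "Snapshot"},
        "SessionTable": {"Type": "AWS::DynamoDB::Table"},
    }
}

if __name__ == "__main__":
    for name in stateful_without_explicit_policy(TEMPLATE):
        print(f"{name}: stateful resource without an explicit DeletionPolicy")
```

A check like this fits naturally into the change-control step, and it complements (rather than replaces) keeping the database in its own stack, decoupled from the web tier.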

Anti-Pattern #4: Schrodinger’s Backup

  • “You don’t have backups if you don’t test them.”
  • Schrodinger 1935
    • unless you measure what happens, how do you know what happens?
  • my business is the data
    • a 0KB backup is useless
  • “a backup is just data until you test it”
    • are files getting bigger every day? backups rarely get smaller
    • are EBS snapshots going up? are old ones being deleted?
    • backup failures never happen on unimportant files
    • today, you can write Lambda code to check backups, or the size of the backups
  • automate but monitor the backups
    • EBS snapshots are not great for snapshotting a hot database, use the native tools
    • replication is not a backup
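The size checks mentioned above (0 KB backups, backups that suddenly shrink, backups that stop arriving) are simple enough to sketch in a few lines. This is a hedged illustration of the kind of logic a Lambda function could run, not code from the session: the function name, the thresholds, and the sample data are all made up, and the input is just a list of `(date, size_in_bytes)` pairs that you would gather from your own backup store.

```python
from datetime import date, timedelta


def check_backups(backups, today, max_age_days=1, shrink_tolerance=0.5):
    """Return a list of human-readable problems found in a backup history.

    backups: list of (date, size_in_bytes) tuples in any order.
    max_age_days and shrink_tolerance are illustrative defaults, not recommendations.
    """
    if not backups:
        return ["no backups found"]

    problems = []
    ordered = sorted(backups)

    # The newest backup should be recent.
    latest_day = ordered[-1][0]
    if today - latest_day > timedelta(days=max_age_days):
        problems.append(f"latest backup is stale: {latest_day}")

    # Compare each backup with the last non-empty one before it.
    previous_size = None
    for day, size in ordered:
        if size == 0:
            problems.append(f"{day}: empty (0 KB) backup")
            continue
        if previous_size is not None and size < previous_size * shrink_tolerance:
            problems.append(f"{day}: backup shrank from {previous_size} to {size} bytes")
        previous_size = size
    return problems


if __name__ == "__main__":
    # Made-up history for illustration.
    history = [
        (date(2024, 1, 1), 1000),
        (date(2024, 1, 2), 1100),
        (date(2024, 1, 3), 0),
        (date(2024, 1, 4), 300),
    ]
    for problem in check_backups(history, today=date(2024, 1, 4)):
        print(problem)
```

Checks like these only catch the obvious failures. Per the quote at the top of this section, the real test is still restoring a backup and confirming the data is usable.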

Establishing Best Practices

  • want to learn from others’ failures
  • learn as much as possible prior to putting things in production
  • war gaming – sit down at round table and toss out scenarios and ask “what happens if we lose control of our root account?”
    • do every quarter / paper exercise
  • prioritize based on risk
  • use AWS services / AWS Trusted Advisor
  • use Security Partner Solutions
    • can help you look at your cloudtrail / external penetration testing
  • review the AWS Well-Architected Framework
    • collection of published best practices
    • 56 questions in the whitepaper, read and consider them
    • Perform reviews to build a prioritized list

For more information, check out the Amazon Web Services YouTube channel….
