AWS re:Invent Notes: GPSTEC302 — GPS: Anti-Patterns: Learning from Failure

I must be getting tired. Honestly, I thought posting my session notes would be a pretty straightforward task, but because I love documentation so much, I’ve made it harder than it needs to be. My note-taking problem is that I try to jot down everything rather than just what may ultimately matter. Anyway, in this post I’m going to give you my “raw” notes: unedited, unfiltered, and unspellchecked (is that even a word?). I thought this was one of the best sessions I attended, and I hope you gain at least a little insight into AWS best practices by reading these notes.

anti-pattern: successful in the short term but can turn into a fault; can also lead to best practices
- best practices come from the investigation of anti-patterns
best practices are learned and often earned
- explain the value of a particular best practice
- we can learn from the behavior of others
- we don’t invent best practices sitting around and thinking

Anti-Pattern #1: Loss of Control / Poor IAM Access Key Controls

APs lead to real outages
loss of control of an AWS account
- AWS reference architecture (https://github.com/awslabs/aws-refarch-wordpress)
- using the API you can create multiple well-architected infrastructures / it’s easy
- can operate the AWS API from anywhere, authenticated with IAM
- we create accounts for humans to administer the account and hand them out (broad permission sets, root account use, etc.)
- security can end up leaving our control because credentials are given to users
  - can lose control because you don’t know what all of the various accounts do
  - many accounts become persistent when they don’t need to be
  - temporary credentials can be intercepted by a user, who can then use them
  - an intruder can shut down the entire infrastructure; a single IAM user can hold access to everything
  - even with backups, CloudTrail, etc., we can understand the scope of the event but can still be locked out, and the intruder can remove access to my backups
- ways to mitigate this AP
  - create multiple AWS accounts / credentials are scoped to an account
  - don’t run your prod and your backups under the same user
  - account A can write to another AWS acct. but can’t delete from it
  - if acct. A is deleted, use B to back up to C
    - cost neutral
    - IAM best practices http://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html
  - establish separate administrative domains
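The “account A can write but can’t delete” idea can be sketched as an S3 bucket policy on a bucket that lives in the backup account. This is a minimal sketch, assuming a hypothetical bucket name and account ID; it relies on the IAM rule that an explicit Deny always overrides an Allow.

```python
import json

def backup_bucket_policy(bucket: str, writer_account_id: str) -> dict:
    """Bucket policy for a backup bucket owned by a separate account.

    The writer account may upload backups but is explicitly denied deletes.
    Bucket name and account ID are hypothetical placeholders.
    """
    writer = f"arn:aws:iam::{writer_account_id}:root"  # the writer account as a principal
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWriterToPutBackups",
                "Effect": "Allow",
                "Principal": {"AWS": writer},
                "Action": "s3:PutObject",
                "Resource": objects,
            },
            {
                # An explicit Deny wins over any Allow the writer account grants itself.
                "Sid": "DenyWriterDeletes",
                "Effect": "Deny",
                "Principal": {"AWS": writer},
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": objects,
            },
        ],
    }

if __name__ == "__main__":
    print(json.dumps(backup_bucket_policy("example-backups", "111111111111"), indent=2))
```

The printed JSON could be applied from the backup account with `aws s3api put-bucket-policy --bucket example-backups --policy file://policy.json`. Because `s3:PutObject` can still overwrite an object, turning on versioning for the bucket is worth considering alongside a policy like this.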

Anti-Pattern #2: Control Gaps

AWS CloudTrail is awesome
- get user, IP, etc.
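To show what “get user, IP, etc.” looks like in practice, here is a small sketch that pulls who/where/what/when out of a CloudTrail log file. It uses real CloudTrail record fields (`userIdentity`, `sourceIPAddress`, `eventSource`, `eventName`, `eventTime`); the sample record in the test is made up.

```python
import json

def summarize_event(record: dict) -> dict:
    """Reduce one CloudTrail record to who did what, from where, and when."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "from_ip": record.get("sourceIPAddress"),
        "action": f"{record.get('eventSource')}:{record.get('eventName')}",
        "when": record.get("eventTime"),
    }

def summarize_log(log_text: str) -> list[dict]:
    """A CloudTrail log file is a JSON object with a top-level "Records" list."""
    return [summarize_event(r) for r in json.loads(log_text).get("Records", [])]
```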
AWS Config
- point-in-time snapshot of AWS inventory / a data center inventory can take weeks and still gives only a mostly accurate view of what’s in the DC
- What’s wrong with 167.55.180.10/0? (gives access to about 5% of the internet)
  - can look for /0 – detect and automate
  - easy to see with compliance automation
- making S3 buckets public temporarily to get around security, then forgetting to change them back
  - Is it meant to be public or should it not be?
  - can be hard to tell
  - try to find gaps in the automation
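The “look for /0, detect and automate” point can be sketched with Python’s `ipaddress` module. Note that `ip_network("167.55.180.10/0", strict=False)` normalizes to `0.0.0.0/0`, which covers all 2**32 IPv4 addresses. The input is assumed to be shaped like the `IpPermissions` list EC2 returns for a security group; the /16 threshold is an arbitrary example, and IPv6 ranges are not checked.

```python
import ipaddress

def broad_ingress_rules(ip_permissions: list[dict], min_prefixlen: int = 16) -> list[dict]:
    """Flag IPv4 ingress ranges broader than /min_prefixlen (threshold is illustrative)."""
    findings = []
    for perm in ip_permissions:
        for rng in perm.get("IpRanges", []):
            # strict=False tolerates host bits, e.g. "167.55.180.10/0" -> 0.0.0.0/0
            net = ipaddress.ip_network(rng["CidrIp"], strict=False)
            if net.prefixlen < min_prefixlen:
                findings.append({
                    "cidr": rng["CidrIp"],
                    "normalized": str(net),
                    "share_of_ipv4": net.num_addresses / 2**32,
                    "ports": (perm.get("FromPort"), perm.get("ToPort")),
                })
    return findings
```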

things on the right-hand side happen after changes have happened (Config, CloudTrail, S3, etc.)
augment AWS Config with AWS managed rules: the “managed rules for Config” include a rule that checks whether an S3 bucket is public
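The managed Config rules (for example `s3-bucket-public-read-prohibited`) do this check for you; the sketch below only illustrates the idea for bucket ACLs. It assumes grants shaped like the S3 `GetBucketAcl` response and looks for the two global group URIs that make a bucket public. Bucket policies can also make a bucket public and are not covered here.

```python
# Grantee URIs S3 uses for "everyone" and "any authenticated AWS user".
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def public_permissions(grants: list[dict]) -> list[str]:
    """Return the permissions (READ, WRITE, ...) granted to public groups in an ACL."""
    return [
        g["Permission"]
        for g in grants
        if g.get("Grantee", {}).get("Type") == "Group"
        and g["Grantee"].get("URI") in PUBLIC_GROUP_URIS
    ]
```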
consider sending events to SQS and Lambda to investigate changes; if a change isn’t compliant, Lambda reverts it to the original config
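A minimal sketch of the SQS-to-Lambda remediation loop, assuming each SQS message body is JSON describing one change. The body fields (`resource_id`, `public`) and the `is_compliant`/`revert` callables are hypothetical; a real `revert` would call the relevant AWS API. Only the `Records`/`body` shape of the SQS event is standard.

```python
import json
from typing import Callable

def make_handler(is_compliant: Callable[[dict], bool],
                 revert: Callable[[dict], None]):
    """Build a Lambda-style handler for SQS events (the change schema is hypothetical)."""
    def handler(event: dict, context=None) -> dict:
        reverted = []
        for record in event.get("Records", []):   # SQS delivers a batch of records
            change = json.loads(record["body"])   # hypothetical JSON change description
            if not is_compliant(change):
                revert(change)                    # put the resource back to its known-good config
                reverted.append(change.get("resource_id"))
        return {"reverted": reverted}
    return handler
```

Injecting `is_compliant` and `revert` keeps the decision logic testable without touching AWS.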
Amazon Macie: looks in S3 buckets and uses machine learning to determine what’s there / helps you understand how data is being accessed and whether it’s being accessed the way you want
consider change control (right-side stuff happens after a change)
- another pair of eyeballs looking at changes is essential
- use CloudFormation to automate stacks
- can actually automate change control / intercept changes on the front end
- don’t use the AWS Management Console to manage AWS resources / fantastic way to break automation / don’t use console for read-only access
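The “another pair of eyeballs” rule can be enforced in the pipeline rather than by habit. This sketch assumes a hypothetical `ChangeRequest` record kept by your own tooling and checked before something like `aws cloudformation execute-change-set` runs; the author’s own approval never counts.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """Hypothetical change record tracked by your deployment tooling."""
    author: str
    stack_name: str
    approvals: set[str] = field(default_factory=set)

def may_execute(change: ChangeRequest, required_approvals: int = 1) -> bool:
    """Allow execution only with enough approvals from people other than the author."""
    reviewers = change.approvals - {change.author}
    return len(reviewers) >= required_approvals
```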
Utility of auditors / what is the next gap? There will always be a next gap
- continual audits key to identifying the next gap
- partner applications to probe resources, make sure rules are enforced, active penetration testing

Anti-Pattern #3: Automating Outages

can easily automate a deployment using CloudFormation, Chef, etc.
Terraform to build automation
start thinking about blue/green deployments to push new code into production
- but if using CloudFormation, you could delete your database
- easy to get things out of step in AWS
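One way to avoid “CloudFormation deleted my database” is to review a change set before executing it. This sketch assumes the `Changes` list from a CloudFormation `DescribeChangeSet` response and flags removals or replacements of stateful resource types; the set of types below is an illustrative assumption, not a complete list.

```python
# Illustrative stateful resource types; extend for your own stacks.
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance",
    "AWS::RDS::DBCluster",
    "AWS::DynamoDB::Table",
    "AWS::S3::Bucket",
    "AWS::EC2::Volume",
}

def destructive_changes(changes: list[dict]) -> list[str]:
    """Logical IDs of stateful resources the change set would remove or replace."""
    risky = []
    for change in changes:
        rc = change.get("ResourceChange", {})
        if rc.get("ResourceType") not in STATEFUL_TYPES:
            continue
        # Replacement is "True", "False", or "Conditional"; a replaced DB is a new, empty DB.
        if rc.get("Action") == "Remove" or rc.get("Replacement") in ("True", "Conditional"):
            risky.append(rc.get("LogicalResourceId"))
    return risky
```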
need to start using CloudFormation
- decouple the infrastructure according to how each piece is used in production
  - decouple web server from DBs
- carve out functional components, especially the stateful components
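Carving out the stateful components can be pictured as splitting one template’s resources into a stateful stack and a stateless stack. This is a rough sketch with an assumed list of stateful types; it ignores cross-stack wiring (Outputs/Exports and `Fn::ImportValue`), which a real split would need.

```python
# Assumed stateful resource types for the example.
STATEFUL_TYPES = {"AWS::RDS::DBInstance", "AWS::RDS::DBCluster", "AWS::DynamoDB::Table", "AWS::S3::Bucket"}

def split_template(template: dict, stateful_types: set[str] = STATEFUL_TYPES) -> tuple[dict, dict]:
    """Split a template's Resources into (stateful, stateless) template fragments."""
    stateful, stateless = {}, {}
    for logical_id, resource in template.get("Resources", {}).items():
        target = stateful if resource.get("Type") in stateful_types else stateless
        target[logical_id] = resource
    return {"Resources": stateful}, {"Resources": stateless}
```

Keeping the database in its own stack means routine web-tier redeployments never touch it.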
AWS Management Console
- limit interactive access to infrastructure
- don’t make a change you may forget to undo
- tags: a version number, like an app version number
- is the change associated with a valid version of my app?
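A sketch of the version-tag check, assuming tags in the usual AWS `[{"Key": ..., "Value": ...}]` shape and a hypothetical `AppVersion` tag key; `valid_versions` would come from your own release records.

```python
def version_tag_problem(tags: list[dict], valid_versions: set[str], key: str = "AppVersion"):
    """Return a description of the problem, or None if the resource maps to a valid release."""
    values = {t["Key"]: t["Value"] for t in tags}
    version = values.get(key)
    if version is None:
        return f"missing {key} tag"
    if version not in valid_versions:
        return f"{key}={version} is not a released version"
    return None
```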
make multiple CloudFormation environments
- make the change in test first to make sure it’ll work in prod

Anti-Pattern #4: Schrödinger’s Backup

“You don’t have backups if you don’t test them.”
Schrödinger 1935
- unless you measure what happens, how do you know what happens?
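“Measuring” a backup means restoring it somewhere and checking what came back. A minimal sketch, assuming you record a SHA-256 checksum when the backup is taken and restore into a scratch location; `restore_is_good` is a hypothetical helper, and an empty restore always fails because a 0KB backup is useless.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large restores don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_is_good(restored: Path, expected_sha256: str) -> bool:
    """A restore passes only if the file exists, is non-empty, and matches the recorded checksum."""
    if not restored.is_file() or restored.stat().st_size == 0:
        return False
    return sha256_of(restored) == expected_sha256
```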
my business is the data
- a 0KB backup is useless
“a backup is just data until you test it”
- are files getting bigger every day? backups rarely get smaller
- are EBS snapshots going up? are old ones being deleted?
- backup failures never happen on unimportant files
- today, you can write Lambda code to check backups, or the size of the backups
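The kind of check you might run from a scheduled Lambda can be sketched as plain logic over a list of backup records. The `Backup` record and all thresholds (one day, 50% shrink, 30 retained copies) are hypothetical; fill the records from whatever inventory of backups you keep.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Backup:
    """Hypothetical backup record: when it was taken and how big it is."""
    created: datetime
    size_bytes: int

def backup_alerts(backups: list[Backup], now: datetime,
                  max_age: timedelta = timedelta(days=1),
                  max_shrink: float = 0.5, max_count: int = 30) -> list[str]:
    """Return human-readable alerts; an empty list means nothing looked wrong."""
    if not backups:
        return ["no backups found"]
    ordered = sorted(backups, key=lambda b: b.created)
    latest, alerts = ordered[-1], []
    if latest.size_bytes == 0:
        alerts.append("latest backup is 0 bytes")
    if now - latest.created > max_age:
        alerts.append("latest backup is older than max_age")
    # Backups rarely get smaller, so a sharp drop is suspicious.
    if len(ordered) >= 2 and latest.size_bytes < ordered[-2].size_bytes * max_shrink:
        alerts.append("latest backup shrank sharply")
    # Too many copies suggests old ones are not being deleted.
    if len(ordered) > max_count:
        alerts.append("old backups are not being deleted")
    return alerts
```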
automate but monitor the backups
- EBS snapshots are not great for snapshotting a hot database, use the native tools
- replication is not a backup

Establishing Best Practices

want to learn from others’ failures
learn as much as possible prior to putting things in production
war gaming – sit down at a round table, toss out scenarios, and ask “what happens if we lose control of our root account?”
- do every quarter / paper exercise
prioritize based on risk
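To turn war-game scenarios into a prioritized list, one common scheme (swapped in here, not something the session prescribed) scores each scenario as likelihood × impact on 1–5 scales. The scenarios and scores in the test are made-up examples.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    likelihood: int  # 1 (rare) .. 5 (expected), assumed scale
    impact: int      # 1 (minor) .. 5 (business-ending), assumed scale

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact

def prioritize(scenarios: list[Scenario]) -> list[Scenario]:
    """Highest risk first; Python's sort is stable, so ties keep their input order."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)
```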
use AWS services / AWS Trusted Advisor
use Security Partner Solutions
- can help you look at your CloudTrail logs / external penetration testing
review the AWS Well-Architected Framework
- collection of published best practices
- 56 questions in the whitepaper, read and consider them
- perform reviews to build a prioritized list

For more information, check out the Amazon Web Services YouTube channel.

Published by David Ball
