If you’re like most engineering leaders, you have this nagging feeling that too many developers have access to production. You’re moving towards least privilege, but you know it’s only useful as a north star to guide your efforts, not some end state you can actually achieve. Your problem is that it’s not obvious how much progress you should have made by now, and that ambiguity is causing anxiety.
Just as writing down sources of stress can be a useful way to help clear your mind and reduce anxiety, a risk-oriented framework can help you get the risk hotspots out of your head and onto paper, resulting in a priority sequence of risk-mitigation efforts. I’d like to share a lightweight approach for using asset-based risk assessments to help with early efforts to lock down AWS IAM.
The tool I propose is a pared-down version of a template used for an ISO 27001 risk assessment. There is a reason large companies and government agencies use risk assessments: they help bring order to chaos, enabling clear understanding of tradeoffs in order to make sound decisions for risk mitigation. Using a similar risk-based approach could help provide you confidence that you and your team are working on the most important things to protect your own company.
A common first step we’ve seen many of our customers take is to embark on a comprehensive “RBAC” initiative, beginning with engineer surveys to map out access needs. It feels intuitive to design your roles up front before addressing risky access, but taking a role-first approach puts the cart before the horse. You can easily spend too much time perfecting role design for lower-risk areas of the business while high-risk vulnerabilities remain. Risk should be the primary mechanism for prioritizing your security efforts.
You might think that formal risk assessments are the domain of large bureaucratic companies or compliance-induced security theater. But the steps I propose in this article can help you genuinely improve your security posture by using the best parts of asset-based risk assessments while leaving out much of the cruft. The process starts with filling out a simple spreadsheet with the following seven columns:
Let’s break down the information you will need to collect to complete this exercise column-by-column.
Here you define the high-level components in your AWS accounts that, if accessed by bad actors, would create risk for your company. A single resource may represent multiple AWS EC2 instances (e.g. Web Application) or a cluster of database servers (e.g. Production Database). Ideally you’ve created these resources with an infrastructure-as-code tool such as Terraform, in which case you can run something like terraform state list to get a full list and copy the entries that make sense into the spreadsheet.
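As a rough illustration of that last step, here is a minimal Python sketch that groups the output of terraform state list by resource type, so that many individual instances collapse into a handful of candidate spreadsheet rows. The function name and the sample addresses are hypothetical, and the parsing assumes simple addresses (module names or index keys that contain dots are not handled).

```python
from collections import defaultdict


def group_state_addresses(lines):
    """Group `terraform state list` addresses by resource type.

    Assumes simple addresses such as `aws_instance.web[0]` or
    `module.db.aws_rds_cluster.production`.
    """
    groups = defaultdict(list)
    for raw in lines:
        address = raw.strip()
        if not address:
            continue
        parts = address.split(".")
        # Skip any leading "module.<name>" pairs.
        i = 0
        while i + 1 < len(parts) and parts[i] == "module":
            i += 2
        rest = parts[i:]
        # Data sources are lookups, not resources you manage.
        if len(rest) < 2 or rest[0] == "data":
            continue
        groups[rest[0]].append(address)
    return dict(groups)


# Hypothetical output, as if captured from `terraform state list`.
sample = """
aws_instance.web[0]
aws_instance.web[1]
module.db.aws_rds_cluster.production
data.aws_caller_identity.current
aws_s3_bucket.access_logs
""".splitlines()

inventory = group_state_addresses(sample)
for resource_type, addresses in sorted(inventory.items()):
    print(f"{resource_type}: {len(addresses)} address(es)")
```

From there, deciding that the two aws_instance entries together form a single “Web Application” row is still a judgment call you make by hand.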
The cardinality and variability of available IAM Actions for a given service is very high, so for the purposes of evaluating risk we’ll use basic CRUD-level actions (READ or WRITE). Later, we may need to create or edit IAM Policies, and we can use the AWS docs or a tool like Policy Sentry to map CRUD levels to the actual permissions we’re interested in.
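To make the READ/WRITE simplification concrete, here is a small Python sketch that collapses the access levels AWS documents for each action (List, Read, Write, Permissions management, Tagging) into the two levels used in the spreadsheet. Treating Permissions management and Tagging as WRITE is an assumption made for this exercise, and the sample actions are illustrative; check the AWS documentation for the actions you actually care about.

```python
# AWS access levels folded into the two spreadsheet levels.
# Assumption: anything that changes state, tags, or permissions is WRITE.
READ_LEVELS = frozenset({"List", "Read"})
WRITE_LEVELS = frozenset({"Write", "Permissions management", "Tagging"})


def crud_level(access_level: str) -> str:
    """Map one AWS access level to READ or WRITE."""
    if access_level in READ_LEVELS:
        return "READ"
    if access_level in WRITE_LEVELS:
        return "WRITE"
    raise ValueError(f"unknown access level: {access_level!r}")


def crud_levels_for(actions: dict) -> dict:
    """Group IAM action names under READ and WRITE."""
    grouped = {"READ": [], "WRITE": []}
    for action, level in sorted(actions.items()):
        grouped[crud_level(level)].append(action)
    return grouped


# Illustrative sample of S3 actions and their access levels.
SAMPLE_ACTIONS = {
    "s3:GetObject": "Read",
    "s3:ListBucket": "List",
    "s3:PutObject": "Write",
    "s3:PutBucketPolicy": "Permissions management",
}

grouped = crud_levels_for(SAMPLE_ACTIONS)
print(grouped)
```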
The ultimate goal of this exercise is to assess the risk of unauthorized access to each resource by determining a relative Risk value. We’ll start by breaking the Risk value into two components (Impact and Likelihood), providing estimates for each and then multiplying them together. We’ll use a simple 1-5 scale to indicate relative values for each component.
Impact represents the business impact of unauthorized access. As an example, if a database containing customer data was compromised by an attacker, the business impact would be very high compared to the unauthorized access of other resources. As such, a production database should carry the highest Impact value. On the other end of the spectrum, losing web access logs could be assigned a low Impact value assuming they were properly redacted.
Here we’ll rate (from 1-5) how likely unauthorized access to the specified resource would be. Remember these are relative values. A five doesn’t mean there is a 100% chance you’ll be compromised today, but it should tell you that this resource has the highest chance of getting breached compared to other resources.
Likelihood is subjective and difficult to determine. A full overview of common methodologies is beyond the scope of this article. In coming up with a likelihood estimate, you should consider things like: what the current policies and permission sets are, how many people have access to them, what type of access they have, whether they are using MFA, etc. You should also consider common attack vectors and how they apply to your current access patterns. For example, if your engineers are using static credentials, you should be aware that a high number of breaches are caused by attackers obtaining static credentials. If even an estimate feels overwhelming, there are also tools like k9 security available that can help analyze sprawling IAM policies to zero in on hot spots.
By multiplying Impact by Likelihood, you’ll end up with a Risk value ranging from 1 to 25. This gives you a relative ranking of risk associated with your AWS resources, and you can begin to make decisions on what to address. For now only the relative ordering is useful, but over time the absolute values will take on more meaning. For example, you may decide one quarter that you want to mitigate all Risk values above 15.
RISK = IMPACT x LIKELIHOOD
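If you would rather script the arithmetic than keep it in a spreadsheet, a minimal Python sketch might look like the following. The rows and their Impact and Likelihood values are made up for illustration, and the threshold of 15 simply mirrors the example above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    resource: str
    access: str       # READ or WRITE
    impact: int       # 1-5, relative business impact
    likelihood: int   # 1-5, relative likelihood

    def __post_init__(self):
        for name in ("impact", "likelihood"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def risk(self) -> int:
        # RISK = IMPACT x LIKELIHOOD, so the result is between 1 and 25.
        return self.impact * self.likelihood


def ranked(rows):
    """Return rows ordered from highest to lowest risk."""
    return sorted(rows, key=lambda r: r.risk, reverse=True)


def above(rows, threshold):
    """Return the ranked rows whose risk exceeds the threshold."""
    return [r for r in ranked(rows) if r.risk > threshold]


# Hypothetical values for illustration only.
rows = [
    Row("Web access logs", "READ", impact=1, likelihood=2),
    Row("Production Database", "WRITE", impact=5, likelihood=4),
    Row("Web Application", "WRITE", impact=4, likelihood=3),
    Row("Production Database", "READ", impact=5, likelihood=3),
]

for row in ranked(rows):
    print(f"{row.risk:>2}  {row.resource} ({row.access})")

to_mitigate = above(rows, 15)
```

With these sample numbers, only write access to the production database (risk 20) exceeds the threshold; read access lands exactly on 15 and so falls just below the cut.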
Hopefully it hasn’t taken long to get here, and you’ve already produced something genuinely valuable – especially if you’ve involved stakeholders across the company. Yes, all of this risk information was in your head before you filled out this spreadsheet, but applying this structure helped organize your thoughts and get other people aligned, and getting alignment is half the battle to actually addressing the issues. Conversations about getting risk mitigation efforts on the roadmap should be much easier now.
The last field in our sheet is where you define the Action to take to mitigate your risks. But you don’t NEED to address everything. One of the primary benefits of this process is that you’ve created an easy way to decide explicitly which risks not to address. For lower-risk issues that have been spinning in the back of your head, hopefully this will help you just let them go. Even in mature companies, risk assessments include risks that are explicitly accepted. The common set of choices in many formal risk assessments includes Mitigate, Accept, Transfer and Avoid, where “Accept” is for risks that the company is willing to live with.
For resources with unacceptably high risk levels, several options to mitigate those risks should be available. A comprehensive survey of AWS risk mitigations is also beyond the scope of this article. We can, however, take a look at a couple common approaches often taken by teams early in their maturity curve: Isolating Risky IAM Permissions, and providing engineers with just-in-time temporary access.
One approach is to start with isolating risky IAM Permissions. This can be done in several ways. AWS recommends leveraging account boundaries to segment your cloud infrastructure based on different attributes including risk levels, but migrating resources to different accounts might be a large, painful effort. It’s probably easier to begin by migrating risky permissions to new, dedicated IAM Roles. For example, if you created a role that contained the permissions required for production access, and then removed those permissions from other roles, you could then create a User Group containing only the Users who are allowed to assume that role. You can also easily log any time anyone assumes this role, and set up corresponding alerts.
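As a sketch of what that setup involves, the Python below builds the two policy documents: an identity policy to attach to the User Group, and a trust policy for the dedicated role. The account ID, role name, and function names are placeholders, and the optional MFA condition is an extra assumption rather than something this approach requires.

```python
import json


def group_assume_role_policy(role_arn: str) -> dict:
    """Identity policy for the User Group: members may assume one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAssumeDedicatedRole",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": role_arn,
            }
        ],
    }


def role_trust_policy(account_id: str, require_mfa: bool = True) -> dict:
    """Trust policy for the dedicated role.

    Trusting the account root delegates the decision to identity
    policies in that account, such as the group policy above.
    """
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        # Assumption: callers must have authenticated with MFA.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


# Placeholder account ID and role name.
ACCOUNT_ID = "123456789012"
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/production-access"

group_policy = group_assume_role_policy(ROLE_ARN)
trust_policy = role_trust_policy(ACCOUNT_ID)
print(json.dumps(group_policy, indent=2))
print(json.dumps(trust_policy, indent=2))
```

The harder half of the work, removing the same permissions from every other role, is not shown here.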
The step of removing all other access to those permissions can end up being the hard part. Again, this is an opportunity to use tools. For example, given a role, PMapper can list the IAM entities that have access to that role (though it doesn’t work with Identity Center, formerly known as AWS SSO).
In some cases you can also use resource policies to add another layer of Deny statements that block access from everywhere but your dedicated roles.
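For an S3 bucket, one way to express that kind of Deny is a bucket policy that rejects every principal whose ARN is not on an allow list. The Python sketch below builds such a policy; the bucket name and role ARNs are hypothetical, and a blanket Deny like this can lock out administrators too, so you would likely want a break-glass role on the list before applying anything similar.

```python
import json


def deny_all_but_roles(bucket_name: str, allowed_role_arns: list) -> dict:
    """Bucket policy that denies S3 access to everyone except the listed roles."""
    if not allowed_role_arns:
        raise ValueError("refusing to build a policy that denies everyone")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllButDedicatedRoles",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # The Deny applies only when the caller's ARN matches
                # none of the allowed role ARNs.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": list(allowed_role_arns)}
                },
            }
        ],
    }


# Hypothetical bucket and roles.
bucket_policy = deny_all_but_roles(
    "example-customer-exports",
    [
        "arn:aws:iam::123456789012:role/production-access",
        "arn:aws:iam::123456789012:role/break-glass-admin",
    ],
)
print(json.dumps(bucket_policy, indent=2))
```

Note that a Deny only removes access. The dedicated roles still need their own Allow statements to reach the bucket.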
In the case where you’ve isolated risky permissions to a dedicated role, but you end up including most of the engineers in the user group that can assume that role, you might realize that you didn’t meaningfully improve your risk value. In most cases isolating those permissions should allow you to reduce the number of people with default access to those permissions, but that’s not always the case.
As your organization matures you’ll need to make decisions about sets of permissions where both things are true: the business impact of unauthorized access is high AND engineers need access to those permissions to accomplish tasks. The latter requires broad access grants, which carry a high Likelihood and consequently a high overall Risk value.
Ideally you’d be able to automate your way out of the need for access. If a developer can invoke a Lambda to accomplish the thing they needed access for (and if the script is confined to operations that carry low risk), then access can be removed and your Likelihood score can be reduced.
But automation can take a while to build, and some of your team’s access needs are not easily scriptable. In that case you can use temporary access, meaning access is turned on for an engineer only for as long as they need to get their job done.
Many organizations have built internal systems to support temporary access. Capabilities of such systems often include:
Provided you have the proper automation in place, approval-based temporary access provides a very good option for mitigating risk without slowing your engineers down. This is, in fact, why we built Sym!
Security teams at large companies have spent years incorporating risk into day-to-day decisions. For many security professionals, it’s become muscle memory.
If you are an engineer managing IAM, it’s useful to get your own feel for making risk-based decisions sooner rather than later. Early efforts to clean up access are well served by a risk assessment similar to what I’ve described in this article. It will help get teammates into a risk-oriented mindset, which could enable the team to share the mental burden you’ve been carrying.
Credit to Ghozy Muhtarom for his icons on the Noun Project.