Recently AWS made a new feature available in CloudWatch: AI Operations.
What does it do? In its own words:
Amazon Q Developer now includes a new generative-AI investigations experience that helps you troubleshoot operational issues by automating information gathering, analyzing observability data, and providing tailored recommendations.
With this tool you can build a trail, or notebook, of observations and have AWS automatically find new (hopefully relevant) observations and even suggest causes and actions.
An Example
We have a service that consumes irregular workloads. By irregular I mean they can arrive at any time, and in a range of sizes. It runs in Elastic Beanstalk and became “unhealthy”, i.e. the EnvironmentHealth metric went above a threshold into the “warning” and then “degraded” states. This indicator is rather vague, and it’s often hard to find the cause just by looking at the logs, as it can be the instance itself that is in trouble.
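For reference, the EnvironmentHealth metric is numeric, and the AWS enhanced health reporting documentation maps its values to named states. The following is a minimal sketch of that mapping and of finding the first datapoint that crosses into “warning”. The helper names and the sample datapoints are hypothetical and are not taken from the incident described here. It also assumes the Maximum statistic is used, since an averaged value may fall between two states.

```python
# Numeric values of the Elastic Beanstalk EnvironmentHealth metric,
# as listed in the AWS enhanced health reporting documentation.
HEALTH_STATES = {
    0: "Ok",
    1: "Info",
    5: "Unknown",
    10: "No data",
    15: "Warning",
    20: "Degraded",
    25: "Severe",
}


def describe_health(value):
    """Return the state name for an EnvironmentHealth datapoint."""
    return HEALTH_STATES.get(int(value), f"Unrecognised ({value})")


def first_crossing(datapoints, threshold=15):
    """Return the first (timestamp, value) pair at or above threshold, or None."""
    for timestamp, value in datapoints:
        if value >= threshold:
            return timestamp, value
    return None


# Hypothetical datapoints: (minute offset, metric maximum).
samples = [(0, 0), (1, 0), (2, 15), (3, 20), (4, 20)]

crossing = first_crossing(samples)
if crossing is not None:
    print(crossing[0], describe_health(crossing[1]))
```

An alarm threshold of 15 on this metric corresponds to the “warning” state, and 20 to “degraded”.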
Let’s pursue this with AI Investigations. If you click Investigate you can start a new Investigation (or update an existing one) and give it a useful description:
Upon creation you will notice that it has marked the time of interest automatically:
Now AWS will start looking for other relevant metrics, logs, etc. to attach to your investigation, but you can always add more to an ongoing one. Let’s say I noticed something unusual about memory use:
Meanwhile AWS will be analysing your observations:
Eventually it will come up with a Hypothesis:
Now this isn’t much of a hypothesis, it has to be said. It’s really just restating the observations. It has, however, given a focus for further investigation. Sometimes it will have suggested actions (changes you can make in your environment), but in this case there are none.
So, I’ll add heap memory for the service as an observation:
Shortly after adding this, a new hypothesis was generated, that a spike in activity led to pressure on EBS:
This is interesting, as I was aware that some workloads can be very memory and CPU intensive just from my experience with the service, but this has pointed to specific problems with our infrastructure configuration.
In fact, this time it has some suggested actions:
I will curtail our explorations here.
So Did This Help Me?
Did it fix the problem neatly or automatically? No. But it did point me towards avenues I wouldn’t have considered without quite a bit of research. I wasn’t overly familiar with microbursting but now I have something to deep dive into myself. It’s an interesting learning tool in its own right.
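To illustrate why microbursting is easy to miss, here is a small sketch with entirely made-up numbers: a volume limited to 3000 IOPS and one minute of per-second demand containing a five-second burst. The limit, the demand figures, and the function names are assumptions for illustration, not measurements from our environment. The point is only that a one-minute average can sit well below the limit while individual seconds exceed it.

```python
# Hypothetical figures for illustration only.
IOPS_LIMIT = 3000
# 55 quiet seconds followed by a 5-second burst of demand.
per_second_demand = [200] * 55 + [9000] * 5


def minute_average(samples):
    """Average rate over the window, as a one-minute metric would report it."""
    return sum(samples) / len(samples)


def seconds_over_limit(samples, limit):
    """Count the seconds in which demand exceeded the limit."""
    return sum(1 for s in samples if s > limit)


average = minute_average(per_second_demand)
bursts = seconds_over_limit(per_second_demand, IOPS_LIMIT)

print(f"one-minute average: {average:.0f} IOPS (limit {IOPS_LIMIT})")
print(f"seconds over the limit: {bursts}")
```

In this sketch the averaged view looks healthy, yet the workload would have been held back for several seconds, which is the kind of pattern worth checking for with finer-grained data.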
Less critically, but worth mentioning, AI Investigations preserves a kind of notebook that is handy for keeping track of investigations, even if it doesn’t come up with any answers.
The Future
The AWS offering here is marked as “preview” and will naturally develop further. You may wish to check out this post about building your own system for timely automated root cause analysis.