Implementing System Observability on AWS: An Essential for Production Environments
- Bartosz Ch.
- Apr 1
- 6 min read
Building commercial software is not trivial. Requirements are often complex yet unclear, and people may lack domain knowledge or have limited technical skills. On top of that, everyone must communicate effectively to achieve a common goal. But even if we assume none of these problems occur, we will still, most likely, build software that is not free of bugs.
We can invest heavily in different kinds of testing activities before releasing software to our customers. However, there will always be people using our system in ways we couldn’t have foreseen. Not only that, but sometimes the production environment is too expensive to replicate in a testing environment, which makes proper testing impossible.
Another thing to consider is that our software will run on hardware that may fail at some point, in one way or another: the network might become temporarily unstable, or a hard drive might fail permanently.
Given all that, we should expect problems to happen in the production environment. To act quickly and effectively, we must have good awareness of what’s going on in our system.
Observability
The rise of Microservice Architecture has introduced a lot of complexity to modern software platforms, making standard system monitoring techniques insufficient. Monitoring has evolved into Observability, which is the extent to which you can understand what the system is doing based on external outputs.
The three main building blocks of Observability are:
- Log aggregation - collecting text information across multiple microservices.
- Metrics aggregation - collecting raw numbers from microservices and infrastructure to help detect problems, drive capacity planning, and scale the system.
- Distributed tracing - tracking the flow of calls across multiple microservice boundaries to determine what went wrong and derive accurate latency information.

Let’s see how Apptimia used observability techniques and tools to build a software platform for one of our customers on AWS.
Case Study
The simplified system architecture is presented in the diagram below:

The Angular frontend communicates with AWS API Gateway, which handles HTTP traffic and forwards requests to a Lambda REST API. The Lambda authorizes requests using AWS Cognito, stores and loads data from PostgreSQL on RDS, and sends messages to an SQS Queue and SNS Topic X. There are also other Lambda and EC2 Workers, as well as AWS S3, involved in system transactions at later stages.
So how can we implement observability in this system? Let’s start with log aggregation.
Log Aggregation
We use the most straightforward choice for log aggregation on AWS: Amazon CloudWatch. With the Node.js runtime on AWS Lambda, you can simply call console.log to write a message to CloudWatch Logs, without any extra setup or noticeable performance overhead.
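A minimal sketch of what that looks like in a TypeScript Lambda handler; the handler and the logged message are illustrative, not taken from the real system:

```typescript
// Minimal Node.js (TypeScript) Lambda handler: anything written via console.log
// ends up in the function's CloudWatch log stream automatically.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  // Illustrative message; the method and path come from the incoming request.
  console.log(`Handling ${event.httpMethod} ${event.path}`);

  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};
```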
Every message is written to a log stream tied to a particular Lambda function execution environment. Log streams are grouped into log groups, which let you search logs across all execution environments of a function.
For more complex queries that span different Lambda functions, we use CloudWatch Logs Insights.
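As an illustration, a query can also be started programmatically with the AWS SDK for JavaScript v3; the log group names and the query below are made-up examples, not the ones from this project:

```typescript
// Sketch: run a CloudWatch Logs Insights query across two Lambda log groups.
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

const { queryId } = await logs.send(
  new StartQueryCommand({
    logGroupNames: ['/aws/lambda/orders-api', '/aws/lambda/orders-worker'], // placeholders
    startTime: Math.floor(Date.now() / 1000) - 3600, // last hour, epoch seconds
    endTime: Math.floor(Date.now() / 1000),
    queryString:
      'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50',
  })
);

// Insights queries are asynchronous; real code should poll until the query completes.
const results = await logs.send(new GetQueryResultsCommand({ queryId: queryId! }));
console.log(results.status, results.results);
```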

This setup is sufficient for our needs, but it has some limitations, such as the number of log groups a single query can span and limited cross-account support. You may also prefer to use tools you already know, like the ELK stack. In either case, you can use log group subscription filters or Lambda extensions to redirect logs from CloudWatch to other aggregation services.
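For instance, a subscription filter that streams a log group to another destination (here a hypothetical Kinesis stream feeding an ELK cluster) could be set up roughly like this; all names and ARNs are placeholders:

```typescript
// Sketch: forward a Lambda log group to a Kinesis stream via a
// CloudWatch Logs subscription filter.
import {
  CloudWatchLogsClient,
  PutSubscriptionFilterCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

await logs.send(
  new PutSubscriptionFilterCommand({
    logGroupName: '/aws/lambda/orders-api', // placeholder log group
    filterName: 'forward-to-elk',
    filterPattern: '', // empty pattern = forward all log events
    destinationArn: 'arn:aws:kinesis:eu-west-1:123456789012:stream/logs-to-elk',
    roleArn: 'arn:aws:iam::123456789012:role/cwl-to-kinesis', // role CloudWatch Logs assumes
  })
);
```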
Structured Logging and Log Sampling
Using console.log to log a message quickly is convenient, but usually, you want something more sophisticated. We use the Powertools for AWS Lambda NPM package for structured logging, which makes log output more consistent, lets us inject additional context into log messages, and provides different log levels.
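A rough sketch of how the Logger is used; the service name and the fields are illustrative:

```typescript
// Sketch: structured JSON logging with Powertools for AWS Lambda (TypeScript).
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({ serviceName: 'orders-api' }); // illustrative service name

export const handler = async (event: { orderId?: string }) => {
  // Extra context is emitted as structured JSON fields, not string concatenation.
  logger.info('Processing order', { orderId: event.orderId });

  try {
    // ... business logic ...
  } catch (error) {
    logger.error('Failed to process order', error as Error);
    throw error;
  }
};
```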
Another important consideration is enabling log sampling. Given CloudWatch pricing, it is probably not a good idea to log everything on every invocation. Instead, you emit low-priority (debug) logs for only a sampled fraction of invocations, which reduces log costs while still keeping enough detail to debug transient errors and reduce MTTR, without redeploying your app with debug-level logging enabled.
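With the Powertools Logger, sampling is a constructor option; a sketch with made-up numbers rather than recommendations:

```typescript
// Sketch: keep the default log level at ERROR, but emit DEBUG logs for a
// sampled fraction of invocations so transient issues can still be investigated.
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  serviceName: 'orders-api', // illustrative
  logLevel: 'ERROR',         // default verbosity in production
  sampleRateValue: 0.1,      // roughly 10% of invocations log at DEBUG level
});
```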
EC2 Log Aggregation
For aggregating logs from EC2 instances, we use the CloudWatch agent to capture and send syslog messages to CloudWatch, where each instance has its own log group.
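For reference, here is a sketch of the relevant part of the agent configuration, expressed as a TypeScript object written by a hypothetical provisioning script (the real config is a JSON file; paths and group names are placeholders):

```typescript
// Sketch: write a minimal CloudWatch agent config that ships syslog to a
// per-instance log group. {instance_id} is substituted by the agent itself.
import { writeFileSync } from 'node:fs';

const agentConfig = {
  logs: {
    logs_collected: {
      files: {
        collect_list: [
          {
            file_path: '/var/log/syslog',
            log_group_name: '/ec2/{instance_id}', // one log group per instance
            log_stream_name: 'syslog',
          },
        ],
      },
    },
  },
};

writeFileSync(
  '/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json',
  JSON.stringify(agentConfig, null, 2)
);
```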
We also have a few log files from different Python and Node.js scripts, which we send to an AWS S3 bucket along with other artifacts when the EC2 instance is done processing and is about to be terminated.
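In the Node.js scripts, that upload boils down to a plain S3 PutObject call; a sketch with placeholder bucket, key, and path names:

```typescript
// Sketch: upload a local log file to S3 before the EC2 worker terminates.
import { readFile } from 'node:fs/promises';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

export async function uploadLogs(jobId: string): Promise<void> {
  const body = await readFile('/var/log/worker/run.log'); // placeholder path

  await s3.send(
    new PutObjectCommand({
      Bucket: 'my-worker-artifacts', // placeholder bucket
      Key: `jobs/${jobId}/run.log`,  // placeholder key layout
      Body: body,
    })
  );
}
```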
Angular Frontend Log Aggregation
For the frontend web app, we use Sentry.io to gather all errors on the client side. Sentry provides an SDK for Angular apps via the @sentry/angular NPM package, allowing easy integration with the service.
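A sketch of the wiring, assuming a standalone Angular bootstrap (an NgModule-based app registers the same ErrorHandler provider in its root module); the DSN and import paths are placeholders:

```typescript
// Sketch: wiring @sentry/angular into an Angular app at bootstrap time.
import { ErrorHandler } from '@angular/core';
import { bootstrapApplication } from '@angular/platform-browser';
import * as Sentry from '@sentry/angular';
import { AppComponent } from './app/app.component';

Sentry.init({
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0', // placeholder DSN
  environment: 'production',
});

bootstrapApplication(AppComponent, {
  providers: [
    // Route uncaught Angular errors to Sentry.
    { provide: ErrorHandler, useValue: Sentry.createErrorHandler() },
  ],
});
```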
Metrics Aggregation
Since AWS Lambda doesn’t provide access to the underlying OS, we cannot use popular metrics daemons like StatsD. However, Lambda, as well as other AWS services, sends many useful metrics to CloudWatch out of the box, which is more than enough for our needs.
Here are some of the AWS Lambda metrics we use:
- Invocations: The total number of times a Lambda function is executed.
- Errors: The number of invocations that result in a function error.
- Duration: The time taken to execute a Lambda function.
- Throttles: The number of invocation attempts that are throttled due to exceeding concurrency limits.
Here are the API Gateway metrics we use:
- 4XXError: The number of client-side errors.
- 5XXError: The number of server-side errors.
- Count: The total number of API requests.
- IntegrationLatency: The time between when API Gateway relays a request to the backend and when it receives a response.
- Latency: The overall time between when API Gateway receives a request from a client and when it returns a response.

Custom Lambda Metrics
It is possible to send custom metrics from Lambda functions, for example, for some domain-specific events. To avoid performance overhead, custom metrics can be sent asynchronously as log messages using Embedded Metric Format (EMF).
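With Powertools, this is a couple of calls; a sketch where the namespace and the metric name are illustrative, not real domain events from this project:

```typescript
// Sketch: emit a custom metric via Embedded Metric Format (EMF) using Powertools.
// The metric is written as a structured log line; CloudWatch extracts it
// asynchronously, so there is no synchronous PutMetricData call in the handler.
import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({
  namespace: 'OrdersPlatform', // illustrative namespace
  serviceName: 'orders-api',
});

export const handler = async () => {
  // ... business logic ...
  metrics.addMetric('OrderCreated', MetricUnit.Count, 1); // hypothetical domain event
  metrics.publishStoredMetrics(); // flush the EMF log entry at the end of the invocation
};
```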
Distributed Tracing
At this point, we can get a lot of useful information about our system's health, but we don’t have an easy way to see how requests propagate across our system. In other words, we lack business transaction context when investigating logs. Different tools like Honeycomb provide tracing across distributed systems, but in our case, the cost-effective and easy-to-configure AWS X-Ray works just fine.
Enabling X-Ray for Lambda and API Gateway requires just a few lines in our SAM template. For integration with other AWS services, we use the previously mentioned Powertools for AWS Lambda, which instruments the AWS SDK.
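In SAM, this boils down to Tracing: Active on the function and TracingEnabled: true on the API. On the code side, here is a sketch of instrumenting an SDK client with the Powertools Tracer so that downstream calls (here SQS, with illustrative names) appear as subsegments in the trace:

```typescript
// Sketch: wrap an AWS SDK v3 client with the Powertools Tracer so calls to SQS
// show up as subsegments of the Lambda's X-Ray trace.
import { Tracer } from '@aws-lambda-powertools/tracer';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const tracer = new Tracer({ serviceName: 'orders-api' }); // illustrative name
const sqs = tracer.captureAWSv3Client(new SQSClient({}));

export const handler = async (event: { orderId: string }) => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.QUEUE_URL, // provided via environment in the template
      MessageBody: JSON.stringify({ orderId: event.orderId }),
    })
  );
};
```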

X-Ray, or distributed tracing tools in general, can also be used to analyze latencies. In the case of AWS Lambda, we can use this information to reduce the impact of cold starts on our functions.
One thing to be aware of is that X-Ray uses quite aggressive sampling. To address this, we configure X-Ray to ensure that all POST and PUT requests from API Gateway are captured.
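Sampling rules can be managed in the X-Ray console or programmatically; here is a sketch of a rule capturing every POST request (a second, analogous rule would cover PUT; all other values are placeholders):

```typescript
// Sketch: an X-Ray sampling rule that captures 100% of POST requests.
import { XRayClient, CreateSamplingRuleCommand } from '@aws-sdk/client-xray';

const xray = new XRayClient({});

await xray.send(
  new CreateSamplingRuleCommand({
    SamplingRule: {
      RuleName: 'trace-all-posts',
      Priority: 10,       // lower number = evaluated earlier
      FixedRate: 1.0,     // sample 100% of matching requests
      ReservoirSize: 1,   // requests per second traced before FixedRate applies
      ServiceName: '*',
      ServiceType: '*',
      Host: '*',
      HTTPMethod: 'POST', // one method per rule; a second rule covers PUT
      URLPath: '*',
      ResourceARN: '*',
      Version: 1,
    },
  })
);
```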
Some people find X-Ray insufficient when working with AWS and opt for more sophisticated but also more expensive tools like Lumigo. We may consider other tools in the future.
Alerting
Now that we can look up system information in various ways, it is also valuable to be immediately notified when something goes wrong.
We have set up different CloudWatch alarms that get triggered in different situations:
- 5XX errors in API Gateway.
- The word “ERROR” detected in the log of a Lambda function or an EC2 instance syslog.
- EC2 idle CPU for more than 3 hours.
- Angular errors reported by Sentry.
This setup involves configuring AWS services like CloudWatch Metrics, CloudWatch Alarms, SNS, and Chatbot.
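As an example, the “ERROR in Lambda logs” alarm from the list above roughly comes down to a metric filter plus an alarm whose action is an SNS topic (which Chatbot then relays to the team channel); every name and ARN below is a placeholder:

```typescript
// Sketch: count "ERROR" occurrences in a Lambda log group as a metric and
// alarm on it, notifying an SNS topic that Chatbot forwards to the team chat.
import {
  CloudWatchLogsClient,
  PutMetricFilterCommand,
} from '@aws-sdk/client-cloudwatch-logs';
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const logs = new CloudWatchLogsClient({});
const cloudwatch = new CloudWatchClient({});

await logs.send(
  new PutMetricFilterCommand({
    logGroupName: '/aws/lambda/orders-api', // placeholder
    filterName: 'error-count',
    filterPattern: 'ERROR', // matches log events containing the word "ERROR"
    metricTransformations: [
      { metricName: 'OrdersApiErrors', metricNamespace: 'OrdersPlatform', metricValue: '1' },
    ],
  })
);

await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: 'orders-api-errors',
    Namespace: 'OrdersPlatform',
    MetricName: 'OrdersApiErrors',
    Statistic: 'Sum',
    Period: 300, // 5-minute windows
    EvaluationPeriods: 1,
    Threshold: 1,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    TreatMissingData: 'notBreaching',
    AlarmActions: ['arn:aws:sns:eu-west-1:123456789012:ops-alerts'], // placeholder topic
  })
);
```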

For the Angular app, the setup is slightly different, requiring configuration in Sentry, a special adapter endpoint on AWS, and a webhook in MS Teams.
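A heavily simplified sketch of such an adapter as a Lambda handler; the payload fields and the environment variable are assumptions, not Sentry’s exact webhook schema:

```typescript
// Sketch: a Lambda that adapts a Sentry webhook call into a message posted to
// an MS Teams incoming webhook. Payload field names are illustrative assumptions.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

const TEAMS_WEBHOOK_URL = process.env.TEAMS_WEBHOOK_URL!; // configured per environment

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const payload = JSON.parse(event.body ?? '{}');

  // Assumed payload shape, kept minimal for illustration.
  const title = payload?.data?.issue?.title ?? 'Sentry alert';
  const url = payload?.data?.issue?.web_url ?? '';

  // Teams incoming webhooks accept a simple JSON body with a "text" field.
  await fetch(TEAMS_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `Frontend error: ${title} ${url}` }),
  });

  return { statusCode: 200, body: 'ok' };
};
```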

Conclusion: System Observability in an AWS Production Environment
Building reliable software is inherently challenging due to complex requirements, unforeseen user behavior, and potential hardware or network failures in a production environment. Traditional monitoring techniques are no longer sufficient, especially with the rise of microservices, making system observability in an AWS production environment essential for maintaining reliability.
By utilizing CloudWatch for centralized log and metrics aggregation, X-Ray for distributed tracing, and Sentry for frontend error tracking, teams can gain deep visibility into their systems. Structured logging, log sampling, and proactive alerting further enhance system monitoring, allowing quick identification and resolution of issues. Implementing a strong observability strategy on AWS ensures that even in complex, dynamic environments, teams can detect problems early, optimize performance, and maintain a high-quality user experience.
This is one of the many production-grade cloud systems we have built for our customers at Apptimia. If you are building a scalable cloud system to be deployed in production and want to stay in charge of its performance, reliability, and user experience, get in touch with us!
Bartosz Ch.
Lead Software Engineer at Apptimia