Introduction:
An effective monitoring system for a complex microservices architecture needs several capabilities. It should trace transactions across services, which simplifies debugging and performance optimization, and it should surface request-response times so we can improve the user experience. Root cause analysis and trend detection across system logs and metrics are crucial for proactively mitigating issues and keeping the system stable. Finally, real-time alerts via email or Slack, coupled with anomaly detection, ensure that any irregularities or errors in the system are reported promptly.
Architecture Overview:
We use the ELK stack, Elastic APM, Filebeat, Metricbeat, and CloudWatch for our monitoring system. Deploying ELK in the cloud is convenient: we can simply sign up and deploy the service. Our Node.js application runs on Elastic Beanstalk, where Filebeat ships log files to ELK. APM lets us trace transactions and monitor API performance, capturing errors automatically, and Kibana gives us views of logs, request-response times, and transactions. We have also set up alerts through Kibana. Our event microservice is built on a serverless stack, and Filebeat pulls its logs from CloudWatch.
Beanstalk Node.js Application Integration:

files:
  "/etc/filebeat/filebeat.yml":
    mode: "000755"
    owner: root
    group: root
    content: |
      filebeat.inputs:
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-basicEventHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-connectionHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-disconnectionHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-enhancedEventHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-fetchEventHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-mainScheduler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-optionHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-presenceHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: aws-cloudwatch
          log_group_arn: arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws/lambda/event-service-ELASTIC_APM_ENV-wssPingHandler:*
          scan_frequency: 5m
          enabled: true
          access_key_id: ELK_AWS_ACCESS_KEY_ID
          secret_access_key: ELK_AWS_SECRET_ACCESS_KEY
          start_position: beginning
        - type: log
          enabled: true
          paths:
            - /var/app/current/logs/errors.log
            - /var/app/current/logs/events.log
            - /var/app/current/logs/messages.log
            - /var/app/current/logs/shell.log
            - /var/app/current/logs/warnings.log
          fields:
            environment: ELASTIC_APM_ENV
      cloud.id: "ELK_CLOUD_ID"
      cloud.auth: "ELK_CLOUD_AUTH"
      name: 'api-server-filebeat'
"/home/ec2-user/beat_env_setup.sh":
mode: "000755"
owner: root
group: root
content: |
#!/bin/bash
sudo yum install jq
echo "elastic beanstal env print"
ELK_CLOUD_AUTH=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_CLOUD_AUTH)
ELK_CLOUD_ID=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_CLOUD_ID)
ELASTIC_APM_ENV=$(/opt/elasticbeanstalk/bin/get-config environment -k ELASTIC_APM_ENV)
ELK_AWS_ACCESS_KEY_ID=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_AWS_ACCESS_KEY_ID)
ELK_AWS_SECRET_ACCESS_KEY=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_AWS_SECRET_ACCESS_KEY)
AWS_REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/.$//')
ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
ELASTIC_APM_ENV=$(/opt/elasticbeanstalk/bin/get-config environment -k ELASTIC_APM_ENV)
sed -i "s/ELK_CLOUD_ID/$ELK_CLOUD_ID/g" /etc/filebeat/filebeat.yml
sed -i "s/ELK_CLOUD_AUTH/$ELK_CLOUD_AUTH/g" /etc/filebeat/filebeat.yml
sed -i "s/ELASTIC_APM_ENV/$ELASTIC_APM_ENV/g" /etc/filebeat/filebeat.yml
sed -i "s/ELK_AWS_ACCESS_KEY_ID/$ELK_AWS_ACCESS_KEY_ID/g" /etc/filebeat/filebeat.yml
sed -i "s/ELK_AWS_SECRET_ACCESS_KEY/$ELK_AWS_SECRET_ACCESS_KEY/g" /etc/filebeat/filebeat.yml
sed -i "s/AWS_REGION/$AWS_REGION/g" /etc/filebeat/filebeat.yml
sed -i "s/ACCOUNT_ID/$ACCOUNT_ID/g" /etc/filebeat/filebeat.yml
sed -i "s/ELASTIC_APM_ENV/$ELASTIC_APM_ENV/g" /etc/filebeat/filebeat.yml
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.8.1-x86_64.rpm
sudo rpm -vi filebeat-8.8.1-x86_64.rpm
sudo service filebeat start
# Commands that will be run on container_commmands
# Here the container variables will be visible as environment variables.
commands:
1_command:
command: "./beat_env_setup.sh"
cwd: /home/ec2-user
files:
  "/etc/metricbeat/metricbeat.yml":
    mode: "000755"
    owner: root
    group: root
    content: |
      metricbeat.config.modules:
        path: ${path.config}/modules.d/*.yml
        reload.period: 10s
        reload.enabled: true
      metricbeat.max_start_delay: 10s
      setup.dashboards.enabled: true
      metricbeat.modules:
        - module: system
          metricsets:
            - cpu             # CPU usage
            - load            # CPU load averages
            - memory          # Memory usage
            - network         # Network IO
            - process         # Per process metrics
            - process_summary # Process summary
            - uptime          # System uptime
            - socket_summary  # Socket summary
            #- core           # Per CPU core usage
            #- diskio         # Disk IO
            #- filesystem     # File system usage for each mountpoint
            #- fsstat         # File system summary metrics
            #- raid           # RAID
            #- socket         # Sockets and connection info (Linux only)
            #- service        # systemd service information
          enabled: true
          period: 10s
          processes: ['.*']
          # Configure the mount point of the host's filesystem for use in monitoring a host from within a container
          #hostfs: "/hostfs"
          # Configure the metric types that are included by these metricsets.
          cpu.metrics: ["percentages", "normalized_percentages"] # The other available option is ticks.
          core.metrics: ["percentages"]
      cloud.id: "ELK_CLOUD_ID"
      cloud.auth: "ELK_CLOUD_AUTH"
      logging.to_files: true
      logging.files:
        # Configure the path where the logs are written. The default is the logs directory
        # under the home path (the binary location).
        path: /var/log/metricbeat
      name: 'api-server-ELASTIC_APM_ENV'
      fields:
        env: ELASTIC_APM_ENV
      processors:
        - add_host_metadata: ~
        - add_cloud_metadata: ~
"/home/ec2-user/metric_beat_env_setup.sh":
mode: "000755"
owner: root
group: root
content: |
#!/bin/bash
ELK_CLOUD_AUTH=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_CLOUD_AUTH)
ELK_CLOUD_ID=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_CLOUD_ID)
ELK_AWS_ACCESS_KEY_ID=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_AWS_ACCESS_KEY_ID)
ELASTIC_APM_ENV=$(/opt/elasticbeanstalk/bin/get-config environment -k ELASTIC_APM_ENV)
ELK_AWS_SECRET_ACCESS_KEY=$(/opt/elasticbeanstalk/bin/get-config environment -k ELK_AWS_SECRET_ACCESS_KEY)
AWS_REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/.$//')
AWS_INSTATANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
sed -i "s/ELK_CLOUD_ID/$ELK_CLOUD_ID/g" /etc/metricbeat/metricbeat.yml
sed -i "s/ELK_CLOUD_AUTH/$ELK_CLOUD_AUTH/g" /etc/metricbeat/metricbeat.yml
sed -i "s/ELK_AWS_ACCESS_KEY_ID/$ELK_AWS_ACCESS_KEY_ID/g" /etc/metricbeat/metricbeat.yml
sed -i "s/ELK_AWS_SECRET_ACCESS_KEY/$ELK_AWS_SECRET_ACCESS_KEY/g" /etc/metricbeat/metricbeat.yml
sed -i "s/AWS_REGION/$AWS_REGION/g" /etc/metricbeat/metricbeat.yml
sed -i "s/AWS_INSTATANCE_ID/$AWS_INSTATANCE_ID/g" /etc/metricbeat/metricbeat.yml
sed -i "s/ELASTIC_APM_ENV/$ELASTIC_APM_ENV/g" /etc/metricbeat/metricbeat.yml
curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.8.1-x86_64.rpm
sudo rpm -vi metricbeat-8.8.1-x86_64.rpm
metricbeat modules enable aws
sudo service metricbeat start
sudo service metricbeat restart
# Commands that will be run on container_commmands
# Here the container variables will be visible as environment variables.
commands:
1_command:
command: "./metric_beat_env_setup.sh"
cwd: /home/ec2-user
Using the two snippets above, you can configure Filebeat and Metricbeat in the Beanstalk application; place each snippet in its own .config file under the application's .ebextensions directory so Elastic Beanstalk applies it during deployment.
Elastic APM Configuration:
const config = require('./config');

const apm = require('elastic-apm-node').start({
  serviceName: config.ELASTIC_APM_SERVICE_NAME,
  secretToken: config.ELASTIC_APM_SERVICE_SECRET,
  serverUrl: config.ELASTIC_APM_SERVER_URL,
  environment: config.ELASTIC_APM_ENV
});

function apmErrorCapture(err) {
  apm.captureError(err);
}

module.exports = {
  apmErrorCapture,
  apm
};
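One detail worth calling out: the elastic-apm-node agent has to be started before any other module is loaded, so this file should be the very first require in the application's entry point. A minimal sketch, assuming the module lives at ./config/elk-apm as in the requires above:

// index.js - require the APM module before anything else so the agent can
// instrument express, http, and the other libraries as they load.
const { apm } = require('./config/elk-apm');

const express = require('express');
const app = express();

app.get('/health', (req, res) => res.send('ok'));

app.listen(3000, () => console.log('api listening on 3000'));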
We set the user ID on the APM transaction from an Express middleware, so every transaction is tagged with the user who triggered it:
app.use(function (req, res, next) {
  const userId = req.headers['x-user-id'];
  if (userId) {
    apm.setUserContext({ id: userId });
  }
  next();
});
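The agent instruments common libraries automatically, but we can also mark up interesting units of work with custom spans so they appear on the transaction timeline. A short sketch; fetchPatients and the db client here are hypothetical examples, not code from our service:

// startSpan() returns null when there is no active transaction,
// hence the optional chaining on end().
async function fetchPatients(db, wardId) {
  const span = apm.startSpan('db.fetchPatients');
  try {
    return await db.query('SELECT * FROM patients WHERE ward_id = $1', [wardId]);
  } finally {
    span?.end();
  }
}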
Custom Logging with log4js:
const log4js = require('log4js');
const { apm } = require('../../config/elk-apm');
log4js.configure({
  appenders: {
    default: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/shell.log'
    },
    error: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/errors.log'
    },
    warning: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/warnings.log'
    },
    event: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/events.log'
    },
    message: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/messages.log'
    },
    query: {
      type: 'file',
      layout: {
        type: 'pattern',
        pattern: '[%d] [%p] - %m',
      },
      filename: './logs/queries.log'
    },
  },
  categories: {
    default: { appenders: ['default'], level: 'ALL' },
    errors: { appenders: ['error'], level: 'ERROR' },
    warnings: { appenders: ['warning'], level: 'WARN' },
    events: { appenders: ['event'], level: 'INFO' },
    messages: { appenders: ['message'], level: 'INFO' },
    queries: { appenders: ['query'], level: 'TRACE' }
  },
});
const log = (phase, message) => {
  const traceId = apm?.currentTraceIds?.['trace.id'];
  const logMessage = `${message}`;
  switch (phase) {
    case Logger.ERROR:
      log4js.getLogger(phase).error(logMessage, traceId);
      break;
    case Logger.EVENT:
      log4js.getLogger(phase).info(logMessage, traceId);
      break;
    case Logger.MESSAGE:
      log4js.getLogger(phase).info(logMessage, traceId);
      break;
    case Logger.WARNING:
      log4js.getLogger(phase).warn(logMessage, traceId);
      break;
    case Logger.QUERY:
      log4js.getLogger(phase).trace(logMessage, traceId);
      break;
    default:
      log4js.getLogger(Logger.DEFAULT).info(logMessage, traceId);
      break;
  }
};

const Logger = {
  DEFAULT: 'shell',
  ERROR: 'errors',
  EVENT: 'events',
  MESSAGE: 'messages',
  WARNING: 'warnings',
  QUERY: 'queries',
  log,
  // Named to match its use in generateEvent() below
  createSqsEventLogger: (event) => {
    log(Logger.MESSAGE, JSON.stringify(event));
  },
};

module.exports = Logger;
When writing custom logs, we include the current APM trace ID, which lets us associate each log entry with a specific transaction in Kibana.
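For instance, a request handler can write to the event and error logs like this; refreshPatientList and the messages are illustrative, not code from our service:

const Logger = require('../helpers/LoggerService');

async function refreshPatientList() {
  // Both entries carry the active transaction's trace ID automatically.
  Logger.log(Logger.EVENT, 'refreshing patient list');
  try {
    // ... do the work ...
  } catch (err) {
    Logger.log(Logger.ERROR, `failed to refresh patient list: ${err.message}`);
    throw err;
  }
}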
Microservice Distributed Tracing:
We have an event-driven microservice that runs on a serverless stack, and all of its logs are written to CloudWatch. Whenever we log, we include a trace ID.
* This is how our Lambda functions write log entries, passing along the trace ID received in the event payload:
LOGGER.debug(`${COMPONENT}`, 'Processing message - basicEventHandler', eventPayload.traceId);
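For context, a minimal sketch of what such a handler can look like; the console-backed LOGGER and the SQS record shape are assumptions for illustration, based on the event format produced below:

// basicEventHandler - consumes events from the queue and logs each message
// with the producer's trace ID so the entries line up in Kibana.
// LOGGER stands in for the service's own logging helper (an assumption).
const LOGGER = {
  debug: (component, message, traceId) =>
    console.debug(JSON.stringify({ component, message, traceId })),
};

const COMPONENT = 'basicEventHandler';

module.exports.handler = async (sqsEvent) => {
  for (const record of sqsEvent.Records) {
    const eventPayload = JSON.parse(record.body);
    LOGGER.debug(COMPONENT, 'Processing message - basicEventHandler', eventPayload.traceId);
    // ... handle the event ...
  }
};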
- On the API side, we push events to the event service, attaching the current trace ID so it travels with the message:
const { v4: uuidv4 } = require('uuid');
const Logger = require('../helpers/LoggerService.js');
const { apmErrorCapture, apm } = require('../../config/elk-apm');
// STAFF_LIST_CHANGE, PATIENT_LIST_CHANGE, PATIENT_CHANGE, NOTIFICATION,
// the decorate* helpers, and sendSQSEvent are defined elsewhere in the service.

async function generateEvent(type, metadata) {
  try {
    const traceId = apm?.currentTraceIds?.['trace.id'];
    let event = {
      type: type,
      eventId: uuidv4(),
      createdAt: Date.now(),
      traceId
    };
    switch (type) {
      case STAFF_LIST_CHANGE:
        event = await decorateStaffListChange(event, metadata);
        break;
      case PATIENT_LIST_CHANGE:
        event = await decoratePatientListChange(event, metadata);
        break;
      case PATIENT_CHANGE:
        event = await decoratePatientChange(event, metadata);
        break;
      case NOTIFICATION:
        event = await decorateNotification(event, metadata);
        break;
      default:
        event = null;
        break;
    }
    Logger.createSqsEventLogger(event);
    if (event) {
      const eventRes = await sendSQSEvent(event);
      Logger.createSqsEventLogger(eventRes);
    }
  } catch (error) {
    apmErrorCapture(error);
    throw error;
  }
}

module.exports = {
  generateEvent
};
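The post does not show sendSQSEvent itself; a hedged sketch of what it might look like with the AWS SDK v3 (the EVENT_QUEUE_URL environment variable is hypothetical):

const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({ region: process.env.AWS_REGION });

// Serialize the event (traceId included) and push it onto the event
// service's queue.
async function sendSQSEvent(event) {
  return sqs.send(new SendMessageCommand({
    QueueUrl: process.env.EVENT_QUEUE_URL,
    MessageBody: JSON.stringify(event),
  }));
}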
Monitoring and Observability:
This is how the time taken by each service interaction within a transaction is broken down:

This is how event service logs are added to the transaction.

Traceability and Troubleshooting:
If a user reports an issue, we can list all of their transactions and errors by filtering on their user ID:
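In Kibana this is just a filter on user.id, which the setUserContext middleware above populates. Programmatically, the same lookup might look like this hedged sketch using the Elasticsearch JS client (the index pattern and credentials are assumptions):

const { Client } = require('@elastic/elasticsearch');

const client = new Client({
  cloud: { id: process.env.ELK_CLOUD_ID },
  auth: { apiKey: process.env.ELK_API_KEY },
});

// Find every APM transaction recorded for a given user.
async function transactionsForUser(userId) {
  const result = await client.search({
    index: 'traces-apm*',
    query: { term: { 'user.id': userId } },
    sort: [{ '@timestamp': 'desc' }],
  });
  return result.hits.hits;
}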

Alert Configuration:
We have configured alerts through Kibana to be sent via email and Slack:
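Our alerts are set up through Kibana's rule UI, so there is no code to show. For readers who prefer alerts as code, a roughly equivalent hedged sketch using Elasticsearch's Watcher API instead; the index pattern, Slack channel, and notification account are all placeholders that would need to exist in the cluster's configuration:

const { Client } = require('@elastic/elasticsearch');

const client = new Client({
  cloud: { id: process.env.ELK_CLOUD_ID },
  auth: { apiKey: process.env.ELK_API_KEY },
});

// Fire when any line from errors.log arrives within a 5 minute window.
async function createErrorWatch() {
  await client.watcher.putWatch({
    id: 'api-error-alert',
    trigger: { schedule: { interval: '5m' } },
    input: {
      search: {
        request: {
          indices: ['filebeat-*'],
          body: {
            query: {
              bool: {
                filter: [
                  { wildcard: { 'log.file.path': '*errors.log' } },
                  { range: { '@timestamp': { gte: 'now-5m' } } },
                ],
              },
            },
          },
        },
      },
    },
    condition: { compare: { 'ctx.payload.hits.total': { gt: 0 } } },
    actions: {
      notify_slack: {
        slack: { message: { to: ['#alerts'], text: 'API errors detected in the last 5 minutes' } },
      },
    },
  });
}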

In today's complex world of microservices, monitoring and tracing have become more critical than ever before. As we've explored in this blog, implementing a monitoring system, complete with distributed tracing, offers numerous benefits. It allows us to detect and resolve issues proactively, ensuring that our applications run smoothly and our users enjoy a seamless experience.