Monitoring tools and portfolio

Apr 13, 2020

This is the warring-states era of monitoring tools. With so many tools coming out, people are confused about which ones to use!

Classify the roles like this and arrange the tools you have along the horizontal axis.
In this example there is no Service Health Check and no SIEM, so the company should consider introducing them right away.

Introduction

So, which monitoring tools do you use? Zabbix? Prometheus? Datadog? Some of you probably use Mackerel.

Apart from those, I think many people use an APM product such as New Relic, Dynatrace, or Instana. Others use OpenZipkin or Jaeger for distributed tracing. On top of that, you may have added Kibana for log analysis and PagerDuty for alert management.

So how do you use each of those tools for the right job? If each tool had a single function it would be easy to discuss, but many tools have a variety of functions, so you end up being asked, “Don't you already have a similar tool?”

Many monitoring tools are complex and pack multiple functions into one product. So whenever you want to bring in a new tool, the question tends to come up: how is it different from what you already have?

Of course, we evaluate a tool when we introduce it, but it is questionable whether Zabbix and New Relic can be compared on the same footing just because both can show CPU metrics and, with enough customization, Zabbix can visualize application response times.

Each tool has its own specialty, so I think you will need to deploy several of them, but still...

So I categorized the roles of monitoring tools to make it easier to reason about which tools you have.

I divided the roles into the following 10.

Digital Experience Management
Service Health Check
Application Performance Monitoring (APM)
Distributed Tracing
Middleware Monitoring
Server Monitoring
Infrastructure Monitoring
Security Information and Event Management (SIEM)
Event Management
Log Analysis / Dashboard

There are levels within each role, such as whether or not containers are supported, but I think this is a good way to categorize the roles.

Explanation of each role

Digital Experience Management

Representative tools:

Marketing origin: Adobe Analytics, Google Analytics.
APM origin: Firebase, Dynatrace UEM, New Relic, AppDynamics

Overview:

UX monitoring tools. Some originate from marketing tools and some from APM, but all of them analyze user behavior. Also called User Experience Monitoring.
Probably the hardest category to introduce, because it is the prime example of functions with similar purposes coming from different origins
Tools of APM origin seem to be weaker than those of marketing origin at analyzing business information. Their strength may be limited to correlating response time with retention rate or drop-off count.
On the other hand, tools of marketing origin often cannot tell you the response time, so it is reasonable to deploy both for the time being.
Since real-time results are not required, I think many organizations build their own by analyzing logs.

Service Health Check

Representative tools:

Zabbix, Netcool, Dynatrace, New Relic

Overview:

Liveness monitoring, or health check. Confirms that the service is operating normally. A typical example is confirming that a request sent to a specific URL (such as /healthcheck) returns a normal response
What matters is that the application is working normally, not whether the server or the process is alive
Be careful: if the health check implementation is too simple, you end up with zombies that return 400 or 500 for everything except the health check
In more advanced applications and middleware, the health check also reports which functions are not working
It exists as a feature of most APM and integrated monitoring tools, but it is sometimes deployed on its own, so I list it as a separate role
With the spread of containers, processes dying has become routine, so this is an area that needs new UIs, such as heat maps that do not normally show the death of individual application instances
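
As a rough illustration of a health check that looks at the application rather than just the process, here is a minimal Python sketch. The response format (a JSON body with a "checks" object mapping function names to "ok" or something else) and the names evaluate_health and probe are my own assumptions for this example, not the format of any particular tool.

```python
import json
from urllib import error, request


def evaluate_health(status: int, body: str) -> tuple[bool, list[str]]:
    """Return (healthy, failed_functions) for one health check response.

    Assumed (hypothetical) body format: {"checks": {"db": "ok", "queue": "down"}}
    """
    if status != 200:
        return False, []
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A 200 with an unreadable body is treated as unhealthy.
        return False, []
    if not isinstance(payload, dict):
        return False, []
    checks = payload.get("checks", {})
    if not isinstance(checks, dict):
        return False, []
    failed = sorted(name for name, state in checks.items() if state != "ok")
    return (not failed, failed)


def probe(url: str, timeout: float = 3.0) -> tuple[bool, list[str]]:
    """Send one request to the health check URL and evaluate the response."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate_health(resp.status, body)
    except error.HTTPError as exc:
        # 4xx / 5xx responses are raised as HTTPError by urllib.
        return evaluate_health(exc.code, "")
    except OSError:
        # Connection refused, DNS failure, timeout, and so on.
        return False, []


print(evaluate_health(200, '{"checks": {"db": "ok", "queue": "down"}}'))
```

The point of returning the failed function names is the one made above: a bare 200 from /healthcheck says little, while a list of broken dependencies tells you what is actually not working.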

Application Performance Monitoring (APM)

Representative tools:

Dynatrace, New Relic, Datadog, ENdoSnipe, Stackdriver APM

Overview:

Tools for monitoring and analyzing application performance
The main function is to detect response degradation and identify its cause (diagnosis). Because of this, many of them support distributed tracing
Because they can pinpoint slowness down to the method level, how quickly you can respond to a failure changes dramatically depending on whether you have such a tool
Server metrics are collected only as reference information, often limited to CPU and memory
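
To give a feel for what "method level" means, here is a toy Python sketch that records how long each decorated function takes and ranks them. Real APM products instrument code through agents rather than hand-written decorators, so treat this only as an illustration; timed, slowest, and the two sample functions are hypothetical names.

```python
import functools
import time
from collections import defaultdict

# function name -> list of observed durations in seconds
timings: dict[str, list[float]] = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even when func raises, so failures are timed too.
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


def slowest(n: int = 3) -> list[tuple[str, float]]:
    """Return the n functions with the highest total time, slowest first."""
    totals = {name: sum(durations) for name, durations in timings.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


@timed
def render_template() -> str:
    return "<html></html>"


@timed
def query_database() -> None:
    time.sleep(0.05)  # stand-in for a slow query


render_template()
query_database()
print(slowest(1))
```

With data like this you can see at a glance that the time is going into the query, not the rendering, which is the kind of answer an APM tool gives you during an incident.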

Distributed Tracing

Representative tools:

Dynatrace, New Relic, Datadog, Stackdriver APM, Jaeger, Dapper, OpenTracing

Overview:

Tools that track a request across multiple applications together with its context
Since SOA and microservice architectures (MSA) inevitably span multiple applications, tracing is essential for root cause analysis and has grown in importance
Until general-purpose tools appeared, everyone wrote a trace ID into their logs, narrowed things down by timestamp, and worked it out the hard way
A field being actively developed by APM vendors. OpenTelemetry, which merged OpenTracing and OpenCensus and is supported by various commercial tools, is likely to become the de facto standard
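
The "hard way" mentioned above can be sketched in a few lines of Python: pull the trace ID out of each log line, group by it, and sort by timestamp to reconstruct the path of one request. The log format (timestamp, service=..., trace_id=..., message) and the function names are assumptions made up for this example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> service=<name> trace_id=<id> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) service=(?P<service>\S+) trace_id=(?P<trace_id>\S+) (?P<message>.*)$"
)


def group_by_trace(lines):
    """Group log lines by trace ID, each group sorted by timestamp."""
    traces = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not carry a trace ID
        timestamp = datetime.fromisoformat(match["ts"])
        traces[match["trace_id"]].append(
            (timestamp, match["service"], match["message"])
        )
    for events in traces.values():
        events.sort(key=lambda event: event[0])
    return dict(traces)


def path_of(traces, trace_id):
    """Return the services one request passed through, in time order."""
    return [service for _, service, _ in traces.get(trace_id, [])]


LOGS = [
    "2020-04-13T10:00:02 service=payment trace_id=abc123 charge failed",
    "2020-04-13T10:00:00 service=frontend trace_id=abc123 POST /order",
    "2020-04-13T10:00:01 service=order trace_id=abc123 create order",
    "2020-04-13T10:00:01 service=frontend trace_id=def456 GET /",
]

print(path_of(group_by_trace(LOGS), "abc123"))
# ['frontend', 'order', 'payment']
```

Dedicated tracing tools do essentially this for you, plus timing of each hop and propagation of the ID between services, which is why they replaced the manual approach.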

Middleware Monitoring

Representative tools:

Oracle Enterprise Manager, Cloudera Manager, Confluent Control Center, Prometheus

Overview:

Tools specialized for monitoring middleware such as databases, message queues, and distributed processing platforms. Usually provided by the middleware vendor
Because they are more specialized than general monitoring tools, the level of detail in the data and the usability of the UI are often much better
However, they can end up siloed from other systems and their alerting is sometimes weak, so a safe approach is to forward key metrics to the tool you use for Event Management and keep the middleware tool for analysis when problems occur

Server Monitoring

Representative tools:

Zabbix, Datadog, Prometheus, Mackerel, Nagios, Dynatrace

Overview:

Tools that collect and monitor various server metrics such as CPU, memory, and load
Linux and Windows servers are monitored here
A type of tool that has been in use for many years. The basics of monitoring.
However, since these metrics are not service level indicators (SLIs) by nature, in the context of AIOps it is even recommended that you not pay much attention to them day to day

Infrastructure Monitoring

Representative tools:

Zabbix, Datadog, Prometheus, Mackerel, Nagios

Overview:

Monitors the values of network devices and storage devices using SNMP
Basically the same as Server Monitoring; the difference is whether or not an agent can be installed. So appliances actually belong here.
An important monitoring point, since these are likely to be shared resources and susceptible to noisy neighbours

Security Information and Event Management (SIEM)

Representative tools:

Splunk, Kibana SIEM, Azure Sentinel, McAfee SIEM

Overview:

A tool for collecting and analyzing security logs. Pronounced “seam.”
Collects and aggregates audit logs (who performed what operation, and when) from PCs, servers, operations tools, and back-office tools
It monitors not only internal information but also network packets and the like, and is used for intrusion analysis.
The collected information is reported to and analyzed by the SOC team and CIRT. Recently, many products also detect threats automatically with AI
If your company is the kind that handles personal information and you do not deploy this properly, you will be in serious trouble with the government

Event Management

Representative tools:

Zabbix, Prometheus, Stackdriver Error Reporting, PagerDuty

Overview:

The function that sends out alerts. Almost every monitoring tool comes with it
When you use multiple monitoring tools, alerts issued separately from each become hard to manage, so the standard practice is to aggregate them into one specific tool and act from there
The most basic alert channel is email, but “email is not an incident management tool.” Send alerts to JIRA, Redmine, ServiceNow, etc. if possible
High-priority alerts can additionally be sent by email, chat, or phone
A smart tool does not simply alert the moment an event occurs; it groups similar alerts, verifies them with custom scripts, and uses AI to check correlations with other metrics and reduce the number of alerts, so that you can focus on the essential technical issues
The main battlefield of AIOps. Escaping noisy alerts and “boy who cried wolf” alerts is an important function for optimizing operations
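
The simplest form of "grouping similar alerts" is suppressing repeats of the same alert within a time window. Here is a minimal Python sketch of that idea, together with severity-based routing. It is not how any specific product works; the Alert fields, the 300-second window, and the channel names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    host: str
    check: str
    severity: str = "warning"


def deduplicate(alerts, window_seconds: float = 300.0):
    """Drop alerts that repeat the same (host, check) within the window
    that starts at the last alert actually notified for that pair."""
    last_notified: dict[tuple[str, str], float] = {}
    notifications = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        previous = last_notified.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # same problem reported again too soon: suppress
        last_notified[key] = alert.timestamp
        notifications.append(alert)
    return notifications


def route(alert: Alert) -> list[str]:
    """Every alert becomes a ticket; critical ones also go to chat and phone."""
    channels = ["ticket"]
    if alert.severity == "critical":
        channels += ["chat", "phone"]
    return channels


raw = [
    Alert(0, "web1", "cpu"),
    Alert(30, "db1", "disk", "critical"),
    Alert(60, "web1", "cpu"),
    Alert(120, "web1", "cpu"),
    Alert(400, "web1", "cpu"),
]
for alert in deduplicate(raw):
    print(alert.timestamp, alert.host, alert.check, route(alert))
```

In this sample, five raw alerts become three notifications. Real event management tools go much further (correlation across metrics, custom verification scripts), but even this much removes a lot of noise.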

Log Analysis / Dashboard

Representative tools:

ELK, Metabase, Redash, Stackdriver + BigQuery, Grafana, Splunk

Overview:

Tools for collecting, analyzing, and visualizing various logs. Collection, analysis, and visualization can each be handled by a different tool.
The main use is to show, on your own dashboard, analysis that cannot be done with existing tools
Often used as DEM / UEM.
In many cases the same platform is also used by the business side as BI, not only for system monitoring
If the log volume is huge, stream processing such as Kafka or Spark may be placed in between

Monitoring tool example portfolio

For example, when a company uses multiple monitoring tools, it can be represented as follows.

So, if you want to bring in a new APM tool, you can see that it needs to be compared with New Relic, and if that tool also covers Server Monitoring, it will be partially compared with Zabbix as well.

Also, since there is no Service Health Check or SIEM, you can see that, depending on the company, the question to ask is “Is this setup really okay?”

Since this table is for checking your portfolio and finding tools that are missing or duplicated, you need a separate set of criteria for things like evaluating Dynatrace against Instana as APM.
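
The same check can be done mechanically. Here is a minimal Python sketch that takes a mapping from tools to the roles they are used for and reports the missing and overlapping roles. The portfolio contents are a made-up example (only loosely following the one described above), and "Candidate Tool" is a hypothetical product you are thinking of introducing.

```python
ROLES = [
    "Digital Experience Management",
    "Service Health Check",
    "Application Performance Monitoring (APM)",
    "Distributed Tracing",
    "Middleware Monitoring",
    "Server Monitoring",
    "Infrastructure Monitoring",
    "Security Information and Event Management (SIEM)",
    "Event Management",
    "Log Analysis / Dashboard",
]


def analyze(portfolio: dict[str, set[str]]):
    """Return (missing_roles, overlapping_roles) for a tool -> roles mapping."""
    coverage = {
        role: sorted(tool for tool, roles in portfolio.items() if role in roles)
        for role in ROLES
    }
    missing = [role for role, tools in coverage.items() if not tools]
    overlapping = {role: tools for role, tools in coverage.items() if len(tools) > 1}
    return missing, overlapping


# Hypothetical portfolio: the roles each tool is actually used for.
portfolio = {
    "Google Analytics": {"Digital Experience Management"},
    "New Relic": {"Application Performance Monitoring (APM)", "Distributed Tracing"},
    "Oracle Enterprise Manager": {"Middleware Monitoring"},
    "Zabbix": {"Server Monitoring", "Infrastructure Monitoring", "Event Management"},
    "ELK": {"Log Analysis / Dashboard"},
}

missing, _ = analyze(portfolio)
print("Missing:", missing)

# What would overlap if we introduced a tool covering APM and Server Monitoring?
candidate = {
    **portfolio,
    "Candidate Tool": {"Application Performance Monitoring (APM)", "Server Monitoring"},
}
_, overlapping = analyze(candidate)
print("Overlapping:", overlapping)
```

The overlap list tells you which existing tools the candidate has to be compared against, and the missing list tells you where you have no coverage at all.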

Written by Maciej