This is an era of great war among monitoring tools. With so many tools coming out, it is easy to get confused about what to use!
- Classify the roles of monitoring tools as described here, and lay out the tools you already have along the horizontal axis
- In this example there is no Service Health Check and no SIEM, so you should consider introducing them right away.
Introduction
So, what do you use for monitoring? Zabbix? Prometheus? Datadog? I think some people use Mackerel and the like.
Apart from those, I think many people use an APM such as New Relic, Dynatrace, or Instana. Others use OpenZipkin or Jaeger for distributed tracing. Or maybe you have added Kibana for log analysis on top of those, and PagerDuty for alert management.
So, are you using each of those tools for the right purpose? If each tool did only one thing this would be easy, but many tools have a range of functions, and you end up being asked, “Don't we already have a similar tool?”
Many monitoring tools are very complex, with multiple functions in one product. So when you try to introduce a new tool, the question tends to be, “How is it different from the one we already have?”
Of course, tools are evaluated when they are introduced, but it is questionable whether Zabbix and New Relic can be compared on the same footing just because both can show CPU metrics and, with enough customization, Zabbix can visualize application response times.
Each tool has its own specialty, so I think you will need several of them, but even so it is hard to sort out.
So I categorized the roles of monitoring tools to make it easier to think about which tools you have.
I categorized the roles of monitoring tools into the following 10.
- Digital Experience Management
- Service Health Check
- Application Performance Monitoring (APM)
- Distributed Tracing
- Middleware Monitoring
- Server Monitoring
- Infrastructure Monitoring
- Security Information and Event Management (SIEM)
- Event Management
- Log Analysis / Dashboard
Within each role there are further levels, such as whether containers are supported, but I think this granularity is about right for categorizing roles.
Explanation of each role
Digital Experience Management
Representative tools:
- Marketing origin: Adobe Analytics, Google Analytics.
- APM origin: Firebase, Dynatrace UEM, New Relic, AppDynamics
Overview:
- A UX monitoring tool, also called User Experience Monitoring. Some tools originate in marketing and others in APM, but both analyze user behavior.
- Probably the hardest role to choose a tool for, because it is the prime example of tools with similar purposes but different origins
- Tools from the APM side seem weaker than marketing tools at analyzing business information. They may be best suited to correlating response time with retention rate or drop-off counts.
- On the other hand, marketing tools often cannot show response times, so for the time being it is reasonable to use both.
- Since real-time results are not required, I think many organizations build their own with log analysis.
Service Health Check
Representative tools:
- Zabbix, NetCool, Dynatrace, New Relic
Overview:
- Alive monitoring, or health check. Confirms that the service is operating normally as a service. A typical example is confirming that a request sent to a specific URL (such as /healthcheck) returns normally
- What matters is that the application is healthy, not merely that the server or the process is alive
- Be careful that the health check implementation is not too simple, or you end up with a zombie that returns 400 or 500 for everything except the health check
- Advanced applications and middleware also report which functions are not working.
- It exists as a feature of most APMs and integrated monitoring tools, but it can also be introduced on its own, so I list it as a separate role
- With the spread of containers, processes dying has become a matter of course, so this is an area that needs new UIs, such as heat maps that do not normally surface the death of an individual application instance
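To make the point about overly simple health checks concrete, here is a minimal sketch of a /healthcheck handler that probes its dependencies instead of always answering 200. The probe functions `check_database` and `check_message_queue` are hypothetical placeholders, and the 503 status and JSON body shape are assumptions for illustration, not something the tools above prescribe.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


# Hypothetical probes: a real service would run a cheap query against
# its own database, queue, cache, and so on.
def check_database() -> bool:
    return True


def check_message_queue() -> bool:
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "message_queue": check_message_queue,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every probe and return (HTTP status, response body)."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "degraded", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /healthcheck using the probes registered in CHECKS."""

    def do_GET(self) -> None:
        if self.path != "/healthcheck":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler can be served with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()`. Because the body lists each probe, it also tells the monitoring side which function is not working.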
Application Performance Monitoring (APM)
Representative tools:
- Dynatrace, New Relic, Datadog, ENdoSnipe, Stackdriver APM
Overview:
- Tools for monitoring and analyzing application performance
- The main function is to detect response degradation and identify its cause (diagnosis). Because of this, many APM tools also support distributed tracing
- Because they can pinpoint slowness down to the method level, having such a tool or not dramatically changes how quickly you can respond to a failure
- Server metrics are treated only as reference information, and often only CPU and memory are available per server
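As a rough illustration of what "method level" means, the sketch below times individual functions with a decorator and reports the one with the largest total time. This is only a toy under stated assumptions: real APM agents instrument code automatically and collect far more context, and the names `timed`, `TIMINGS`, and `slowest` are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, Dict, List

# Elapsed seconds per function name (hypothetical in-memory store).
TIMINGS: Dict[str, List[float]] = defaultdict(list)


def timed(func: Callable) -> Callable:
    """Record how long each call to func takes, even if it raises."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            TIMINGS[func.__name__].append(time.perf_counter() - start)

    return wrapper


def slowest(timings: Dict[str, List[float]]) -> str:
    """Return the name of the function with the largest total elapsed time."""
    totals = {name: sum(samples) for name, samples in timings.items()}
    return max(totals, key=totals.get)
```

Decorating the handlers of a request with `@timed` and calling `slowest(TIMINGS)` after a slow request points at the function to look into first.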
Distributed Tracing
Representative tools:
- Dynatrace, New Relic, Datadog, Stackdriver APM, Jaeger, Dapper, OpenTracing
Overview:
- A tool that tracks a request across multiple applications within a single context
- Since SOA and MSA (microservice architecture) inevitably span multiple applications, tracing is essential for root cause analysis and its importance has grown
- Until general-purpose tools appeared, everyone wrote a trace ID into the logs, narrowed things down by timestamp, and worked it out by hand
- A field where APM vendors are developing actively. OpenTelemetry, which merges OpenTracing and OpenCensus and is supported by various commercial tools, looks set to become the de facto standard
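The hand-rolled approach mentioned above, writing a trace ID into every log line, can be sketched as follows. This is not the OpenTelemetry API; the header name `X-Trace-Id`, the two service functions, and the in-memory `LOG` list are all assumptions made for illustration.

```python
import uuid
from typing import Dict, List

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
LOG: List[dict] = []  # stands in for a shared log store


def log(service: str, trace_id: str, message: str) -> None:
    """Write one log entry tagged with the trace ID."""
    LOG.append({"service": service, "trace_id": trace_id, "message": message})


def handle_backend(headers: Dict[str, str]) -> None:
    """Downstream service: reuses the trace ID it received."""
    trace_id = headers[TRACE_HEADER]
    log("backend", trace_id, "queried database")


def handle_frontend(headers: Dict[str, str]) -> str:
    """Entry point: creates a trace ID if the caller did not send one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    log("frontend", trace_id, "received request")
    # Propagate the same ID to the next service in the call chain.
    handle_backend({TRACE_HEADER: trace_id})
    return trace_id


def entries_for(trace_id: str) -> List[dict]:
    """Collect every log entry that belongs to one request."""
    return [entry for entry in LOG if entry["trace_id"] == trace_id]
```

Filtering with `entries_for(trace_id)` reconstructs the path of one request across both services, which is the part that dedicated tracing tools automate and visualize.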
Middleware Monitoring
Representative tools:
- Oracle Enterprise Manager, Cloudera Manager, Confluent Control Center, Prometheus
Overview:
- A tool specialized for monitoring middleware such as DBs, MQs, and distributed processing platforms. Usually provided by the middleware vendor
- Because these tools are specialized beyond what general monitoring tools offer, the level of detail in the data and the usability of the UI are often much better
- However, they can end up siloed from other systems and their alerting is sometimes weak, so the safe approach is to forward key metrics to the tool used for Event Management and use the specialized tool for analysis when problems occur
Server Monitoring
Representative tools:
- Zabbix, Datadog, Prometheus, Mackerel, Nagios, Dynatrace
Overview:
- A tool that collects and monitors server metrics such as CPU, memory, and load
- Linux and Windows servers are monitored here
- A type of tool that has been in use for many years. The basics of monitoring.
- However, these metrics are not service level indicators (SLIs) by nature, so in the context of AIOps it is recommended not to watch them day to day
Infrastructure Monitoring
Representative tools:
- Zabbix, Datadog, Prometheus, Mackerel, Nagios
Overview:
- Monitors the values of network devices and storage devices, typically via SNMP
- Basically the same as Server Monitoring; the difference is whether an agent can be installed. Appliances therefore fall here as well.
- An important monitoring point, because these are usually shared resources and susceptible to noisy neighbours
Security Information and Event Management (SIEM)
Representative tools:
- Splunk, Kibana SIEM, Azure Sentinel, McAfee SIEM
Overview:
- A tool for collecting and analyzing security logs. Pronounced “seam.”
- Collects and aggregates audit logs (who performed what operation, and when) from PCs, servers, operations tools, and back-office tools
- It monitors not only internal information but also network packets and the like, and is used for intrusion analysis.
- The SOC team and CIRT review and report on the analyzed information. Recently, many tools detect threats automatically with AI
- If your company handles personal information and this is not properly in place, expect the government to be very unhappy with you
Event Management
Representative tools:
- Zabbix, Prometheus, Stackdriver Error Reporting, PagerDuty
Overview:
- The ability to send alerts. Almost every monitoring tool comes with it
- When using multiple monitoring tools, alerts issued from each one are hard to manage, so the basic approach is to aggregate them in one specific tool and act from there
- The most basic alert is email, but “email is not an incident management tool.” Send alerts to Jira, Redmine, ServiceNow, etc. if possible
- Only the higher-priority ones should also go to email, chat, or phone
- A smart tool does not simply alert the moment an event occurs. It groups similar alerts, verifies them with custom scripts, and uses AI to check correlation with other metrics, reducing the number of alerts so that you can focus on the essential problems
- The main battlefield of AIOps. Escaping noisy alerts and “boy who cried wolf” alerts is an important function for optimizing operations
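The simplest form of the alert reduction described above is suppressing repeats of the same alert within a time window. The sketch below assumes each alert carries a `key` that identifies "the same problem" (for example host plus check name) and uses a 300-second window; both are illustrative assumptions, and real tools add correlation and custom checks on top.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Alert:
    timestamp: float  # seconds since some epoch
    source: str       # monitoring tool that raised the alert
    key: str          # hypothetical identity of the problem, e.g. "web01:cpu"


def deduplicate(alerts: Iterable[Alert], window_seconds: float = 300.0) -> List[Alert]:
    """Keep the first alert per key, then suppress repeats inside the window."""
    last_sent: Dict[str, float] = {}
    notifications: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        previous = last_sent.get(alert.key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            notifications.append(alert)
            last_sent[alert.key] = alert.timestamp
    return notifications
```

Because the key is independent of `source`, the same problem reported by two different monitoring tools collapses into one notification, which is the reason for aggregating alerts in a single place.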
Log Analysis / Dashboard
Representative tools:
- ELK, Metabase, Redash, Stackdriver + BigQuery, Grafana, Splunk
Overview:
- A tool for collecting, analyzing, and visualizing various logs. Collection, analysis, and visualization can each be handled by a different tool.
- The main use is to show, on your own dashboard, analysis that existing tools cannot do
- Often used as DEM / UEM.
- In many cases the same platform is also used by the business side as BI, not only for system monitoring
- If the logs are huge, stream processing such as Kafka or Spark may be placed in between
Monitoring tool example portfolio
For example, when a company uses multiple monitoring tools, it can be represented as follows.
From this you can see that if you want to introduce an APM, the candidate tool needs to be compared with New Relic, and if that tool also covers Server Monitoring, it will partly be compared with Zabbix as well.
Also, since there is no Service Health Check or SIEM, depending on the company this raises the question, “Is this setup really okay?”
This table is for checking the portfolio and finding missing or duplicated tools, so evaluating, say, Dynatrace against Instana as APMs requires a separate set of criteria.
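The portfolio check can also be sketched as a small script. The `PORTFOLIO` mapping below is hypothetical: only New Relic as the APM, Zabbix for Server Monitoring, and the absence of Service Health Check and SIEM come from the example above, and the remaining assignments are filled in purely for illustration.

```python
from typing import Dict, List

ROLES: List[str] = [
    "Digital Experience Management",
    "Service Health Check",
    "Application Performance Monitoring (APM)",
    "Distributed Tracing",
    "Middleware Monitoring",
    "Server Monitoring",
    "Infrastructure Monitoring",
    "Security Information and Event Management (SIEM)",
    "Event Management",
    "Log Analysis / Dashboard",
]

# Hypothetical portfolio: tool -> roles it is used for in this company.
PORTFOLIO: Dict[str, List[str]] = {
    "New Relic": [
        "Digital Experience Management",
        "Application Performance Monitoring (APM)",
        "Distributed Tracing",
    ],
    "Zabbix": ["Server Monitoring", "Infrastructure Monitoring", "Event Management"],
    "Prometheus": ["Middleware Monitoring"],
    "ELK": ["Log Analysis / Dashboard"],
}


def coverage(portfolio: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each role to the tools covering it (raises KeyError on unknown roles)."""
    covered: Dict[str, List[str]] = {role: [] for role in ROLES}
    for tool, roles in portfolio.items():
        for role in roles:
            covered[role].append(tool)
    return covered


def gaps(portfolio: Dict[str, List[str]]) -> List[str]:
    """Roles that no tool in the portfolio covers."""
    return [role for role, tools in coverage(portfolio).items() if not tools]


def overlaps(portfolio: Dict[str, List[str]], candidate_roles: List[str]) -> Dict[str, List[str]]:
    """Existing tools that a candidate covering candidate_roles must be compared with."""
    covered = coverage(portfolio)
    return {role: covered[role] for role in candidate_roles if covered[role]}
```

Here `gaps(PORTFOLIO)` lists the roles with no tool at all, and `overlaps(PORTFOLIO, [...])` with the roles of a candidate tool shows which existing tools it would be evaluated against, role by role.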