Tuesday, August 4, 2015

Logging and Audit and Monitoring and Observability and Site Reliability Engineering and Chaos Engineering and Computer Forensics and DevSecOps and AIOPS

  • Open-Falcon

A Distributed and High-Performance Monitoring System
http://open-falcon.org/


  • Zabbix is a mature and effortless enterprise-class open source monitoring solution for network monitoring and application monitoring of millions of metrics

https://www.zabbix.com/
  • Riemann monitors distributed systems
Riemann aggregates events from your servers and applications with a powerful stream processing language. Send an email for every exception raised by your code. Track the latency distribution of your web app. See the top processes on any host, by memory and CPU. Combine statistics from every Riak node in your cluster and forward to Graphite. Send alerts when a key process fails to check in. Know how many users signed up right this second.
http://riemann.io/
  • Observium is a low-maintenance auto-discovering network monitoring platform supporting a wide range of device types, platforms and operating systems including Cisco, Windows, Linux, HP, Juniper, Dell, FreeBSD, Brocade, Netscaler, NetApp
http://www.observium.org/
  • Cockpit makes it easy to administer your GNU/Linux servers via a web browser
http://cockpit-project.org/
  • OpenNMS is the world’s first enterprise grade network management application platform developed under the open source model.
http://www.opennms.org/
  • Cricket

Cricket is a high performance, extremely flexible system for monitoring trends in time-series data. Cricket was expressly developed to help network managers visualize and understand the traffic on their networks, but it can be used for all kinds of other jobs as well.
http://cricket.sourceforge.net/
  • Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug-and-play capabilities. After completing an installation, a high number of monitoring plugins will be up and running with no extra effort.

Using Munin you can easily monitor the performance of your computers, networks, SANs, applications, weather measurements and whatever comes to mind. It makes it easy to determine "what's different today" when a performance problem crops up. It makes it easy to see how you're doing capacity-wise on any resources
http://munin-monitoring.org/


  • Icinga: Monitoring as Code. Use its object-based configuration or provision your monitoring code through the REST API. Scale and secure.

Monitor infrastructures of all sizes with the integrated cluster system secured by SSL
Integrate with many popular DevOps tools and extend Icinga to meet your needs
https://icinga.com/


  • collectd is a daemon which collects system and application performance metrics periodically and provides mechanisms to store the values in a variety of ways, for example in RRD files.

https://collectd.org/

Sending data by using the Monitoring plugin (collectd)
https://console.bluemix.net/docs/services/cloud-monitoring/send-metrics/conf_monitoring_plugin.html#conf_monitoring_plugin


  • Grafana: Data visualization & monitoring with support for Graphite, InfluxDB, Prometheus, Elasticsearch and many more databases.

The leading open source software for time series analytics
https://grafana.com/


  • Monitor servers, services, application health, and business KPIs. Get notified about failures before your users do. Collect and analyze custom metrics.

Workflow automation for monitoring
From bare metal to Kubernetes, the Sensu monitoring event pipeline gives you complete visibility across every system, every protocol, every time.
https://sensu.io/


  • The Elastic Stack

Built on an open source foundation, the Elastic Stack lets you reliably and securely take data from any source, in any format, and search, analyze, and visualize it in real time
https://www.elastic.co/products


  • Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack, so you can do anything from learning why you're getting paged at 2:00 a.m. to understanding the impact rain might have on your quarterly numbers.

https://www.elastic.co/products/kibana


  • Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite “stash.” (Ours is Elasticsearch, naturally.) 

https://www.elastic.co/products/logstash



  • Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected

https://www.elastic.co/products/elasticsearch
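
Since Elasticsearch is driven entirely over its REST API, the store-and-search loop is easy to try by hand. A minimal Python sketch, assuming a local node on localhost:9200, the requests package, and an illustrative index name "app-logs":

    # Index one document, then search it back over the Elasticsearch REST API.
    # The index and field names are made up for illustration.
    import requests

    ES = "http://localhost:9200"

    doc = {"service": "checkout", "level": "error", "message": "payment timeout"}
    # refresh=true makes the document searchable immediately in this demo.
    requests.put(f"{ES}/app-logs/_doc/1", params={"refresh": "true"}, json=doc).raise_for_status()

    query = {"query": {"match": {"message": "timeout"}}}
    resp = requests.get(f"{ES}/app-logs/_search", json=query)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_source"])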


  • Lightweight data shippers
Beats is a free and open platform for single-purpose data shippers. They send data from hundreds or thousands of machines and systems to Logstash or Elasticsearch. 
https://www.elastic.co/beats/

  • What are Examples of Beats?
There are currently six official Beats from Elastic: Filebeat, Metricbeat, Packetbeat, Heartbeat, Winlogbeat, and Auditbeat. All of these beats are open source and Apache-licensed. Elastic maintains a list of regularly updated community beats that users can download, install, and even modify as needed. While each beat has its own distinct use, they all solve the common problem of gathering data at its source and making it easy and efficient to ship that data to Elasticsearch.

Filebeat
Filebeat is designed to read files from your system. It is particularly useful for system and application log files, but can be used for any text files that you would like to index to Elasticsearch in some way. In the logging case, it helps centralize logs and files in an efficient manner by reading from your various servers and VMs, then shipping to a central Logstash or Elasticsearch instance. Additionally, Filebeat eases the configuration process by including “modules” for grabbing common log file formats from MySQL, Apache, NGINX and more

Metricbeat
As the name implies, Metricbeat is used to collect metrics from servers and systems. It is a lightweight platform dedicated to sending system and service statistics. Like Filebeat, Metricbeat includes modules to grab metrics from operating systems like Linux, Windows and Mac OS, and from applications such as Apache, MongoDB, MySQL and NGINX. Metricbeat is extremely lightweight and can be installed on your systems without impacting system or application performance. As with all of the Beats, Metricbeat makes it easy to create your own custom modules.

Packetbeat
Packetbeat, a lightweight network packet analyzer, monitors network protocols to enable users to keep tabs on network latency, errors, response times, SLA performance, user access patterns and more. With Packetbeat, data is processed in real time so users can understand and monitor how traffic is flowing through their network. Furthermore, Packetbeat supports multiple application layer protocols, including MySQL and HTTP.

Winlogbeat
Winlogbeat is a tool specifically designed for providing live streams of Windows event logs. It can read events from any Windows event log channel, monitoring log-ons, log-on failures, USB storage device usage and the installation of new software programs. The raw data collected by Winlogbeat is automatically sent to Elasticsearch and then indexed for convenient future reference. Winlogbeat acts as a security enhancement tool and makes it possible for a company to keep tabs on literally everything that is happening on its Windows-powered hosts.

Auditbeat
Auditbeat performs a similar function on Linux platforms, monitoring user and process activity across your fleet. Auditd event data is analyzed and sent, in real time, to Elasticsearch for monitoring the security of your environment.

Heartbeat
Heartbeat is a lightweight shipper for uptime monitoring. It monitors services basically by pinging them and then ships data to Elasticsearch for analysis and visualization. Heartbeat can ping using ICMP, TCP and HTTP. It has support for TLS, authentication and proxies. Its efficient DNS resolution enables it to monitor every single host behind a load-balanced server.
 
https://www.objectrocket.com/resource/what-are-elasticsearch-beats/

  • Graphite
Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite's processing backend, carbon, which stores the data in Graphite's specialized database. The data can then be visualized through graphite's web interfaces.

Graphite is a free open-source software (FOSS) tool that monitors and graphs numeric time-series data such as the performance of computer systems
A highly scalable real-time graphing system
https://github.com/graphite-project/graphite-web
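
The "send it to carbon" step above is just a line of text over TCP. A minimal Python sketch of the plaintext protocol, assuming carbon is listening on the default port 2003 and using a made-up metric path:

    # Push one datapoint to Graphite's carbon backend via the plaintext
    # protocol: "metric.path value unix_timestamp\n" on TCP port 2003.
    import socket
    import time

    metric = "servers.web01.nginx.requests_per_sec"   # illustrative metric path
    value = 42
    timestamp = int(time.time())

    line = f"{metric} {value} {timestamp}\n"
    with socket.create_connection(("localhost", 2003)) as sock:
        sock.sendall(line.encode("ascii"))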

  • Highcharts
Highcharts is a charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application. Highcharts currently supports line, spline, area, areaspline, column, bar, pie, scatter, angular gauges, arearange, areasplinerange, columnrange, bubble, box plot, error bars, funnel, waterfall and polar chart types.
http://www.highcharts.com/

  • Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization

http://ganglia.sourceforge.net/


  • LibreNMS: Automatically discover your entire network using CDP, FDP, LLDP, OSPF, BGP, SNMP and ARP.

Native iPhone and Android apps are available which provide core functionality.
https://www.librenms.org/


  • SmokePing keeps track of your network latency:

Smokeping is a latency measurement tool. It sends test packets out to the net and measures the amount of time they need to travel from one place to the other and back.
https://oss.oetiker.ch/smokeping/


  • Cacti is a complete network graphing solution designed to harness the power of RRDTool's data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box

https://cacti.net/


  • System Monitoring Using NAGIOS, Cacti, and Prism

Cacti uses Round Robin Database (RRD) and MySQL database technologies to store collected information. MySQL and PHP are used to provide a graphical, web-based interface to the RRD databases.
RRD database technology was popularized in the widely known MRTG graphing project.
https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/14A-Davis/davis-paper.pdf

  • sysdig: Drill down into individual containers, gaining protocol-level views of an application's behavior. Easily find application errors & bottlenecks.

https://sysdig.com/opensource/sysdig/


  • Prometheus: Power your metrics and alerting with a leading open-source monitoring solution

https://prometheus.io/


  • InfluxDB: Scalable datastore for metrics, events, and real-time analytics

https://github.com/influxdata/influxdb


  • cAdvisor (Container Advisor) provides container users an understanding of the resource usage and performance characteristics of their running containers. It is a running daemon that collects, aggregates, processes, and exports information about running containers. Specifically, for each container it keeps resource isolation parameters, historical resource usage, histograms of complete historical resource usage and network statistics. This data is exported by container and machine-wide.

https://github.com/google/cadvisor

  • Filebeat

Forget using SSH when you have tens, hundreds, or even thousands of servers, virtual machines, and containers generating logs. Filebeat helps you keep the simple things simple by offering a lightweight way to forward and centralize logs and files
https://www.elastic.co/products/beats/filebeat


  • Prometheus + InfluxDB + Grafana

Heapster or Prometheus as a data aggregator
InfluxDB as storage backend
Grafana as a data visualization platform
Prometheus has become the de facto standard for Kubernetes data aggregation
All the components are open source

Prometheus + ELK stack (ElasticSearch + Logstash + Kibana)
Prometheus is used as a data aggregator
ElasticSearch as storage backend
Logstash as a logging manager
Kibana as a data visualization platform.
All the components are open source

https://medium.com/containerum/4-tools-to-monitor-your-kubernetes-cluster-efficiently-ceaf62818eea


  • Monitoring SRE's Golden Signals 

These signals are especially important as we move to microservices and containers, where more functions are spread more thinly, including 3rd parties
There are many metrics to monitor, but industry experience has shown that these 5: rate, errors, latency, saturation and utilization, contain virtually all the information you need to know what’s going on and where.

What are golden signals?
There is no definitive agreement, but these are the three main lists of golden signals today:
    From the Google SRE book: Latency, Traffic, Errors, Saturation
    USE Method (from Brendan Gregg): Utilization, Saturation, Errors
    RED Method (from Tom Wilkie): Rate, Errors, and Duration

USE is about resources with an internal view, while RED is about requests and real work, with an external view.
    Request Rate — request rate, in requests/sec.
    Error Rate — error rate, in errors/sec.
    Latency — response time, including queue/wait time, in milliseconds.
    Saturation — how overloaded something is, directly measured by things like queue depth (or sometimes concurrency). Becomes non-zero when the system gets saturated.
    Utilization — how busy the resource or system is. Usually expressed 0–100% and most useful for predictions (saturation is usually more useful for alerts).

One of the key reasons these are “golden” signals is they try to measure things that directly affect the end-user and work-producing parts of the system — they are direct measurements of things that matter.
This means they are more useful than less-direct measurements such as CPU, RAM, networks, replication lag, and endless other things.
We use the golden signals in several ways:
    Alerting — tell us when something is wrong.
    Troubleshooting — help us find and fix the problem.
    Tuning & Capacity Planning — help us make things better over time.

Are you average or percentile?
Basic alerts typically use average values to compare against some threshold, but - if your monitoring system can do it - use median values instead, which are less sensitive to big/small outlier values. This will reduce false alerts.
Percentiles are even better. For example, you can alert on 95th percentile latency, which is a much better measure of bad user experience
Furthermore, anomaly detection allows for tighter alerting bands so you can find issues much faster than you would with static thresholds (which must be fairly broad to avoid false alerts).

https://www.infoq.com/articles/monitoring-SRE-golden-signals
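
A quick way to see why the article prefers medians and percentiles over averages is to run the numbers on a latency sample with a couple of outliers. A small Python sketch (latency values made up for illustration; statistics.quantiles needs Python 3.8+):

    # Mean vs. median vs. p95 on a latency sample with two slow outliers.
    import statistics

    latencies_ms = [12, 14, 13, 15, 11, 16, 14, 13, 900, 1200]

    mean = statistics.mean(latencies_ms)
    median = statistics.median(latencies_ms)
    p95 = statistics.quantiles(latencies_ms, n=100)[94]   # 95th percentile

    print(f"mean={mean:.1f} ms  median={median:.1f} ms  p95={p95:.1f} ms")
    # The mean is dragged far above what a typical user sees, the median hides
    # the slow tail entirely, and p95 directly reflects the bad-tail experience
    # that alerting usually cares about.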







  • First, what are the SRE Signals?


There are three common lists or methodologies:

    From the Google SRE book: Latency, Traffic, Errors, and Saturation
    USE Method (from Brendan Gregg): Utilization, Saturation, and Errors
    RED Method (from Tom Wilkie): Rate, Errors, and Duration

https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524



  • What is The RED Method?


The RED Method defines the three key metrics you should measure for every microservice in your architecture. Those metrics are:

    (Request) Rate - the number of requests, per second, your services are serving.
    (Request) Errors - the number of failed requests per second.
    (Request) Duration - distributions of the amount of time each request takes.

https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
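
The three RED metrics are easy to compute from raw request records. A minimal Python sketch over a one-minute window; the (status, duration) record format is made up for illustration, and a real service would pull this from its instrumentation or access logs:

    import statistics

    WINDOW_SECONDS = 60
    requests_seen = [  # (http_status, duration_ms) collected over the window
        (200, 35), (200, 41), (500, 12), (200, 38), (404, 22),
        (200, 44), (503, 15), (200, 39), (200, 37), (200, 40),
    ]

    rate = len(requests_seen) / WINDOW_SECONDS                               # requests/sec
    errors = sum(1 for s, _ in requests_seen if s >= 500) / WINDOW_SECONDS   # failed requests/sec
    durations = sorted(d for _, d in requests_seen)
    p50 = statistics.median(durations)
    p95 = durations[int(0.95 * (len(durations) - 1))]                        # rough 95th percentile

    print(f"rate={rate:.2f} req/s  errors={errors:.2f} err/s  p50={p50} ms  p95={p95} ms")
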
  • When ISPs bill "burstable" internet bandwidth, the 95th or 98th percentile usually cuts off the top 5% or 2% of bandwidth peaks in each month, and then bills at the nearest rate. In this way, infrequent peaks are ignored, and the customer is charged in a fairer way. The reason this statistic is so useful in measuring data throughput is that it gives a very accurate picture of the cost of the bandwidth. The 95th percentile says that 95% of the time, the usage is below this amount: so, the remaining 5% of the time, the usage is above that amount.
https://en.wikipedia.org/wiki/Percentile

For example, if a score is at the 86th percentile, where 86 is the percentile rank, it is equal to the value below which 86% of the observations may be found. In contrast, if it is in the 86th percentile, the score is at or below the value below which 86% of the observations may be found. Every score is in the 100th percentile.
The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). In general, percentiles and quartiles are specific types of quantiles.
https://pallipedia.org/percentile/#:~:text=A%20percentile%20(or%20a%20centile,the%20observations%20may%20be%20found.
  • Monitor your applications by using the golden signals

Latency
Latency is the time that it takes to service a request, or the metric that is formally known as response time. It’s important to measure the latency from service to service and the latency that the user is experiencing. Establish a baseline for application normalcy with latency. It is a key indicator of degradation in the application.

Don't use averages against latency, as they can be misleading. Rather, use histograms for this metric. Establishing percentile thresholds and values provides a better understanding of what the latency is. Values in the 95th or 99th percentile are key to detecting performance issues in a request or a component.

Be sure to monitor the latency of errors, too. One bad, long-running transaction can add latency to the good requests, making for unhappy users.

Traffic
Traffic is the amount of activity in the application. This value might be different depending on the characteristics of the application. Again, don't use averages. Examples of traffic include the number of requests that an API handled, the number of connections to an application server, and the bandwidth that was consumed to stream an application.

Errors
Errors are the rate of requests that are failing. Monitoring explicit errors, such as HTTP 500s, is straightforward. You also need to "catch" the HTTP 200s that are sharing the wrong content. Measure errors in rates.

Errors should expose bugs in the application, misconfigurations in the service, and dependency failures. Error rates can also affect other measurements, such as lowering latency or increasing saturation.

Saturation
Saturation is how "full" your service is. The type of application that you're monitoring is directly related to the utilization metrics that you use to determine saturation. Saturation is the most challenging signal to implement. You need utilization metrics and the utmost flexibility to determine saturation.

A few examples for determining saturation are as follows:

CPU and memory for all applications
Disk I/O rates for databases and streaming applications
Heap, memory, thread pool garbage collection for Java™ applications
99th percentile for latency
Keep in mind that the application services usually start to degrade before a metric reaches 100% utilization.

It takes time to set up the signals for all the components in today’s applications. The easiest path is to shift left and begin monitoring and testing the application during the development and load-test phases, understanding the performance characteristics before the production rollout.

The successful implementation of the golden signals is key to achieving observability. Apply the signals to these activities:

Monitoring application runtimes
Monitoring the user experience
Synthetic or black-box monitoring
Creating useful dashboards that provide information about the monitored component

Collect and store metric data to support query capabilities and establish performance normalcy and trending for the monitored service. You can also use metric data to explore hypotheses and institute AIops capabilities. Metric data can provide searchable and extensible data dimensions and be a robust data source for dashboards. Dashboards are no longer static and require slice-and-dice capabilities of the data to investigate an incident or improve the application's performance or scalability.


Send actionable alerts. Make sure that alerts require intervention by a first responder and that they contain valuable context as to what is going on.

https://www.ibm.com/garage/method/practices/manage/golden-signals/
  • White-box monitoring

Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.

Black-box monitoring
Testing externally visible behavior as a user would see it.

The Four Golden Signals

Latency
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second

Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

Saturation
How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
  • Metrictank is a multi-tenant timeseries engine for Graphite and friends. It provides long term storage, high availability, efficient storage, retrieval and processing for large scale environments.

https://github.com/grafana/metrictank

  • Announcing the first step in our journey to create a new and modern Graphite-compatible stack for large environments: metrictank. A high performance replacement for carbon and whisper.

Graphite Compatible
metrictank is fully compatible and works seamlessly with all your existing tools and dashboards.
metrictank uses the proven, rock-solid Cassandra for highly reliable clustered storage that works at massive scale. 
http://milo.wearecapacity.com/metrictank/

Metrictank publishes its own internal stats to a graphite-compatible datastore (such as graphite/carbon or metrictank itself). This dashboard queries that datasource.

https://grafana.com/dashboards/279
  • Graylog is a powerful open-source log management platform. It aggregates and extracts important data from server logs, which are often sent using the Syslog protocol. It also allows you to search and visualize the logs in a web interface.

https://www.digitalocean.com/community/tutorials/how-to-manage-logs-with-graylog-2-on-ubuntu-16-04


How to Use Graylog for Software Monitoring
Graylog, an open-source solution for log management. We usually have it paired up with Grafana, an open-source dashboard for data visualization.
What Is Graylog?
Graylog is a powerful platform that allows for easy log management of both structured and unstructured data along with debugging applications.
It is based on Elasticsearch, MongoDB, and Scala.
We use Graylog primarily as the stash for the logs of the web applications we build.
However, it is also effective when working with raw strings (i.e. syslog): the tool parses them into the structured data we need.
In other words, when integrated properly with a web app, Graylog helps engineers analyze the system behavior on an almost per-code-line basis (see the sketch after this item's link).
Graylog Use Cases
The main advantage of Graylog is that it provides a perfect single instance of log collection for the whole system.  
At Logicify, we use Graylog both for the applications under development and the ones already released publicly. 
As Graylog consistently stores all the logs of an application, it allows tracking of the system’s state for every specific moment of time. This gives developers an efficient mechanism to understand the context of any error
Use in the Production Phase
In software products that are already released for public use, Graylog is also applied for log storage. 
https://dzone.com/articles/how-to-use-graylog-for-technical-monitoring-in-sof
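
GELF is not named in the excerpt above, but it is Graylog's native structured-log format and a common way for a web app to send logs directly over the network. A hedged Python sketch, assuming a Graylog GELF HTTP input listening on the default port 12201:

    # Post one structured log message to a Graylog GELF HTTP input.
    # Field names starting with "_" are custom fields added for illustration.
    import requests

    gelf_message = {
        "version": "1.1",
        "host": "web01",
        "short_message": "payment timeout for order 1234",
        "level": 3,              # syslog severity: 3 = error
        "_service": "checkout",  # custom field
        "_order_id": 1234,       # custom field
    }

    resp = requests.post("http://localhost:12201/gelf", json=gelf_message)
    resp.raise_for_status()   # Graylog normally answers 202 Accepted
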
  • Advantages of Graylog+Grafana Compared to ELK Stack

Graylog has proved effective and user-friendly for log storage and management

Advantages of Graylog
The tool has a powerful search syntax, so it is easy to find exactly what you are looking for, even if you have terabytes of log data. The search queries could be saved.
Graylog offers an archiving functionality, so everything older than 30 days could be stored on slow storage and re-imported into Graylog when such a need appears (for example, when the dev team need to investigate a certain event from the past)
Python applications could be easily connected with Graylog as there is an out-of-box library for this.

Graylog versus ELK
Graylog server (the entire application and web interface), combined with MongoDB and Elasticsearch, is often compared to the ELK stack (Elasticsearch, Logstash, and Kibana).
Graylog is positioned as a powerful logging solution, while ELK is a Big Data solution
Graylog can receive structured logs and standard syslog directly from an application through the network protocol. In contrast, ELK analyzes already-collected plain-text logs using Logstash and then parses and forwards them to Elasticsearch
Graylog in this sense is more convenient as it offers a single-application solution (excluding ElasticSearch as a flexible data storage) with almost the same functionality. So the time needed to deploy a usable solution is shorter. 
https://medium.com/@logicify/advantages-of-graylog-grafana-compared-to-elk-stack-a7c86d58bc2c
  • Comparing network monitoring tools: Nagios, Cricket, Cacti, Zenoss, Zabbix


I used tools that fellow administrators will find familiar: Nagios and Cacti, and another less famous text-configuration-based monitoring tool called Cricket. That worked well somehow, but Cricket was hard for my coworkers to learn, and Cacti seems unreliable and fundamentally broken in terms of SNMP checking. Besides, why do I have to set up availability checking in Nagios and set up checking of the same parameters in another piece of software to draw graphs? Then in 2009 I came across an open-source software I hadn't heard of before: Zabbix. And although it has a few rough edges, it seems way more professional than other common tools (the commercial tools I saw were even worse than the open-source variants). I tried it and, after a lot of reading and trying, it looks like it has good potential to replace Nagios and Cacti.
workaround.org/try-zabbix

  • Grafana vs. Kibana: The Key Differences to Know


Both Kibana and Grafana are powerful visualization tools. However, at their core, they are both used for different data types and use cases
Grafana together with a time-series database such as Graphite or InfluxDB is a combination used for metrics analysis
Kibana is part of the popular ELK Stack, used for exploring log data
A significant number of organizations will use both tools as part of their overall monitoring stack. At Logz.io we use both tools to monitor our production environment, with Grafana hooked up to Graphite, Prometheus and Elasticsearch.
https://logz.io/blog/grafana-vs-kibana/


  • Log Management Comparison: ELK vs Graylog


Logging with ELK
ELK is an acronym for 3 open-source projects – ElasticSearch, Logstash, and Kibana.
ElasticSearch – stores large amounts of data and lets you search it
Logstash – processes the data
Kibana – a GUI that lets you visualize large amounts of data

Pros:
Robust solution
Variety of plugins
Logstash allows you to create customized log processing pipeline
Incredible Kibana visualizations
Control over how you index data in ElasticSearch

Cons:
Steep learning curve
Kibana has no default “logging” dashboards
Requires intensive management
Authentication and Alerting are paid features

Logging with Graylog
If you want to add to its functionality, you will likely have to add other tools like Grafana for intricate graphs, an InfluxDB or Graphite datastore or other custom scripts and programs

Pros:
Quick setup
Authentication and Authorization included for free
Parsing, alerting, some basic graphing
Small learning curve
Mostly GUI-based

Cons:
Limited scope of what it does well
Parsing is less powerful than Logstash's
Graphing is basic – will need to use Grafana and/or Kibana
Fewer plugins available than for Logstash and Kibana

DevOps engineers and CTOs mostly care about speed, reliability, and flexibility in queries and visualizations. For this, the ELK stack is a better choice.
If alerting is important to you, Graylog is your best option. Graylog is also the better choice for security log collection, while the ELK stack can be a bit more difficult to implement for that purpose.

https://coralogix.com/log-analytics-blog/log-management-comparison-elk-vs-graylog/


  • Best of 2018: Log Monitoring and Analysis: Comparing ELK, Splunk and Graylog


Elasticsearch is a modern search and analytics engine based on Apache Lucene, while Logstash provides data processing and enrichment. Kibana offers logs discovery and visualization.

Splunk is a platform for searching, analyzing and visualizing the machine-generated data gathered from the websites, applications, sensors, devices etc. covering the entire infrastructure landscape.
It communicates with different log files and stores file data in the form of events in local indexes. It provides easy search capabilities and has a wide array of options to collect logs from multiple sources.
Graylog offers open source log monitoring tools providing capabilities similar to ELK and Splunk. Graylog performs centralized log monitoring, where Graylog is used for data processing while Elasticsearch and MongoDB are used for search and storage. It provides log archival and drill-down of metrics and measurements.


https://devops.com/log-monitoring-and-analysis-comparing-elk-splunk-and-graylog/
  • What is API Monitoring? 


API Monitoring Fundamentals
UPTIME MONITORING
Be the first to know when an API is down. 
PERFORMANCE MEASUREMENT
Get visibility into API performance 
DATA VALIDATION
Ensure that the structure and content of your API calls are returning the data that you, and your customers, expect.

Five Steps to API Monitoring Success

1-Run API monitors frequently
2-Validate response data
Add assertions to your API monitors to make sure your APIs are returning the right data.
3-Cover functional use cases
4-Include integrations with third-party & partner APIs
5-Get a complete performance picture
https://www.runscope.com/api-monitoring
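
Those five steps boil down to a scripted request with assertions. A minimal Python sketch of one check (the endpoint, latency budget, and expected fields are all made up for illustration; a real monitor would run this on a schedule and alert on failure):

    import time
    import requests

    URL = "https://api.example.com/v1/health"   # hypothetical endpoint

    start = time.monotonic()
    resp = requests.get(URL, timeout=5)
    latency_ms = (time.monotonic() - start) * 1000

    # Uptime and performance checks
    assert resp.status_code == 200, f"API down or erroring: HTTP {resp.status_code}"
    assert latency_ms < 1000, f"API too slow: {latency_ms:.0f} ms"

    # Data validation checks
    body = resp.json()
    assert body.get("status") == "ok", f"unexpected payload: {body}"
    assert "version" in body, "response missing expected 'version' field"

    print(f"OK in {latency_ms:.0f} ms")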


  • Request tracing is the ultimate insight tool. Request tracing tracks operations inside and across different systems. Practically speaking, this allows engineers to see how long an operation took in a web server, database, application code, or entirely different systems, all presented along a timeline. Request tracing is especially valuable in distributed systems where a single transaction (such as “create an account”) spans multiple systems.

Request tracing complements logs and metrics. A trace tells you when one of your flows is broken or slow, along with the latency of each step. However, traces don’t explain latency or errors. Logs can explain why. Metrics allow deeper analysis into system faults. Traces are also specific to a single operation; they are not aggregated like logs or metrics. Tracing, logs, and metrics form the ultimate telemetry solution. Teams armed with all three are well equipped to debug and resolve production problems.
Zipkin and Jaeger are two popular choices for request tracing. Zipkin was originally inspired by Dapper and developed by Twitter.
Jaeger was originally built and open sourced by Uber. Jaeger is a Cloud Native Computing Foundation project.
https://logz.io/blog/zipkin-vs-jaeger/


  • What is OpenTelemetry?

OpenCensus and OpenTracing have merged to form OpenTelemetry, which serves as the next major version of OpenCensus and OpenTracing.
OpenTelemetry is made up of an integrated set of APIs and libraries as well as a collection mechanism via an agent and collector. These components are used to generate, collect, and describe telemetry about distributed systems. This data includes basic context propagation, distributed traces, metrics, and other signals in the future. OpenTelemetry is designed to make it easy to get critical telemetry data out of your services and into your backend(s) of choice. For each supported language it offers a single set of APIs, libraries, and data specifications, and developers can take advantage of whichever components they see fit.

OpenTelemetry is a CNCF incubating project.

Formed through a merger of the OpenTracing and OpenCensus projects.

https://opentelemetry.io/
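
A minimal Python sketch of what those APIs look like in practice, assuming the opentelemetry-api and opentelemetry-sdk packages; the console exporter below just prints spans instead of shipping them to a tracing backend, and the span names are made up:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Wire the SDK: spans go through a batch processor to the console exporter.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")

    # One parent span with a nested child span, as a distributed trace would show.
    with tracer.start_as_current_span("create-account") as span:
        span.set_attribute("user.plan", "free")
        with tracer.start_as_current_span("write-to-database"):
            pass  # real work happens here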

  • Telemetry is the in situ collection of measurements or other data at remote points and their automatic transmission to receiving equipment (telecommunication) for monitoring

Although the term commonly refers to wireless data transfer mechanisms (e.g., using radio, ultrasonic, or infrared systems), it also encompasses data transferred over other media such as a telephone or computer network, optical link or other wired communications like power line carriers. Many modern telemetry systems take advantage of the low cost and ubiquity of GSM networks by using SMS to receive and transmit telemetry data.

A telemeter is a physical device used in telemetry. It consists of a sensor, a transmission path, and a display, recording, or control device. Electronic devices are widely used in telemetry and can be wireless or hard-wired, analog or digital. Other technologies are also possible, such as mechanical, hydraulic and optical

Telemetry may be commutated to allow the transmission of multiple data streams in a fixed frame.

https://en.wikipedia.org/wiki/Telemetry

New Relic Is All In On The Future of Observability
https://blog.newrelic.com/product-news/observability-open-instrumentation-opentelemetry/

  • What is OpenCensus?

OpenCensus is a set of libraries for various languages that allow you to collect application metrics and distributed traces, then transfer the data to a backend of your choice in real time. This data can be analyzed by developers and admins to understand the health of the application and debug problems.
https://opencensus.io/


OpenCensus: A Stats Collection and Distributed Tracing Framework 
https://opensource.googleblog.com/2018/01/opencensus.html
  • Centreon - IT and Application monitoring software

Centreon is a network, system, and application supervision and monitoring tool
https://github.com/centreon/centreon

  • IPERF: How to test network Speed,Performance,Bandwidth


Network Throughput
The transfer rate of data from one place to another with respect to time is called throughput.
Throughput is considered a quality metric for hard disks, networks, etc. It is measured in Kbps (kilobits per second), Mbps (megabits per second) or Gbps (gigabits per second).

TCP Window
TCP (Transmission Control Protocol) is a reliable transport layer protocol used for network communications.
Whenever two machines are communicating with each other, each of them informs the other about the amount of bytes it is ready to receive at one time.
In other words, the maximum amount of data that a sender can send to the other end without an acknowledgement is called the window size. This TCP window size can sometimes badly limit network throughput.

Suppose you want to send 500 MB of data from one machine to the other, with a TCP window size of 64 KB.
That means that, to send the whole 500 MB, the sending machine has to stop and wait for an acknowledgement from the receiver roughly 8,000 times.
500 MB / 64 KB = 8,000

So you can clearly see that, if you increase the window size a little bit to tune TCP, it can bring a significant difference to the throughput achieved (see the sketch after the link below).

As we discussed before, not only the TCP window size but also network parameters like the following affect the throughput achieved during a connection:
    Out-of-order delivery
    Network jitter
    Packet loss out of the total number of packets

Network jitter = 0.167 ms (network jitter is the deviation in time for the periodic arrival of datagrams. If you are doing the test with servers on the other side of the globe, you might see higher jitter values in iperf output.)



https://www.slashroot.in/iperf-how-test-network-speedperformancebandwidth
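
The arithmetic above is easy to check, and it also shows the classic window/RTT ceiling on throughput. A small Python sketch with illustrative numbers (64 KB window, 50 ms round-trip time):

    WINDOW_BYTES = 64 * 1024          # 64 KB TCP window
    TRANSFER_BYTES = 500 * 1024**2    # 500 MB to send
    RTT_SECONDS = 0.05                # 50 ms round-trip time (illustrative)

    round_trips = TRANSFER_BYTES / WINDOW_BYTES               # ~8,000 stop-and-wait cycles
    max_throughput_bps = WINDOW_BYTES * 8 / RTT_SECONDS       # window-limited ceiling, bits/sec

    print(f"round trips needed : {round_trips:.0f}")
    print(f"window-limited rate: {max_throughput_bps / 1e6:.1f} Mbit/s")
    # Doubling the window halves the round trips and doubles the ceiling for the same RTT.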


  • What is Docker Monitoring?


The use of containers to build application environments has a disruptive impact on traditional monitoring methods, because containers don’t fit well with the assumptions made by traditional tools and methods that were originally designed for bare-metal machines.

Common challenges
The dynamic nature of container-based application infrastructure brings new problems to monitoring tools. Docker also added another layer of infrastructure and network monitoring requirements to the overall scope.

Think of the typical scenario of multiple VMs provisioned on a bare-metal machine, with containers coming and going on each of those VMs. The monitoring requirements include checking the health of the bare-metal host, the VMs provisioned on it, and the containers active at a given point in time.
Of course, how well these components are interacting with each other and with the outside world should also be checked from the networking side of the monitoring requirements.

Monitor Docker host
Docker containers are run on a cluster of large bare-metal or virtual machines. Monitoring of these machines for their availability and performance is important. This falls into the traditional infrastructure monitoring.

Tracking containers
The Docker containers are run on a cluster of hosts and a specific Docker instance could be running on any one of those hosts depending on the scheduling and scaling strategies set in the container orchestration system used like Docker Swarm, Kubernetes, Apache Mesos and Hashicorp Nomad.
Ideally, there is no need to track where the containers are running but things are not ideal usually in production (and that’s why you need monitoring in the first place) and you may want to look at a specific container instance. Tracking information on the up and running containers would be handy in such situations and also to make sure that scheduling and scaling rules are actually enforced.

Runtime resource usage
As with bare-metal and virtual machines, CPU, memory and storage metrics are tracked for Docker containers as well.
The native Docker command "docker stats" returns some of these metrics (see the sketch after this item's link).

Container networking
Checking on container-level networks is one of the most important aspects of Docker monitoring

Tracking ephemeral containers
The containers come and go and it would be better if those are not tracked individually. The best method is to tag the containers with keywords. That way time series data from same type of containers could be looked up for monitoring and operational insights, irrespective of their lifecycle status.

Application endpoints
A container-based environment would be running a large, highly distributed application with each service running on one or more containers. The application checks could be done both at the container level, pod level and system-wide level. (A pod is a group of containers that offers a service.) Usually REST API endpoints would be available to perform such checks that could easily be plugged into any modern monitoring system to check the availability of related services.

Most of the popular monitoring tools are not equipped to monitor Docker containers though it is not hard to extend them to support containers.

https://www.bmc.com/blogs/docker-monitoring-explained-monitor-containers-microservices/
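
The "docker stats" data mentioned above is also available programmatically. A hedged Python sketch using the docker SDK (pip install docker), assuming a local Docker daemon and at least one running container:

    import docker

    client = docker.from_env()

    for container in client.containers.list():
        stats = container.stats(stream=False)   # one point-in-time snapshot
        mem_usage = stats["memory_stats"].get("usage", 0)
        cpu_total = stats["cpu_stats"]["cpu_usage"]["total_usage"]
        print(f"{container.name:20s} mem={mem_usage / 1024**2:8.1f} MiB "
              f"cpu_total_ns={cpu_total}")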

  • Kubernetes Logging: Comparing Fluentd vs. Logstash

Logging is an important part of the observability and operations requirements for any large-scale, distributed system. 
There are multiple log aggregators and analysis tools in the DevOps space, but two dominate Kubernetes logging: Fluentd and Logstash from the ELK stack.
Both log aggregators, Fluentd and Logstash, address the same DevOps functionalities but are different in their approach, making one preferable to the other, depending on your use case.

Fluentd and Logstash are log collectors.

Logstash
Elasticsearch is the distributed search engine.
With Kibana, users can create powerful visualizations of their data, share dashboards, and manage the Elastic Stack
Logstash is the ELK open-source data collection engine and it can do real-time pipelining
Logstash can unify data from disparate sources dynamically and also normalize the data into destinations of your choice

Fluentd
lets you unify the data collection and consumption to allow better insight into your data. 
Fluentd scrapes logs from a given set of sources, processes them (converting them into a structured data format) and then forwards them to other services like Elasticsearch, object storage etc.
Fluentd also works together with ElasticSearch and Kibana. This is known as the EFK stack.

Comparing Logstash and Fluentd
Both tools run on both Windows and Linux

Event routing
Logstash and Fluentd are different in their approach concerning event routing.
Logstash uses the if-else condition approach; this way we can define certain criteria with If..Then..Else statements – for performing actions on our data.
With Fluentd, the events are routed on tags. Fluentd uses tag-based routing and every input (source) needs to be tagged. Fluentd then matches a tag against different outputs and sends the event to the corresponding output (see the emitter sketch after this item's link).

Transport
Logstash is limited to an in-memory queue that holds 20 events and, therefore, relies on an external queue, like Redis, for persistence across restarts. Often, Redis is used as a “broker” in a centralized Logstash installation, queueing Logstash events from remote Logstash “shippers”.
This means that with Logstash you need an additional tool to be installed and configured in order to get data into Logstash.
This extra dependency adds complexity to the system and can increase the risk of failure.
Fluentd, by contrast, is independent in getting its data and has a configurable in-memory or on-disk buffering system. Fluentd, therefore, is ‘safer’ than Logstash regarding data transport.

Performance and high-volume logging
It is known that Logstash consumes more memory than Fluentd.
Elastic Beats and Fluent Bit have an even smaller resource footprint.
Fluentd uses Ruby and Ruby Gems for configuring its 500+ plugins.

Fluent-bit is recommended when using small or embedded applications.
Elastic beats is the lightweight variant of Logstash. However, if your use case goes beyond mere data transport, to also require data pulling and aggregation, then you’d need both Logstash and Elastic Beats.

Log parsing
Fluentd uses standard built-in parsers (JSON, regex, csv etc.) and Logstash uses plugins for this

Docker support
Docker has a built-in logging driver for Fluentd, but doesn’t have one for Logstash. With Fluentd, no extra agent is required on the container in order to push logs to Fluentd. Logs are directly shipped to Fluentd service from STDOUT without requiring an extra log file.
Logstash requires a separate shipper (Filebeat) in order to read the application logs from STDOUT before they can be sent to Logstash.
When using Docker containers, Fluentd is the preferred candidate, as it makes the architecture less complex and this makes it less risky for logging mistakes.

Container metrics data collection
Both Fluentd and Logstash use the Prometheus exporter to collect container metrics
Logstash, as part of the ELK stack, also uses MetricBeat.

Coding
Logstash is written in JRuby and Fluentd in CRuby. This means Fluentd has an advantage here, because no Java runtime is required.

Logstash vs. Fluentd: Which one to use for Kubernetes?
Data logging can be divided into two areas: event and error logging. Both Fluentd and Logstash can handle both logging types and can be used for different use cases, and even co-exist in your environments for logging both VMs/legacy applications as well as Kubernetes-based microservices

For Kubernetes environments, Fluentd seems the ideal candidate due to its built-in Docker logging driver and parser, which doesn’t require an extra agent to be present on the container to push logs to Fluentd. In comparison with Logstash, this makes the architecture less complex and also makes it less risky for logging mistakes. The fact that Fluentd, like Kubernetes, is another CNCF project is an added advantage.

https://platform9.com/blog/kubernetes-logging-comparing-fluentd-vs-logstash/
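
To make the tag-based routing concrete, the fluent-logger Python library (an assumption here, not mentioned in the article) emits events with a tag that a Fluentd <match> block can route on. A minimal sketch assuming a Fluentd forward input on localhost:24224 and made-up tag and record fields:

    from fluent import sender

    logger = sender.FluentSender("app", host="localhost", port=24224)

    # The full tag becomes "app.checkout"; a <match app.**> block in Fluentd
    # decides where this event goes (Elasticsearch, object storage, stdout, ...).
    ok = logger.emit("checkout", {"event": "order_created", "order_id": 1234})
    if not ok:
        print("emit failed:", logger.last_error)
    logger.close()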


  • Fluent Bit is an open source and multi-platform Log Processor and Forwarder which allows you to collect data/logs from different sources, unify and send them to multiple destinations. It's fully compatible with Docker and Kubernetes environments.
https://fluentbit.io/

  • Beats is a free and open platform for single-purpose data shippers. They send data from hundreds or thousands of machines and systems to Logstash or Elasticsearch.
https://www.elastic.co/beats/





  • Fluentd vs Logstash: Platform Comparison
Logstash: Linux and Windows
Fluentd: Linux and Windows

Event Routing Comparison
Logstash Event Routing
Logstash routes all data into a single stream and then uses algorithmic if-then statements to send them to the right destination.
Fluentd Event Routing
Fluentd relies on tags to route events. Each Fluentd event has a tag that tells Fluentd where it wants to be routed.
Fluentd’s approach is more declarative whereas Logstash’s method is procedural. 
Logstash: Uses algorithmic statements to route events and is good for procedural programmers
Fluentd: Uses tags to route events and is better at complex routing

Plugin Ecosystem Comparison
Logstash Plugins
One key difference is how plugins are managed. Logstash manages all its plugins under a single GitHub repo
Fluentd Plugins
Fluentd adopts a more decentralized approach.

Transport Comparison
Logstash lacks a persistent internal message queue: currently, Logstash has an in-memory queue that holds 20 events (fixed size) and relies on an external queue like Redis for persistence across restarts.
There are also ongoing efforts to persist the queue on disk.
Fluentd has an easy-to-configure buffering system. It can be either in-memory or on-disk, with more parameters.
The upside of Logstash’s approach is simplicity: the mental model for its sized queue is very simple. However, you must deploy Redis alongside Logstash for improved reliability in production.
Logstash: Needs to be deployed with Redis to ensure reliability
Fluentd: Built-in reliability, but its configuration is more complicated

Performance Comparison
Logstash is known to consume more memory at around 120MB compared to Fluentd’s 40MB. 
Spread across 1,000 servers, this can mean 80GB of additional memory use, which is significant. (This hypothetical number comes from the 80MB difference between Logstash and FluentD on a single machine multiplied by 1,000 machines.)
Logstash has a solution. Instead of running the fully featured Logstash on leaf nodes, Elastic recommends that you run Elastic Beats, resource-efficient, purpose-built log shippers.
On Fluentd’s end, there is Fluent Bit, an embeddable low-footprint version of Fluentd written in C, as well as Fluentd Forwarder, a stripped down version of Fluentd written in Go
Logstash: Slightly more memory use. Use Elastic Beats for leaf machines.
Fluentd: Slightly less memory use. Use Fluent Bit and Fluentd Forwarder for leaf machines.

https://logz.io/blog/fluentd-logstash/





  • Prometheus vs. ELK

Prometheus is an open-source monitoring and alerting system that pulls metrics from application services, servers, and other target sources. 

Prometheus advantages
Provides service discovery that is greatly integrated with Kubernetes, finding all services, and pulling metrics from Prometheus endpoints. 
Prometheus always works, even if other parts of the infrastructure are broken. No need to install agents
Provides a functional query language, PromQL, that allows us to select and aggregate time-series data in real time. It can apply subqueries, functions, and operators. It can filter and group by labels, and use regular expressions for improved matching and filtering (see the query sketch after this item's link).

Prometheus disadvantages 
Monitoring limits (required to increase server storage capacity or to limit the number of metrics).
Does not offer reliable long term data storage, anomaly detection, horizontal scaling, and user management.
Requires a bit of a work-around when it comes to push-based solutions for collecting metrics for short-lived jobs. Also some work-arounds can be made via Pushgateway since these metrics are only available for a short period of time.
Prometheus is not a dashboard solution; using Grafana for dashboarding is required when using Prometheus for monitoring.

ELK (Elasticsearch Stack: Elasticsearch, Logstash, Kibana)

Logstash features
We can have multiple pipelines running within the same Logstash instance. This means that Logstash is horizontally scalable.
Collect, parse, and analyse a large variety of structured and unstructured data and events.
Centralize data processing.
Decipher geo coordinates from IP addresses

Elasticsearch features
It is a NoSQL database providing distributed data storage
It provides detailed analyses by offering different query types such as structured, unstructured, geo, and metric data
Provide full-text search.
Use standard RESTful API and JSON, as it’s based on Apache Lucene.
Provide schema free, REST, and JSON distributed data storage.
Provide horizontal scalability, reliability, and capability to real-time search.
Provide security, monitoring, alerting, anomaly detection, anomaly prediction, graph exploration, and reporting features.

Kibana Features 
Kibana is the visualization tool that pairs with Elasticsearch and Logstash
data can also be exported from Elasticsearch to Grafana for more advanced metrics visualization.
Provide real-time analysis, summarizing, charting, and debugging capabilities.
Allow snapshots sharing: share the link or export to PDF or CSV file and send it as an attachment.
Allow setting geo data on any map using Elastic Maps Service to visualize geospatial data.
 
Beats
ELK uses Beats, a collection of so-called data shippers
For example, there are Auditbeat for Linux audit logs, Filebeat for log files, Packetbeat for network traffic, and so on.

ELK advantages
Provides great insight into your distributed system with one ELK instance without the need to connect to hundreds of log data sources.
Elasticsearch is real-time. It means that an added document is available to explore after just seconds.
Ability to scale vertically and horizontally.

ELK disadvantages 
Due to Logstash and Elasticsearch being memory-intensive, you need to do a lot of work to prevent Elastic nodes from going down.

Prometheus VS ELK
Both monitoring systems, Prometheus and ELK stack, have similar purposes. 
Their goals are detecting problems, debugging, and solving issues.
The biggest difference is that ELK specializes in logs, and Prometheus specializes in metrics.
Most major productions require using both ELK and Prometheus

Prometheus VS ELK: the similarities
Both systems use RESTful HTTP/JSON API access methods.
Both systems use sharding methods for storing different data on different nodes
Both systems support different alerting options with integrations for email, Slack or PagerDuty
Prometheus and ELK stack use replication methods for redundant storage of data on multiple nodes.

Prometheus VS ELK: the differences

Prometheus is used for metric collection, various systems monitoring and setting up alerts based on these metrics.
ELK is used to take all types of data, perform different types of analytics based on these data, search, and visualize it.

Prometheus uses TimeSeries DBMS as its primary database model.
ELK stack’s primary database model is a search engine that supports storing different unstructured data types with an inverted index that allows very fast full-text searches.

Prometheus uses its own PromQL, which is actually very easy and powerful.
ELK provides a domain-specific query language based on JSON. Elasticsearch also provides a feature to use SQL-like queries.

Prometheus stores data identified by metric name and key/value pairs.
ELK uses a schema-free data scheme.

ELK collects a variety of logs from different sources, analyzes, and stores them
Prometheus collects metrics in a standard format via a pull method over HTTP. 

Prometheus stores numeric examples of named time series.
In ELK stack, different types of data can be stored, such as numeric, string, boolean, binary, and so on. This lets you keep, analyze, and use data in a more efficient way regardless of the data.

Prometheus stores data locally within the instance, for a maximum of 14 days. Prometheus is not optimized to be a long-term metric store.
ELK provides more long-term data retention compared to Prometheus. 

Kibana allows analyzing relationships in your data (show related products for example), and visualizations for these relationships.
Prometheus has no such extended features in its list, all analysis must be conducted through Grafana.


Use ELK in the following cases
    You are doing event logging.
    You need to process big amounts of log data.
    You need long-term data storage.
    You need to have deep insights into a specific event. 
    You need a clustered solution.

Use Prometheus in the following cases
    You are primarily doing metrics.
    You need simplicity in setting up monitoring and graphing tools.
    You need to run alerts across various sources.

https://www.metricfire.com/blog/prometheus-vs-elk/
  • ELK/EFK compared with Splunk
Both are log management and log analytics platforms: they collect and index logs and provide an interface to search, filter, and interact with log data.

Splunk has three components:
Forwarder — a component installed on the client machine that pushes data to remote indexers.
Indexers — sort and index the data pushed to them by forwarders and are responsible for serving indexed data to search requests.
Search head — the front-end web interface.

ELK/EFK are stacks:
Elasticsearch — essentially a NoSQL database that uses the Lucene search engine to search logs.
Logstash/Fluentd — a data processing and transportation pipeline that populates Elasticsearch with the log data.
Kibana — a dashboard that works on top of Elasticsearch and provides a UI to search, visualize, and analyze the data.
https://medium.com/@balajijk/elk-efk-compare-with-splunk-4c18fc362fd6
  • Prometheus vs InfluxDB

What is Prometheus?
Prometheus is an open-source monitoring tool and time-series database.
Prometheus provides a powerful query language, storage, and visualization features for its users.
Prometheus can be integrated with many other systems (for example, Docker, StatsD, MySQL, Consul, etc.).

What is InfluxDB?
InfluxDB is an open-source time-series database 
It is widely used as a system for monitoring applications, infrastructure, and IoT, as well as for data analysis
InfluxDB has its own ecosystem called TICK-stack consisting of four components: Telegraf, InfluxDB, Chronograf, and Kapacitor
InfluxDB is the central component of this stack. Its primary aim is to store data, while Telegraf acts as a data collector, Kapacitor provides tools for real-time data processing (for example, alerting), and Chronograf is the system for visualization and interaction with all other components of the stack.
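As a small illustration of InfluxDB's storage role, the sketch below writes a single point using the influxdb-client Python library; it assumes an InfluxDB 2.x instance, and the URL, token, org, and bucket are placeholders:

    # Write a single point to InfluxDB 2.x; connection details are placeholders.
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    point = (
        Point("cpu_usage")                 # measurement
        .tag("host", "server01")           # tags make the data multi-dimensional
        .field("value", 0.64)              # the actual sample
    )
    write_api.write(bucket="metrics", record=point)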

Key similarities between Prometheus and InfluxDB
Both Prometheus and InfluxDB are tools for monitoring and storing time-series data. 
Both platforms support multi-dimensional data. This is done by using labels in Prometheus and tags in InfluxDB (see the sketch after this list).
Both systems have additional instruments to deal with specific tasks. For example, InfluxDB has Kapacitor, and Prometheus has Alertmanager for alerting purposes.
They both use query languages to interact with metrics and analyze them.
If the existing plugins are not enough for a given use case, the functionality of both systems can be extended with webhooks.
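The sketch below shows the same multi-dimensional sample as Prometheus exposes it (metric name plus labels, text exposition format) and as InfluxDB ingests it (measurement plus tags, line protocol); the names and values are made up:

    # The same measurement expressed in each system's native format.
    # Metric/measurement, label/tag names, and values are made up.

    # Prometheus text exposition format: metric name + labels + value
    prometheus_sample = 'cpu_usage{host="server01",region="eu"} 0.64'

    # InfluxDB line protocol: measurement,tags fields [timestamp]
    influxdb_line = 'cpu_usage,host=server01,region=eu value=0.64'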

Key differences between Prometheus and InfluxDB
Both systems can be used for monitoring and for storing time-series data. However, InfluxDB is better known as a time-series database, while Prometheus covers a broader scope of monitoring purposes.

InfluxDB itself cannot be used for data visualization or alerting; other instruments from the TICK stack are needed: Kapacitor for alerting and Chronograf for visualization.
Prometheus also needs Alertmanager to send notifications, but the alerting and recording rules are defined directly in Prometheus itself.

Prometheus writes data with millisecond-resolution timestamps. InfluxDB is more advanced in this regard and can work with timestamps down to nanosecond resolution.
Prometheus uses an append-only file per time series for storing data. InfluxDB uses a different storage approach that is considered better suited to event logging.

https://www.metricfire.com/blog/prometheus-vs-influxdb/
  • InfluxDB is a time series database optimized for high-availability storage and rapid retrieval of time series data.
It can work as a stand-alone solution, or it can be used to process data from Graphite
In addition to monitoring, InfluxDB is used for the Internet of things, sensor data, and home automation solutions
https://logz.io/blog/monitoring-kubernetes-grafana-influxdb/


  • Grafana ships with built-in support for Jaeger, which provides open source, end-to-end distributed tracing.
You can link to a Jaeger trace from logs in Loki by configuring a derived field with an internal link.
https://grafana.com/docs/grafana/latest/datasources/jaeger/

  • Grafana ships with built-in support for Loki, Grafana’s log aggregation system. 
Querying Logs
Querying and displaying log data from Loki is available via Explore, and with the logs panel in dashboards. Select the Loki data source, and then enter a LogQL query to display your logs.
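Outside Grafana, the same LogQL can be sent to Loki's own HTTP API; a minimal sketch, assuming Loki listens on localhost:3100 and the logs carry a job="myapp" label:

    # Query Loki's HTTP API with a LogQL expression.
    # Loki address, job label, and search term are assumptions.
    import time
    import requests

    now = time.time_ns()
    resp = requests.get(
        "http://localhost:3100/loki/api/v1/query_range",
        params={
            "query": '{job="myapp"} |= "error"',   # LogQL: stream selector + line filter
            "start": now - 3600 * 10**9,           # last hour (nanosecond timestamps)
            "end": now,
            "limit": 100,
        },
    )
    for stream in resp.json()["data"]["result"]:
        for ts, line in stream["values"]:
            print(ts, line)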
Live tailing
Loki supports Live tailing which displays logs in real-time. This feature is supported in Explore.
https://grafana.com/docs/grafana/latest/datasources/loki/


  • Using Graphite in Grafana
Grafana has an advanced Graphite query editor that lets you quickly navigate the metric space, add functions, change function parameters and much more. The editor can handle all types of graphite queries.
https://grafana.com/docs/grafana/latest/datasources/graphite/



  • The Relationship Between Observability and Monitoring
Observability and monitoring tools work together to offer robust insight into the health of your IT infrastructure. While monitoring alerts the team to a potential issue, observability helps the team detect and solve the root cause of the issue. 

Observability is essential for developers to effectively perform root cause analysis and debug their systems.
With observability software, developers can do this work more easily than if they relied solely on monitoring tools such as telemetry and APM tools.

https://www.strongdm.com/blog/observability-vs-monitoring


  • Monitoring, by textbook definition, is the process of collecting, analyzing, and using information to track a program’s progress toward reaching its objectives and to guide management decisions.
Monitoring focuses on watching specific metrics. Logging provides additional data but is typically viewed in isolation from the broader system context.

Observability is the ability to understand a system’s internal state by analyzing the data it generates, such as logs, metrics, and traces. Observability helps teams analyze what’s happening in context across multicloud environments so you can detect and resolve the underlying causes of issues.

Monitoring is capturing and displaying data, whereas observability can discern system health by analyzing its inputs and outputs. 

 For example, we can actively watch a single metric for changes that indicate a problem — this is monitoring. A system is observable if it emits useful data about its internal state, which is crucial for determining root cause.
 
https://www.dynatrace.com/news/blog/observability-vs-monitoring/


  • Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.

Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.

Blackbox monitoring
In a blackbox (or synthetic) monitoring system, input is sent to the system under examination in the same way a customer might. This might take the form of HTTP calls to a public API, or RPC calls to an exposed endpoint, or it might be calling for an entire web page to be rendered as a part of the monitoring process.
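A blackbox check can be as small as the sketch below: probe an endpoint the way a customer would and record whether it answered in time; the URL and the latency budget are arbitrary examples:

    # Minimal blackbox/synthetic probe: hit a public endpoint like a customer would.
    # The URL and the 2-second latency budget are arbitrary examples.
    import time
    import requests

    URL = "https://example.com/healthz"

    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        latency = time.monotonic() - start
        healthy = resp.status_code == 200 and latency < 2.0
    except requests.RequestException:
        latency, healthy = None, False

    print(f"healthy={healthy} latency={latency}")   # the 0/1-style result a blackbox check reports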

Whitebox monitoring

Monitoring and observability rely on signals sent from the workload under scrutiny into the monitoring system. This can generally take the form of the three most common components: metrics, logs, and traces

Metrics are simply measurements taken inside a system, representing the state of that system in a measurable way.
Logs can be thought of as append-only files that represent the state of a single thread of work at a single point in time.
Traces are composed of spans, which are used to follow an event or user action through a distributed system. 
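A rough sketch of what each of these three signal types might look like for a single request; the names, IDs, and values are illustrative and not tied to any particular library:

    # Illustrative shapes of the three common telemetry signals for one request.
    # All names, IDs, and values are made up for the example.

    # Metric: a measurable value with identifying labels
    metric = {"name": "http_request_duration_seconds", "labels": {"route": "/login"}, "value": 0.042}

    # Log: an append-only record of what one unit of work did at one point in time
    log_line = '2023-05-01T12:00:00Z level=error route=/login msg="authentication failed" trace_id=abc123'

    # Trace span: one step of a request's journey through a distributed system
    span = {
        "trace_id": "abc123",
        "span_id": "def456",
        "parent_span_id": None,
        "name": "POST /login",
        "duration_ms": 42,
    }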


https://cloud.google.com/architecture/devops/devops-measurement-monitoring-and-observability


  • White Box Monitoring
This type of monitoring refers to monitoring the internal state of the applications running on your system. It mainly involves exposing metrics that are specific to your application, such as the total number of HTTP requests received, request latency, and so on.

Black Box Monitoring
This type of monitoring refers to monitoring the externally visible state of services in the system. It is used to check things like whether an application is alive or dead, CPU/disk usage, and so on.

Black box monitoring involves tools like Nagios and Zabbix, which are based on running custom checks against systems to determine the status of applications and services; the checks typically respond with a 0 or 1 to indicate the state of the service being monitored.

White box monitoring, by contrast, involves tools like Prometheus, which let the application export metrics such as the total number of HTTP requests it has received, errors logged, and so on.
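For example, with the prometheus_client Python library an application can count its own HTTP requests and expose them on a /metrics endpoint for Prometheus to scrape; the port and metric names below are arbitrary choices:

    # Expose an application-level counter for Prometheus to scrape (white box).
    # Port 8000 and the metric/label names are arbitrary examples.
    import time
    from prometheus_client import Counter, start_http_server

    HTTP_REQUESTS = Counter(
        "myapp_http_requests_total",
        "Total HTTP requests handled by the application",
        ["method", "status"],
    )

    start_http_server(8000)                # serves /metrics on :8000

    while True:
        # In a real app this would be incremented inside the request handler.
        HTTP_REQUESTS.labels(method="GET", status="200").inc()
        time.sleep(1)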

If a system's disk is filling up fast and usage goes beyond 80%, black box monitoring tools will raise a high-priority alert. The problem is that to actually fix the alert we need additional metrics, such as the rate at which disk usage is growing and the internal application metrics about disk usage; knowing these helps us resolve the issue faster.


With a white box monitoring solution in the same scenario, we can look at graphs of the disk-usage growth rate and per-application disk usage over specific time windows, predict the trend at which the disk is filling up, and identify which application may be misbehaving; these specifics help us resolve the issue in less time.
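With white box data in Prometheus, that trend question becomes a query; for instance, something along these lines, assuming node_exporter's filesystem metrics are being scraped:

    # PromQL expressions for the disk-fill scenario (white box view).
    # Assumes node_exporter's node_filesystem_avail_bytes is being scraped.

    # How fast is free space shrinking on / (bytes per second over the last 6h)?
    shrink_rate = 'deriv(node_filesystem_avail_bytes{mountpoint="/"}[6h])'

    # Will the filesystem run out of space within the next 4 hours?
    will_fill = 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0'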

https://www.linkedin.com/pulse/white-box-vs-black-monitoring-vipul-sharma