📊 Prometheus Server
EdgeHit incorporates a comprehensive monitoring and alerting stack using Prometheus, Grafana, and Alertmanager. These components work seamlessly together to track system metrics, visualize operational health, and alert administrators to potential infrastructure issues.
While often referred to as the Prometheus Server, this setup actually comprises several key components: Prometheus, Grafana, and Alertmanager, working in tandem.
Although the Prometheus server can technically run on the same machine as the EdgeHit Controller, in production environments it is typically deployed separately to optimize performance, scalability, and resource management.
📊 Prometheus
Prometheus acts as the central monitoring system for EdgeHit, collecting and storing time-series metrics across all nodes and services.
- Minimal Configuration: Prometheus is deployed with a basic configuration that primarily scrapes metrics from the Node Exporter running on the EdgeHit Controller node.
- Dynamic Registration of Load Proxy Nodes: A custom Python script is provided to automate the addition of Load Proxy nodes to the Prometheus configuration. The script reads node metadata and dynamically updates the `prometheus.yml` file to include new instances without manual intervention; a minimal sketch of this approach appears after this list.
  ℹ️ Note: The script's usage and configuration are documented in the Maintenance Guideline section, which explains how to integrate new Load Proxy nodes automatically.
- Data Collection: Prometheus scrapes metrics from Node Exporter at regular intervals, with each node providing system-level metrics such as CPU usage, memory, disk I/O, and network traffic.
- Access: Prometheus is accessible at `0.0.0.0:9090`, secured with Pre-Shared Key (PSK)-based HTTP authentication. The PSK is generated randomly during deployment and stored securely in the `.env` file for container runtime use.
  ℹ️ Note: Web access to Prometheus is restricted to viewing existing alerts, targets, and configuration details. Configuration modifications, such as adding or removing scrape targets, must be done via predefined scripts.
- Use Case: Prometheus continuously monitors the performance of the Load Proxy and EdgeHit Controller nodes. Metrics such as CPU load, disk usage, and memory utilization help identify resource bottlenecks. Data from Prometheus is fed into Grafana for visualization, while custom queries can be written for advanced analytics and reporting.
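
The shipped registration script is documented in the Maintenance Guideline; the snippet below is only a minimal sketch of the general approach, assuming hypothetical file locations (`/etc/prometheus/prometheus.yml`, `load_proxy_nodes.json`), a hypothetical job name, and that PyYAML is available.

```python
# Hypothetical sketch of a dynamic-registration helper; the script shipped with
# EdgeHit may differ. Assumes Load Proxy metadata is a JSON list such as
# [{"name": "proxy-1", "address": "10.0.0.5:9100"}].
import json
import yaml  # PyYAML

PROMETHEUS_CONFIG = "/etc/prometheus/prometheus.yml"   # assumed path
NODE_METADATA = "load_proxy_nodes.json"                # assumed metadata file
JOB_NAME = "load-proxy-node-exporter"                  # assumed scrape job name

def register_load_proxy_nodes():
    with open(NODE_METADATA) as f:
        nodes = json.load(f)

    with open(PROMETHEUS_CONFIG) as f:
        config = yaml.safe_load(f)

    # Find (or create) the scrape job used for Load Proxy Node Exporters.
    scrape_configs = config.setdefault("scrape_configs", [])
    job = next((j for j in scrape_configs if j.get("job_name") == JOB_NAME), None)
    if job is None:
        job = {"job_name": JOB_NAME, "static_configs": [{"targets": []}]}
        scrape_configs.append(job)

    # Append any new node addresses that are not already registered.
    targets = job["static_configs"][0]["targets"]
    for node in nodes:
        if node["address"] not in targets:
            targets.append(node["address"])

    with open(PROMETHEUS_CONFIG, "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False)
    # Prometheus then needs a reload (e.g. SIGHUP or a container restart)
    # to pick up the updated configuration.

if __name__ == "__main__":
    register_load_proxy_nodes()
```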
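Likewise, the example below illustrates one way a custom query might be issued against the Prometheus HTTP API for reporting purposes. It is a sketch only: the environment variable name for the PSK and the bearer-style Authorization header are assumptions, since the exact authentication scheme depends on how the deployment fronts Prometheus.

```python
# Sketch: running a custom PromQL query over the Prometheus HTTP API.
# The PROMETHEUS_PSK variable name and the bearer-token header are assumptions.
import os
import requests

PROMETHEUS_URL = "http://0.0.0.0:9090"
PSK = os.environ["PROMETHEUS_PSK"]  # assumed variable name in the .env file

# Average CPU utilization per instance over the last 5 minutes (Node Exporter metrics).
query = '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    headers={"Authorization": f"Bearer {PSK}"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"]["instance"], result["value"][1])
```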
📊 Grafana
Grafana serves as the visualization platform for the EdgeHit observability stack. It connects to Prometheus and provides custom dashboards for operational monitoring.
- Default User Authentication: Upon deployment, Grafana generates a default admin username and password, which are stored in the `.env` file.
- Data Source: Grafana connects to Prometheus, which runs as a local process, via Docker networking. One way to register this connection programmatically is sketched after this list.
- Alerts: Grafana enables alert threshold definitions directly on metrics. If any monitored metric (e.g., CPU usage, memory usage, NGINX cache misses) crosses a defined threshold, it can trigger an alert via Alertmanager.
- Access: Grafana is accessible at `0.0.0.0:3000` with the default admin username and password.
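
For reference, the snippet below sketches how the Prometheus data source could be registered through Grafana's HTTP API. It is illustrative only: in an EdgeHit deployment the data source may already be provisioned automatically, the environment variable names are hypothetical, and the `prometheus:9090` address assumes a Docker-network service name.

```python
# Sketch: registering Prometheus as a Grafana data source via Grafana's HTTP API.
# GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD are hypothetical names for the
# credentials stored in the .env file.
import os
import requests

GRAFANA_URL = "http://0.0.0.0:3000"
auth = (os.environ["GRAFANA_ADMIN_USER"], os.environ["GRAFANA_ADMIN_PASSWORD"])

datasource = {
    "name": "EdgeHit Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",   # assumed Docker-network service name
    "access": "proxy",                 # Grafana proxies queries server-side
    "isDefault": True,
}

resp = requests.post(f"{GRAFANA_URL}/api/datasources", json=datasource, auth=auth, timeout=10)
resp.raise_for_status()
print("Grafana responded:", resp.json().get("message", resp.status_code))
```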
🚨 Alertmanager
Alertmanager is integrated into EdgeHit's monitoring stack to handle alert routing, aggregation, and notification for critical events in the EdgeHit environment.
- Alert Grouping and Silence: Alertmanager supports grouping multiple alerts into a single notification and silencing specific alerts during maintenance windows to reduce noise and avoid alert fatigue. A sketch of creating such a silence programmatically appears at the end of this section.
- Notification Endpoints: Alertmanager routes alerts to various notification endpoints, including:
  - PagerDuty: Incident creation on PagerDuty for on-call response.
  - Slack: Alerts can be sent to a dedicated Slack channel for team visibility.
- Use Case: When a Load Proxy node experiences high resource consumption, Alertmanager sends an immediate notification via PagerDuty, ensuring that the on-call team is informed and can take action promptly.
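
As an example of the silencing workflow mentioned above, the sketch below creates a temporary silence for a single node through the Alertmanager v2 API. The Alertmanager address (assumed default port 9093) and the `instance` label value are assumptions and should be adapted to the actual deployment.

```python
# Sketch: silencing alerts for one Load Proxy node during a maintenance window
# via the Alertmanager v2 API. Address and label value are assumptions.
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://0.0.0.0:9093"  # assumed default Alertmanager port

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "instance", "value": "load-proxy-1:9100", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),  # two-hour maintenance window
    "createdBy": "ops-team",
    "comment": "Planned maintenance on load-proxy-1",
}

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("Silence created:", resp.json()["silenceID"])
```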


