Datalore 2024.3 Help

Healthcheck & monitoring

Healthcheck

Use Datalore's built-in HTTP endpoint (accessible at /health) to verify that the instance is online and responsive.

This endpoint returns OK when no issues are detected.

Use the same endpoint as a Kubernetes liveness probe if the default Helm charts are used for the deployment.
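The check described above can be sketched as a small polling script, for example to wait for a fresh deployment to come up. This is a minimal sketch, not part of Datalore: the base URL, the helper names `is_healthy` and `wait_until_healthy`, and the decision to treat any HTTP 2xx response as healthy are assumptions made for illustration.

```python
import time
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with an HTTP 2xx status.

    Assumption: any 2xx response means "no issues detected".
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and HTTP
        # error statuses (urllib.error.URLError is a subclass of OSError).
        return False


def wait_until_healthy(base_url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the health endpoint until it responds or the attempts run out."""
    for _ in range(attempts):
        if is_healthy(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8080")` (a hypothetical address) would poll for about a minute with the default settings before giving up.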

Monitoring

Datalore has a built-in metrics exporter, which is disabled by default and accessible at the /metrics path when enabled explicitly.

There are two mutually exclusive environment variables of the Datalore server that can be used to enable metrics:

Monitoring environment variables

| Name | Type | Default value | Description |
|---|---|---|---|
| METRICS_AUTH_TOKEN | String | Not defined | Enables the exporter and defines the authentication token required to collect metrics. Mutually exclusive with ENABLE_UNAUTHORIZED_METRICS. |
| ENABLE_UNAUTHORIZED_METRICS | String | Not defined | Enables the exporter. No authentication will be required to read metrics. Mutually exclusive with METRICS_AUTH_TOKEN. |
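Because the two variables are mutually exclusive, it can help to validate a deployment's environment before starting the server. The following is a minimal sketch of such a check; the `metrics_mode` helper is hypothetical, and it assumes that a variable counts as set when it is present and non-empty (the accepted values of ENABLE_UNAUTHORIZED_METRICS are not specified here).

```python
from typing import Mapping


def metrics_mode(env: Mapping[str, str]) -> str:
    """Classify the metrics configuration found in an environment mapping.

    Returns "token", "unauthorized", or "disabled".
    Raises ValueError if both variables are set, since they are mutually exclusive.
    Assumption: a variable counts as set when it is present and non-empty.
    """
    token = env.get("METRICS_AUTH_TOKEN", "")
    unauthorized = env.get("ENABLE_UNAUTHORIZED_METRICS", "")
    if token and unauthorized:
        raise ValueError(
            "METRICS_AUTH_TOKEN and ENABLE_UNAUTHORIZED_METRICS are mutually exclusive"
        )
    if token:
        return "token"
    if unauthorized:
        return "unauthorized"
    # Neither variable is set, so the exporter stays disabled (the default).
    return "disabled"
```

Passing `os.environ` to `metrics_mode` checks the environment of the current process; passing a plain dictionary makes it easy to check a planned configuration.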

Metrics

  1. agent_pool_size: shows how many agents the pool currently has.

    • Prometheus query: sum by (instance_name)(agent_pool_size)

  2. agent_waiting_time_bucket: histogram buckets for the time users waited for an instance to start.

    • Prometheus query: sum(increase(agent_waiting_time_bucket[10m])) by (le)

  3. agent_in_pool_time_bucket: histogram buckets for the time an agent spent online and idle before being assigned to a specific notebook.

    • Prometheus query: sum(increase(agent_in_pool_time_bucket[10m])) by (le)

  4. agents_started_total: counts the agents that were started. The query below converts it to agents started per minute.

    • Prometheus query: sum by (instance_name)(rate(agents_started_total[5m])) * 60
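The queries above can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes that a Prometheus server is already scraping Datalore's /metrics path and is reachable at a URL you supply; the helper names and the `QUERIES` mapping are hypothetical. It handles only instant-vector results grouped by `instance_name`, which covers the first and last queries in the list.

```python
import json
import urllib.parse
import urllib.request

# Two of the queries listed above, keyed by a descriptive (hypothetical) name.
QUERIES = {
    "agent_pool_size": "sum by (instance_name)(agent_pool_size)",
    "agents_started_per_minute": "sum by (instance_name)(rate(agents_started_total[5m])) * 60",
}


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build the instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload: dict) -> dict:
    """Map each instance_name label to its sample value in an instant-vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    values = {}
    for sample in payload["data"]["result"]:
        name = sample["metric"].get("instance_name", "")
        _timestamp, value = sample["value"]  # Prometheus encodes the value as a string
        values[name] = float(value)
    return values


def run_query(prometheus_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Execute an instant query and return {instance_name: value}."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_vector(json.load(response))
```

For example, `run_query("http://localhost:9090", QUERIES["agent_pool_size"])` would return the current pool size per instance, where the Prometheus address is a placeholder for your own server.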

Last modified: 25 June 2024