Chia has a decent GUI, but I like to combine all the metrics I’m interested in (CPU, disk, plotting, farming) into one dashboard that I can access from anywhere.

Chia dashboard

My main tools are Grafana and Prometheus, which I already had running on my server to monitor host metrics from the Prometheus node exporter. That gives me enough for basic monitoring of the plotting process: keeping an eye on disk utilization and IO, plus memory and CPU.

The main metric I’m interested in for the farming process is the search time (how long it takes Chia to see if any of your plots win); if your node doesn’t claim the win within 30s you don’t get the XCH. This metric is in the Chia debug.log when the log level is set to INFO, e.g.:

# 2021-04-24T11:01:53.390 harvester chia.harvester.harvester: INFO     1 plots were eligible for farming 940b588c2a... Found 0 proofs. Time: 0.98087 s. Total 19 plots

We could feed these logs into Elasticsearch, but instead of standing up another service and the pipeline to feed it, I’m going to use mtail to directly parse the logs and extract the metrics I’m interested in.

Mtail has a simple awk-like programming language for defining metrics and extracting them from logs with regular expressions. There are probably other interesting things in the log, but the example harvester line above already carries a lot of useful information. First I define the metrics we’ll track:

counter chia_harvester_blocks_total
gauge chia_harvester_plots_total
counter chia_harvester_plots_eligible
counter chia_harvester_proofs_total
# keep this from getting too high cardinality, plots_eligible will be one of: 0, 1, 2, many
histogram chia_harvester_search_time buckets 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 10, 30 by plots_eligible

The counters will simply count up, and the gauge will follow the “Total x plots” value (hopefully that won’t go down though!). We need more detail for the search time, so I use a histogram, with buckets chosen to give finer resolution around the typical times I’ve seen, plus some higher buckets to show whether I’m approaching or exceeding (!!) the 30s limit. I’ve noticed that search times are very short when no plots pass the initial filter, so I add a label on the metric, which further splits the histogram by the number of plots that passed the filter (set to “many” if greater than 2 to keep the cardinality in check).
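To make the bucket choice concrete, here is a minimal Python sketch of where a given search time lands. It assumes the usual Prometheus convention that each bucket counts observations less than or equal to its upper bound, with an implicit "+Inf" bucket on top; `smallest_bucket` is a hypothetical helper for illustration, not something mtail provides.

```python
from bisect import bisect_left

# Upper bounds copied from the histogram declaration above.
BUCKETS = [0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 10, 30]


def smallest_bucket(search_time: float) -> str:
    """Return the smallest bucket bound that is >= search_time, or "+Inf"."""
    # bisect_left finds the first bound that is not below the observation,
    # so a value exactly on a bound falls into that bound's bucket.
    i = bisect_left(BUCKETS, search_time)
    return str(BUCKETS[i]) if i < len(BUCKETS) else "+Inf"


if __name__ == "__main__":
    # 0.98087 is the search time from the sample log line; 31.2 is a
    # made-up value standing in for a search that blew the 30s limit.
    for t in (0.98087, 31.2):
        print(t, "->", smallest_bucket(t))
```

Anything that only shows up in the "+Inf" bucket took longer than 30s, which is the case I most want to notice on the dashboard.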

Here’s the regex I’ll use to match it:

/^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) harvester chia\.harvester\.harvester: \w+\s+ (?P<eligible_plots>\d+) plots were eligible for farming \w+\.\.\. Found (?P<proofs>\d+) proofs\. Time: (?P<search_time>[\d\.]+) s\. Total (?P<total_plots>\d+) plots$/
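A quick way to sanity-check a pattern like this before loading it into mtail is to run it against the sample log line. The sketch below does that in Python, on the assumption that Python's `re` module treats this particular pattern the same way mtail's regex engine does (both accept the `(?P<name>...)` group syntax). `parse_harvester_line` is a hypothetical helper name, not part of mtail or Chia.

```python
import re
from typing import Optional

# Same pattern as above, minus the surrounding slashes. Literal dots are
# escaped so they cannot match arbitrary characters.
HARVESTER_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) "
    r"harvester chia\.harvester\.harvester: \w+\s+ "
    r"(?P<eligible_plots>\d+) plots were eligible for farming \w+\.\.\. "
    r"Found (?P<proofs>\d+) proofs\. "
    r"Time: (?P<search_time>[\d\.]+) s\. "
    r"Total (?P<total_plots>\d+) plots$"
)

# The sample harvester line from debug.log shown earlier.
SAMPLE = (
    "2021-04-24T11:01:53.390 harvester chia.harvester.harvester: INFO     "
    "1 plots were eligible for farming 940b588c2a... "
    "Found 0 proofs. Time: 0.98087 s. Total 19 plots"
)


def parse_harvester_line(line: str) -> Optional[dict]:
    """Return the captured fields with numeric types, or None if no match."""
    m = HARVESTER_RE.match(line)
    if m is None:
        return None
    return {
        "timestamp": m["timestamp"],
        "eligible_plots": int(m["eligible_plots"]),
        "proofs": int(m["proofs"]),
        "search_time": float(m["search_time"]),
        "total_plots": int(m["total_plots"]),
    }


if __name__ == "__main__":
    print(parse_harvester_line(SAMPLE))
```

Lines that are not harvester summaries simply return `None`, which mirrors how mtail skips any line the pattern does not match.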

Then we just need to parse the timestamp and increment or set the metrics:

{
    strptime($timestamp, "2006-01-02T15:04:05")
    chia_harvester_blocks_total++
    chia_harvester_plots_total = $total_plots
    chia_harvester_plots_eligible += $eligible_plots
    chia_harvester_proofs_total += $proofs
    $eligible_plots > 2 {
        chia_harvester_search_time["many"] = $search_time
    } else {
        chia_harvester_search_time[$eligible_plots] = $search_time
    }
}
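For anyone who doesn't read mtail, the update rules in that block can be sketched in plain Python. This is only an illustration of the semantics (counters accumulate, the gauge is overwritten, search times are grouped by label); the `HarvesterMetrics` class is hypothetical, and the second observation in the usage example is made-up data rather than a real log line.

```python
from collections import defaultdict


class HarvesterMetrics:
    """Toy stand-in for the metrics the mtail program maintains."""

    def __init__(self) -> None:
        self.blocks_total = 0      # counter: one per matched harvester line
        self.plots_total = 0       # gauge: last "Total N plots" value seen
        self.plots_eligible = 0    # counter: running sum of eligible plots
        self.proofs_total = 0      # counter: running sum of proofs found
        # label ("0", "1", "2" or "many") -> list of observed search times
        self.search_times = defaultdict(list)

    def observe(self, eligible_plots: int, proofs: int,
                search_time: float, total_plots: int) -> None:
        self.blocks_total += 1
        self.plots_total = total_plots
        self.plots_eligible += eligible_plots
        self.proofs_total += proofs
        # Collapse anything above 2 into "many" to keep cardinality low.
        label = "many" if eligible_plots > 2 else str(eligible_plots)
        self.search_times[label].append(search_time)


if __name__ == "__main__":
    m = HarvesterMetrics()
    m.observe(eligible_plots=1, proofs=0, search_time=0.98087, total_plots=19)
    m.observe(eligible_plots=4, proofs=1, search_time=2.1, total_plots=20)
    print(m.blocks_total, m.plots_total, dict(m.search_times))
```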

The complete mtail program is on GitHub along with the Grafana dashboard. After I got a lot of interest on Reddit I also put together a docker-compose stack that one can use to easily deploy all the services with Docker.