Skip to main content

Design your metric

Understand what you want from your metric and how to implement it with StatsHouse:

Then create a metric and start sending data.

What are metrics in StatsHouse?

By "monitoring," we mean getting statistical data. A metric is the minimal unit for setting up, collecting, and viewing statistics.

A metric structure in StatsHouse looks like this:

Here, counter, value, and unique are basic metric types in StatsHouse.

important

StatsHouse does not store an exact metric value per each moment.

It stores aggregates associated with time intervals.

Metric types

You can measure same things in different ways—they are metric types.

See the table below for definitions and examples:

Metric typeWhat does it measure?Examples
counterIt counts the number of times an event has occurred.The number of API method calls
The number of requests to a server
The number of errors received while sending messages
valueIt measures magnitude of a parameter.
A measurement event itself is counted as well.
How long does it take
for a service to generate a newsfeed?
What is CPU usage for this host?
What is the response size (in bytes)?
uniqueIt counts the number of unique events.
The total number of events is counted as well.
The number of unique users who sent packages to a service
important

A metric type affects the range of descriptive statistics available for your metric to view and analyze. For example, percentiles are available for values only. Or you cannot view the cumulative graph for uniques.

See more about enabling percentiles and showing descriptive statistics in the UI.

note

Metric types should not be confused with data types in programming languages.

Implementation details

Counter and value metrics are float64. When StatsHouse receives metric data of a counter or value type, it trims everything outside the [-max(float32)..max(float32)] range. Thus, you avoid getting positive or negative infinity (±inf) while summarizing values—in the database as well.

Counters

Imagine a hypothetical product. For this product, we need to get the number of received packets per second. The packets may have different formats and statuses.

Each time an application receives a packet, it sends an event to StatsHouse:

{"metrics":[ {"name": "toy_packets_count",
"tags":{"format": "JSON", "status": "ok"},
"counter": 1}] }

or

{"metrics":[ {"name": "toy_packets_count",
"tags":{"format": "TL", "status": "error_too_short"},
"counter": 1} ]}

Let's represent an event as a row in a conventional database. Upon per-second aggregation, we'll get the table below—for each tag value combination received, we get the row with the corresponding count:

timestampmetricformatstatuscounter
13:45:05toy_packets_countJSONok100
13:45:05toy_packets_countTLok200
13:45:05toy_packets_countTLerror_too_short5

The number of rows in such a table is a metric's cardinality.

Value metrics

Suppose we want to monitor the size of the received packets. This is what our metric could look like:

{"metrics":[ {"name": "toy_packets_size",
"tags":{"format": "JSON", "status": "ok"},
"value": [150]} ]}

or

{"metrics":[ {"name": "toy_packets_size",
"tags":{"format": "JSON", "status": "error_too_short"},
"value": [0]} ]}

When you use value metrics, StatsHouse calculates an aggregate in addition to a counter: sum, min, max.

timestampmetricformatstatuscountersumminmax
13:45:05toy_packets_sizeJSONok10013000201200
13:45:05toy_packets_sizeTLok20070004800
13:45:05toy_packets_sizeTLerror_too_short51008

The metric value is an array, so you can send several values at a time.

Sending regular values

If you need to record a value per second (a "water level"), the StatsHouse client libraries try to send each value in the middle of the calendar second. Agents finalize the second in accordance to a calendar second. StatsHouse does not ensure that each second contains exactly one measurement but tries to make it more probable. If you need to ensure this, add the timestamp explicitly.

Unique counters

A unique counter is the number of unique integer values. For the string values, hashes are used.

Suppose we receive packets containing the senders' IDs. We can count how many distinct senders there are:

{"metrics":[ {"name": "toy_packets_user",
"tags":{"format": "JSON", "status": "ok"},
"unique": [17]} ]}

The unique value is an array, so you can send several values at a time.

timestampmetricformatstatuscounterunique
13:45:05toy_packets_userJSONok100uniq(17, 19, 13, 15)
13:45:05toy_packets_userTLok200uniq(17, 19, 13, 15, 11)
13:45:05toy_packets_userTLerror_too_short5uniq(51)

StatsHouse uses the HyperLogLog-like function: the values themselves are inaccessible, so you can only estimate the cardinality for the sets.

For unique metrics, StatsHouse stores the aggregates: sum, min, max (the same as for the value metrics interpreted as int64 and approximated to float64). Knowing the range of values may be useful for debugging.

Implementation details

Unique counters are int64—StatsHouse interprets it as 64 bits. They are not uint64 just because some programming languages does not have it. These values are regarded as hashes. When estimating cardinality for the sets of these values, StatsHouse tests them for equality and inequality.

When producing aggregates (sum, min, max), StatsHouse interprets these values as int64 and then converts them into float64 as the column for these aggregates are the same as for the value aggregates.

Combining metric types

Check the valid metric type combinations in the table below:

What you sendWhat you get
"counter":100counter
"value":[1, 2, 3]counter (the size of the array), value (sum, min, max)
"unique":[17, 25, 37]counter (the size of the array), value (sum, min, max), unique
"counter":6, "value":[1, 2, 3]User-guided sampling
"value":100,"unique":100This is not a valid combination

If you refactor your existing metric, i.e., switch between different metric types for a single metric, the data may become confusing or uninformative.

important

Keep sending data of the same type per metric.

Counters for value and unique metrics

If you send a value or unique array, the size of this array becomes the counter for this metric. Thus, you should not implement a separate counter metric for your value or unique metrics. You still can specify counter to implement user-guided sampling.

Tags

Use tags to differentiate the characteristics of what you measure, the contributing factors, or a context.

Tags are additional dimensions you use to filter or group your data. They are sometimes mentioned as "labels" or "keys." Tags are the name-value pairs.

Imagine you conducting an A/B test: which color-text combination is better for a button? You measure the number of clicks per button and use the tags:

Tagged metrics help to verify hypotheses about your data. For monitoring, troubleshooting, or other purposes, you may ask questions like these:

"Does the error rate differ for platforms?"

or

"What is the region we have the highest request rate from? Does it differ for environments?"

For these example questions, you may send metrics (here, using the client library for Python):

statshouse.value("error_rate", {"platform": "web"}, 42.5)
↑ ↑ ↑ ↑
metric name ↑ ↑ measurement
tag name ↑
tag value

or

statshouse.value("request_rate", {"env": "production", "region": "moscow"}, 42.5)
↑ ↑ ↑ ↑ ↑ ↑
metric name ↑ ↑ ↑ ↑ measurement
tag name ↑ tag name ↑
tag value tag value

Then you can filter or group your data using these tags. When you view a metric on a graph, the default UI behavior is to use no grouping.

How many tags

You can use 16 tags per metric:

  • the 0 tag is usually for an environment (read more about customizing it),
  • the 1..15 tags are for any other characteristics.

There is also one more String top tag:

  • the __s tag.

"What if I want more tags?"

Unfortunately, it is impossible for now. We plan to increase the number of tags in the future.

Tag names

You can use system tag names (0..15) to send data. For convenience, add aliases (custom names) to your tags.

Please use these characters:

  • Latin alphabet
  • integers
  • underscores
note

Do not start tag names with underscores. They are for StatsHouse internal use only.

You can use the same tag names for different metrics.

In the StatsHouse UI, you can edit tag names and add short descriptions to them.

Tag values

Tag values are usually string values. StatsHouse maps all of them to int32 values for higher efficiency. This huge stringint32 map is common for all metrics, and the budget for creating new mappings is limited. Mapping flood appears when you exceed this budget.

Length and characters

Tag values must contain only UTF-8 printable characters. All the non-printing characters are replaced with the traffic sign.

The maximum tag value length is 128 bytes—the rest is cut.

Tag values are also normalized: all leading and trailing white space is removed, as defined by Unicode. The sequence of Unicode whitespace characters within a tag value is replaced with one ASCII whitespace character.

How many tag values

There is no formal limitation for a number of tag values, but the rule is to have not that many of them.

Tags with many different values such as user IDs or email addresses may lead to mapping flood errors or increased sampling due to high cardinality. In StatsHouse, metric cardinality is how many unique tag value combinations you send for a metric.

If a tag has too many values, they will soon exceed the mapping budget and will be lost: tag values for your measurements will become the empty strings.

Even if all your tag values have been already mapped, and you avoid the mapping flood but keep sending data with many tag values, your data will probably be sampled. Sampling means that StatsHouse throws away pieces of data to reduce its overall amount. To keep aggregation, statistics, and overall graph's shape the same, StatsHouse multiplies the rest of data by a sampling coefficient.

If it is important for you not to sample data at all, keep an eye on your metric cardinality or reduce resolution for your metric.

We recommend that the very first tags have the lowest cardinality rate. For example, the 0 tag is usually an environment tag having not that many values.

tip

If you need a tag with many different 32-bit integer values (such as user_ID), use the Raw tag values to avoid the mapping flood.

For many different string values (such as search_request), use a String top tag.

Raw tags

If tag values in your metric are originally 32-bit integer values, you can mark them as Raw ones to avoid the mapping flood.

These Raw tag values will be parsed as (u)int32 (-2^31..2^32-1 values are allowed) and inserted into the ClickHouse database as is.

To help yourself remember what your Raw tag values mean, specify a format for your data to show in the UI and add value comments.

String top tag

Use a String top tag (__s) when you need a tag with many different string values such as referrers or search requests.

With the common tags, you will get mapping flood errors very soon for this scenario. The String top tag stands apart from the other ones as its values are not mapped to integers. Thus, you can avoid mapping flood errors and massive sampling.

The String top tag has a special storage: when you send your data labeled with the String top tag values, only the most popular tag values are stored. The other tag values for this metric become empty strings and are aggregated. Read more about the String top tag implementation.

To filter data with the String top tag on a graph, add a name or description to it.

Host name as a tag

To view statistics for each host separately, you may want to use host names as tag values. Try the Max host option instead. You do not have to send something special to get use of Max hostenable it in the UI.

Using host names as tag values prevents data from being aggregated and leads to increased sampling. By contrast, the Max host option does not lead to increased sampling but allows you to find the host that sends the maximum value for your metric.

The Max host option helps to answer questions like these:

  • which host has the maximum disk space usage, or
  • which host shows the maximum rate for a particular error type.

During aggregation, StatsHouse uses the special max_host column in the database to store the name of the host, which is responsible for sending the maximum value (for value metrics) or the maximum contribution to the counter (for counter metrics).

For example, StatsHouse aggregates the rows from two agents:

timestampmetricformatminmaxmax_host
13:45:05toy_latencyJSON2001200nginx001

and

timestampmetricformatminmaxmax_host
13:45:05toy_latencyJSON480nginx003

The maximum toy_latency value (which is 1200) is associated with the nginx001 host in the resulting aggregate:

timestampmetricformatminmaxmax_host
13:45:05toy_latencyJSON41200nginx001
Implementation details

The value in the max_host column is float32 rather than float64 for better compression as there is no need for high precision here. The host name is stored as the stringint32 mapping similar to tag values.

We also recommend using the environment tag (or similar) instead of host_name. When you deploy an experimental feature to one or more hosts, label them with the staging or development tag values instead of their host names.

Customizing the environment tag

StatsHouse stores metrics in a ClickHouse database, where 16 columns are for tags. These tag columns are named like 1..15. For example, the tag names may be “format” and “status.” One can edit the metric, so that "format" relates to the 1 column, and "status" relates to the 2 column. You can use system names 1..15.

What about the 0 column? Use it to specify environments for collecting statistics, e.g., production or staging. For example, if the experimental version of software is installed on a number of hosts, you can associate the 0 tag with this experiment. Set up this tag once in the client library during initialization. In other respects, the 0 tag is similar to the other ones.

Timestamps

StatsHouse writes real-time data as a priority.

important

Writing historical data is allowed only for the latest hour and a half.

If the timestamp is in the future, StatsHouse replaces it with the current time.

If the timestamp relates to a moment that is more than 1.5 hours ago, StatsHouse replaces it with the current time minus 1.5 hours.

For cron jobs that send metric data, use the one-hour sending period: it is OK to send data once in an hour, but it is not OK to send data once in a day.

We do not encourage you to specify timestamps explicitly because rows with differing timestamps cannot be aggregated—this may lead to increased sampling. Moreover, the ClickHouse database is rather slow when inserting historical data.