# Managing Data
Deployments of ClickHouse for Observability invariably involve large datasets, which need to be managed. ClickHouse offers a number of features to assist with data management.
## Partitions
Partitioning in ClickHouse allows data to be logically separated on disk according to a column or SQL expression. By separating data logically, each partition can be operated on independently, e.g. deleted. This allows users to efficiently move partitions, and thus subsets of the data, between storage tiers based on time, or to expire data and efficiently delete it from a cluster.
Partitioning is specified on a table when it is initially defined via the `PARTITION BY` clause. This clause can contain a SQL expression on one or more columns, the result of which defines which partition a row is sent to.
![Partitions in ClickHouse](/docs/assets/images/observability-14-b71a9bc9505db6019ee2a6ae912528db.png)
The data parts are logically associated with each partition on disk (via a common folder name prefix) and can be queried in isolation. For the example below, the default `otel_logs` schema partitions by day using the expression `toDate(Timestamp)`. As rows are inserted into ClickHouse, this expression is evaluated against each row, and the row is routed to the resulting partition if it exists (if the row is the first for a day, the partition is created).
```sql
CREATE TABLE default.otel_logs
(
...
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
```
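Once data has been inserted, the partitions that exist for a table can be inspected via the `system.parts` system table. The query below is a minimal sketch that assumes the `otel_logs` table above lives in the `default` database:

```sql
-- List active partitions for otel_logs with the number of parts and rows in each
SELECT
    partition,
    count() AS parts,
    sum(rows) AS rows
FROM system.parts
WHERE (database = 'default') AND (table = 'otel_logs') AND active
GROUP BY partition
ORDER BY partition ASC
```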
A number of operations can be performed on partitions, including backups, column manipulations, mutations (altering/deleting data by row), and index clearing (e.g. of secondary indices).
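For illustration, the statements below sketch two such operations against the table above: moving a single day's partition to a slower storage tier and dropping it entirely. The volume name `cold` is an assumption and requires a matching storage policy to be configured:

```sql
-- Move the partition for a given day to a slower tier
-- (assumes a storage policy defining a volume named 'cold')
ALTER TABLE otel_logs MOVE PARTITION '2019-01-22' TO VOLUME 'cold';

-- Delete the partition, and thus all rows for that day
ALTER TABLE otel_logs DROP PARTITION '2019-01-22';
```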
As an example, suppose our `otel_logs` table is partitioned by day. If populated with the structured log dataset, this will contain several days of data:
```sql
SELECT Timestamp::Date AS day,
    count() AS c
FROM otel_logs
GROUP BY day
ORDER BY c DESC
```
┌────────day─┬───────c─┐
│ 2019-01-22 │ 2333977 │