Loki Deployment Modes

Grafana Loki's Helm charts have changed significantly between versions, so understanding the deployment modes helps you get a set of Loki services running quickly. Loki is a microservices architecture shipped as a single binary; which component(s) a process runs is selected with the -target flag.

Monolithic mode

Monolithic mode (from the official docs, "Monolithic mode") is what you get when running with -target=all. Note that monolithic mode is only suited to read/write volumes of up to roughly 20GB per day.

Monolithic mode is useful for getting started quickly to experiment with Loki, as well as for small read/write volumes of up to approximately 20GB per day. [1]

In this mode all of Loki's components run inside a single process, as one binary or one container.
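A minimal sketch of starting monolithic mode (the image tag and config path are assumptions; adjust them to your environment). -target=all is also Loki's default:

```shell
# Build the run command for a single-process Loki; -target=all makes
# every component run inside this one container.
cmd="docker run -p 3100:3100 docker.io/grafana/loki:2.9.2 \
  -config.file=/etc/loki/local-config.yaml -target=all"
echo "$cmd"
# eval "$cmd"   # uncomment to actually start the container
```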

SSD

SSD (Simple Scalable Deployment) mode groups Loki's internal components into three services: when the target flag is -target=write, -target=read, or -target=backend, Loki starts in SSD mode. The write and backend targets are stateful services, while read is stateless.
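As a sketch, SSD mode is the same binary started three times with different targets (the config path here is an assumption):

```shell
# SSD mode: one binary, three roles. write and backend are stateful,
# read is stateless, so only read can be scaled freely without a disk.
for target in write read backend; do
  echo "loki -config.file=/etc/loki/config.yaml -target=${target}"
done
```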

Microservices mode

In microservices mode (Microservices mode), the target flag specifies an individual component name, so every component runs as its own microservice. You can list all of Loki's components with the -list-targets flag.

bash
docker run docker.io/grafana/loki:2.9.2 \
	-config.file=/etc/loki/local-config.yaml -list-targets

Officially, microservices mode is recommended only for very large Loki clusters.

Microservices mode is only recommended for very large Loki clusters or for operators who need more precise control over scaling and cluster operations. [2]

Loki Storage

Loki indexes only a log line's metadata, as Prometheus-style key/value pairs; the log content itself is stored without being indexed. A single entry looks roughly like this:

text
 2024-12-25T10:01:02.123456789Z   {key1="value", key2="value2"}    GET /ping
|______________________________| |_____________________________| |____________|
  timestamp                        Prometheus-style labels          log content
  nanosecond precision             key/value pairs                  log line
|______________________________________________________________| |____________|
                             Indexed                               Unindexed

Before Loki 2.0, index data and unindexed data were stored separately. The index holds the labels, e.g. {app="api", env="production", filename="/var/logs/app.log"}, which uniquely identify a stream; object storage holds and compresses the logs themselves. The index is what makes label queries fast.

From Loki 2.0 on there is a Single Store: "index data" and "non-index data" live in the same place, i.e. only one storage backend is needed.

note
Both of the above mention "object storage", which differs here from the conventional notion of S3. What Loki calls object storage is chunk storage, which covers both a "filesystem" backend and an "object storage" backend; only the latter is S3 in the traditional sense [3].

Installing Loki 2.9 with Helm

Add the Loki repository

bash
helm repo add grafana https://grafana.github.io/helm-charts

Helm chart version differences

There are several Loki-related Helm charts, including an official SSD-specific one; here we install with the loki chart. Pay attention to the chart version: loki chart 2.x (shipping up to Loki 2.6.1) runs Loki as a single process, while loki chart 3.x and later defaults to SSD mode, and if you still want "monolithic mode" you must change the values yourself.
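Before installing, it helps to check which chart version ships which Loki app version. This uses standard helm commands; the fallback is only so the snippet degrades gracefully on a machine without helm or the repo:

```shell
# List loki chart versions alongside their app (Loki) versions.
if command -v helm >/dev/null 2>&1; then
  out=$(helm search repo grafana/loki --versions 2>/dev/null | head -n 5)
  [ -n "$out" ] || out="grafana repo not added yet"
else
  out="helm not installed"
fi
echo "$out"
```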

Below is the values.yaml of loki chart 2.16.0:

yaml
image:
  repository: grafana/loki
  tag: 2.6.1
  pullPolicy: IfNotPresent

  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistryKeySecretName

ingress:
  enabled: false
  # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
  # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
  # ingressClassName: nginx
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths: []
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

## Affinity for pod assignment
## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
affinity: {}
# podAntiAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#   - labelSelector:
#       matchExpressions:
#       - key: app
#         operator: In
#         values:
#         - loki
#     topologyKey: "kubernetes.io/hostname"

## StatefulSet annotations
annotations: {}

# enable tracing for debug, need install jaeger and specify right jaeger_agent_host
tracing:
  jaegerAgentHost:

config:
  # existingSecret:
  auth_enabled: false

  memberlist:
    join_members:
      # the value must be defined as string to be evaluated when secret manifest is being generating
      - '{{ include "loki.fullname" . }}-memberlist'

  ingester:
    chunk_idle_period: 3m
    chunk_block_size: 262144
    chunk_retain_period: 1m
    max_transfer_retries: 0
    wal:
      dir: /data/loki/wal
    lifecycler:
      ring:
        replication_factor: 1

      ## Different ring configs can be used. E.g. Consul
      # ring:
      #   store: consul
      #   replication_factor: 1
      #   consul:
      #     host: "consul:8500"
      #     prefix: ""
      #     http_client_timeout: "20s"
      #     consistent_reads: true
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_entries_limit_per_query: 5000
  schema_config:
    configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
  server:
    http_listen_port: 3100
    grpc_listen_port: 9095
  storage_config:
    boltdb_shipper:
      active_index_directory: /data/loki/boltdb-shipper-active
      cache_location: /data/loki/boltdb-shipper-cache
      cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
      shared_store: filesystem
    filesystem:
      directory: /data/loki/chunks
  chunk_store_config:
    max_look_back_period: 0s
  table_manager:
    retention_deletes_enabled: false
    retention_period: 0s
  compactor:
    working_directory: /data/loki/boltdb-shipper-compactor
    shared_store: filesystem
# Needed for Alerting: https://grafana.com/docs/loki/latest/rules/
# This is just a simple example, for more details: https://grafana.com/docs/loki/latest/configuration/#ruler_config
#  ruler:
#    storage:
#      type: local
#      local:
#        directory: /rules
#    rule_path: /tmp/scratch
#    alertmanager_url: http://alertmanager.svc.namespace:9093
#    ring:
#      kvstore:
#        store: inmemory
#    enable_api: true

## Additional Loki container arguments, e.g. log level (debug, info, warn, error)
extraArgs: {}
  # log.level: debug

extraEnvFrom: []

livenessProbe:
  httpGet:
    path: /ready
    port: http-metrics
  initialDelaySeconds: 45

## ref: https://kubernetes.io/docs/concepts/services-networking/network-policies/
networkPolicy:
  enabled: false

## The app name of loki clients
client: {}
  # name:

## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
nodeSelector: {}

## ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
## If you set enabled as "True", you need :
## - create a pv which above 10Gi and has same namespace with loki
## - keep storageClassName same with below setting
persistence:
  enabled: false
  accessModes:
  - ReadWriteOnce
  size: 10Gi
  labels: {}
  annotations: {}
  # selector:
  #   matchLabels:
  #     app.kubernetes.io/name: loki
  # subPath: ""
  # existingClaim:
  # storageClassName:

## Pod Labels
podLabels: {}

## Pod Annotations
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "http-metrics"

podManagementPolicy: OrderedReady

## Assign a PriorityClassName to pods if set
# priorityClassName:

rbac:
  create: true
  pspEnabled: true

readinessProbe:
  httpGet:
    path: /ready
    port: http-metrics
  initialDelaySeconds: 45

replicas: 1

resources: {}
# limits:
#   cpu: 200m
#   memory: 256Mi
# requests:
#   cpu: 100m
#   memory: 128Mi

securityContext:
  fsGroup: 10001
  runAsGroup: 10001
  runAsNonRoot: true
  runAsUser: 10001

containerSecurityContext:
  readOnlyRootFilesystem: true

service:
  type: ClusterIP
  nodePort:
  port: 3100
  annotations: {}
  labels: {}
  targetPort: http-metrics

serviceAccount:
  create: true
  name:
  annotations: {}
  automountServiceAccountToken: true

terminationGracePeriodSeconds: 4800

## Tolerations for pod assignment
## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
tolerations: []

## Topology spread constraint for multi-zone clusters
## ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
topologySpreadConstraints:
  enabled: false

# The values to set in the PodDisruptionBudget spec
# If not set then a PodDisruptionBudget will not be created
podDisruptionBudget: {}
# minAvailable: 1
# maxUnavailable: 1

updateStrategy:
  type: RollingUpdate

serviceMonitor:
  enabled: false
  interval: ""
  additionalLabels: {}
  annotations: {}
  # scrapeTimeout: 10s
  # path: /metrics
  scheme: null
  tlsConfig: {}
  prometheusRule:
    enabled: false
    additionalLabels: {}
  #  namespace:
    rules: []
    #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
    #  - alert: LokiProcessTooManyRestarts
    #    expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
    #    for: 0m
    #    labels:
    #      severity: warning
    #    annotations:
    #      summary: Loki process too many restarts (instance {{ $labels.instance }})
    #      description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    #  - alert: LokiRequestErrors
    #    expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
    #    for: 15m
    #    labels:
    #      severity: critical
    #    annotations:
    #      summary: Loki request errors (instance {{ $labels.instance }})
    #      description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    #  - alert: LokiRequestPanic
    #    expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
    #    for: 5m
    #    labels:
    #      severity: critical
    #    annotations:
    #      summary: Loki request panic (instance {{ $labels.instance }})
    #      description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    #  - alert: LokiRequestLatency
    #    expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1
    #    for: 5m
    #    labels:
    #      severity: critical
    #    annotations:
    #      summary: Loki request latency (instance {{ $labels.instance }})
    #      description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"


initContainers: []
## Init containers to be added to the loki pod.
# - name: my-init-container
#   image: busybox:latest
#   command: ['sh', '-c', 'echo hello']

extraContainers: []
## Additional containers to be added to the loki pod.
# - name: reverse-proxy
#   image: angelbarrera92/basic-auth-reverse-proxy:dev
#   args:
#     - "serve"
#     - "--upstream=http://localhost:3100"
#     - "--auth-config=/etc/reverse-proxy-conf/authn.yaml"
#   ports:
#     - name: http
#       containerPort: 11811
#       protocol: TCP
#   volumeMounts:
#     - name: reverse-proxy-auth-config
#       mountPath: /etc/reverse-proxy-conf


extraVolumes: []
## Additional volumes to the loki pod.
# - name: reverse-proxy-auth-config
#   secret:
#     secretName: reverse-proxy-auth-config

## Extra volume mounts that will be added to the loki container
extraVolumeMounts: []

extraPorts: []
## Additional ports to the loki services. Useful to expose extra container ports.
# - port: 11811
#   protocol: TCP
#   name: http
#   targetPort: http

# Extra env variables to pass to the loki container
env: []

# Specify Loki Alerting rules based on this documentation: https://grafana.com/docs/loki/latest/rules/
# When specified, you also need to add a ruler config section above. An example is shown in the rules docs.
alerting_groups: []
#  - name: example
#    rules:
#    - alert: HighThroughputLogStreams
#      expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
#      for: 2m

useExistingAlertingGroup:
  enabled: false
  configmapName: ""

Deploying in monolithic mode

Edit values.yaml: search for singleBinary and set replicas to a non-zero value (the default shown below is 0).

text
singleBinary:
  # -- Number of replicas for the single binary
  replicas: 0
  autoscaling:
    # -- Enable autoscaling
    enabled: false 
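To actually enable monolithic mode, set replicas to a non-zero value, for example:

```yaml
singleBinary:
  replicas: 1
```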

Also adjust the storage mode. A monolithic deployment is usually a minimal one, so use the "filesystem" storage type:

yaml
# -- Storage config. Providing this will automatically populate all necessary storage configs in the templated config.
  storage:
    bucketNames:
      chunks: chunks
      ruler: ruler
      admin: admin
    # change type to filesystem here
    type: filesystem
    # s3 can point to MinIO, Ceph RGW, etc.
    s3:
      s3: null
      endpoint: null
      region: null
      secretAccessKey: null
      accessKeyId: null
      signatureVersion: null
      s3ForcePathStyle: false
      insecure: false
      http_config: {}
    gcs:
      chunkBufferSize: 0
      requestTimeout: "0s"
      enableHttp2: true
    azure:
      accountName: null
      accountKey: null
      useManagedIdentity: false
      useFederatedToken: false
      userAssignedId: null
      requestTimeout: null
      endpointSuffix: null
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules

Note that with filesystem storage your Kubernetes cluster needs to provide a StorageClass, so the chart can provision a PVC automatically:

yaml
  persistence:
    # -- Enable StatefulSetAutoDeletePVC feature
    enableStatefulSetAutoDeletePVC: false
    # -- Size of persistent disk
    size: 10Gi
    # -- Storage class to be used.
    # If defined, storageClassName: <storageClass>.
    # If set to "-", storageClassName: "", which disables dynamic provisioning.
    # If empty or set to null, no storageClassName spec is
    # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
    storageClass: null
    # -- Selector for persistent disk
    selector: null
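You can check which StorageClasses the cluster offers with standard kubectl (the fallback is only so the snippet degrades gracefully off-cluster):

```shell
# A default StorageClass lets the chart's PVC be provisioned automatically.
if command -v kubectl >/dev/null 2>&1; then
  sc=$(kubectl get storageclass 2>/dev/null)
fi
msg="${sc:-kubectl not available or no cluster access}"
echo "$msg"
```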

Deploy Loki

bash
## Deploy Loki
helm upgrade --install loki  \
  --namespace logging \
  --create-namespace \
  -f values.yaml \
  --version {loki version} \
  grafana/loki
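A quick post-install smoke test, shown as a sketch (the namespace and release name match the command above; run the printed commands against your cluster):

```shell
ns=logging
release=loki
# Inspect the pods, then port-forward and hit the readiness endpoint.
echo "kubectl -n ${ns} get pods -l app.kubernetes.io/instance=${release}"
echo "kubectl -n ${ns} port-forward svc/${release} 3100:3100"
echo "curl -s http://localhost:3100/ready"
```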

Deploying in SSD mode

By default, the chart deploys in SSD mode.

yaml
  commonConfig:
    path_prefix: /var/loki
    # this factor determines the minimum number of read, write, and backend instances
    replication_factor: 3 
    compactor_address: '{{ include "loki.compactorAddress" . }}'
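With replication_factor: 3, each target needs at least three replicas; the corresponding chart defaults look like this (illustrative excerpt):

```yaml
write:
  replicas: 3
read:
  replicas: 3
backend:
  replicas: 3
```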

The Ingress assigns different routes depending on which mode the user has enabled, defined in grafana/loki/blob/v2.9.2/production/helm/loki/templates/ingress.yaml:

yaml
  rules:
    {{- range $.Values.ingress.hosts }}
    - host: {{ . | quote }}
      http:
        paths:
          {{- include "loki.ingress.servicePaths" $ | indent 10}}
    {{- end }}

And in grafana/loki/blob/v2.9.2/production/helm/loki/templates/_helpers.tpl you can see:

yaml
{{/*
Generate list of ingress service paths based on deployment type
*/}}
{{- define "loki.ingress.servicePaths" -}}
{{- if (eq (include "loki.deployment.isScalable" .) "true") -}}
{{- include "loki.ingress.scalableServicePaths" . }}
{{- else -}}
{{- include "loki.ingress.singleBinaryServicePaths" . }}
{{- end -}}
{{- end -}}

{{/*
Ingress service paths for scalable deployment
*/}}
{{- define "loki.ingress.scalableServicePaths" -}}
{{- include "loki.ingress.servicePath" (dict "ctx" . "svcName" "read" "paths" .Values.ingress.paths.read )}}
{{- include "loki.ingress.servicePath" (dict "ctx" . "svcName" "write" "paths" .Values.ingress.paths.write )}}
{{- end -}}

# The includes above pass key/value pairs, e.g. svcName=write, paths=.Values.ingress.paths.write
# Below, each path is routed to the matching service
# https://github.com/grafana/loki/blob/v2.9.2/production/helm/loki/templates/_helpers.tpl#L471C1-L486C12
{{- range .paths }}
- path: {{ . }}
  {{- if $ingressSupportsPathType }}
  pathType: Prefix
  {{- end }}
  backend:
    {{- if $ingressApiIsStable }}
    {{- $serviceName := include "loki.ingress.serviceName" (dict "ctx" $.ctx "svcName" $.svcName) }}
    service:
      name: {{ $serviceName }}
      port:
        number: 3100
    {{- else }}
    serviceName: {{ $serviceName }}
    servicePort: 3100
{{- end -}}


# isScalable means singleBinary is 0 AND object storage is in use
{{- define "loki.deployment.isScalable" -}}
  {{- and (eq (include "loki.isUsingObjectStorage" . ) "true") (eq (int .Values.singleBinary.replicas) 0) }}
{{- end -}}

{{/* Determine if deployment is using object storage */}}
{{- define "loki.isUsingObjectStorage" -}}
{{- or (eq .Values.loki.storage.type "gcs") (eq .Values.loki.storage.type "s3") (eq .Values.loki.storage.type "azure") -}}
{{- end -}}
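The two helpers combine into one decision rule; a small restatement in shell (the values are illustrative) makes it explicit:

```shell
# SSD ("scalable") rendering requires BOTH object storage AND
# singleBinary.replicas == 0; anything else falls back to single binary.
storage_type="filesystem"       # try: s3, gcs, azure
single_binary_replicas=0
case "$storage_type" in
  s3|gcs|azure) using_object_storage=true ;;
  *)            using_object_storage=false ;;
esac
if [ "$using_object_storage" = true ] && [ "$single_binary_replicas" -eq 0 ]; then
  is_scalable=true
else
  is_scalable=false
fi
echo "isScalable=${is_scalable}"
```

With filesystem storage the helper renders single-binary service paths even if singleBinary.replicas is 0, which is why the storage type matters as much as the replica count.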

Finally, the routes defined in values.yaml forward requests to the corresponding service:

yaml
  paths:
    write:
      - /api/prom/push
      - /loki/api/v1/push
    read:
      - /api/prom/tail
      - /loki/api/v1/tail
      - /loki/api
      - /api/prom/rules
      - /loki/api/v1/rules
      - /prometheus/api/v1/rules
      - /prometheus/api/v1/alerts
    singleBinary:
      - /api/prom/push
      - /loki/api/v1/push
      - /api/prom/tail
      - /loki/api/v1/tail
      - /loki/api
      - /api/prom/rules
      - /loki/api/v1/rules
      - /prometheus/api/v1/rules
      - /prometheus/api/v1/alerts
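For example, a log line can be pushed straight through the write path listed above (GATEWAY and the label set are assumptions; the curl call is left commented so the sketch runs anywhere):

```shell
GATEWAY=${GATEWAY:-http://localhost:3100}
ts="$(date +%s)000000000"   # the push API expects nanosecond timestamps
payload='{"streams":[{"stream":{"app":"demo"},"values":[["'"$ts"'","hello loki"]]}]}'
echo "$payload"
# curl -s -X POST -H 'Content-Type: application/json' \
#   -d "$payload" "$GATEWAY/loki/api/v1/push"
```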

After that, enable the services you need: read, write, backend, gateway. Loki maintains a table of which components each mode includes; note that the backend target's components are separate from read and write.

Component                        individual  all  read  write  backend
Distributor                          x        x           x
Ingester                             x        x           x
Query Frontend                       x        x     x
Query Scheduler                      x        x                   x
Querier                              x        x     x
Index Gateway                        x                            x
Compactor                            x        x                   x
Ruler                                x        x                   x
Bloom Planner (Experimental)         x                            x
Bloom Builder (Experimental)         x                            x
Bloom Gateway (Experimental)         x                            x
Table: components included in each Loki deployment mode
Source:https://grafana.com/docs/loki/latest/get-started/components/

The project also provides some simple deployment examples in the production/helm/loki/ci folder.

That covers deploying Loki with Helm in both modes; pair it with a data-ingestion agent such as Promtail or Filebeat to complete log collection.

Reference

[1] Loki deployment modes

[2] Microservices mode

[3] Chunk storage

[4] Loki components