Building a Cloud-Native Monitoring System

2022-04-02 12:00:23

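Mind map of the monitoring system build-out

Observable

1. Monitoring system

Prometheus + Grafana + alert

Setup guides are easy to find online, so the installation steps are skipped here.

2. Logging system

Loki. Beyond the official documentation, the series of Loki articles written by “云原生小白” is a useful reference: https://blog.csdn.net/weixin_49366475

3. Distributed tracing (not yet implemented)

Jaeger, eBPF (https://developer.aliyun.com/article/875115?utm_content=g_1000329007)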

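Perceivable

Monitoring and alerting rules are delivered with k8s-sidecar: the rules are registered as a ConfigMap, and the agent and alert components watch for that ConfigMap by name and reload when it changes.

k8s-sidecar project: https://github.com/kiwigrid/k8s-sidecar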
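
The wiring described above could look roughly like the sketch below. It is not the author's actual deployment: the resource names, the namespace, the `alert_rules` label key, the image tags and the reload URL are all assumptions made for illustration, and it assumes a Prometheus-compatible component that exposes an HTTP reload endpoint. The environment variables (`LABEL`, `LABEL_VALUE`, `FOLDER`, `RESOURCE`, `REQ_URL`, `REQ_METHOD`) are the ones documented by k8s-sidecar.

```yaml
# Sketch only: every name, label key, tag and URL below is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-alert-rules              # hypothetical name
  namespace: monitoring               # hypothetical namespace
  labels:
    alert_rules: "1"                  # the label the sidecar is told to look for
data:
  node-rules.yaml: |
    groups:
    - name: node_monitor
      rules:
      - alert: node_status
        expr: up == 0
        for: 5m
        labels:
          severity: error
---
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-with-sidecar       # hypothetical name
  namespace: monitoring
spec:
  serviceAccountName: rules-sidecar   # hypothetical; needs get/list/watch on configmaps
  volumes:
  - name: rules
    emptyDir: {}                      # shared between the sidecar and Prometheus
  containers:
  - name: prometheus
    image: prom/prometheus:v2.45.0    # example tag
    args:
    - --config.file=/etc/prometheus/prometheus.yml
    - --web.enable-lifecycle          # enables the /-/reload endpoint
    volumeMounts:
    - name: rules
      mountPath: /etc/prometheus/rules
  - name: k8s-sidecar
    image: kiwigrid/k8s-sidecar:1.25.2   # example tag
    env:
    - name: LABEL                     # collect ConfigMaps carrying this label
      value: alert_rules
    - name: LABEL_VALUE
      value: "1"
    - name: RESOURCE
      value: configmap
    - name: FOLDER                    # where collected files are written
      value: /etc/prometheus/rules
    - name: REQ_URL                   # called after files change
      value: http://localhost:9090/-/reload
    - name: REQ_METHOD
      value: POST
    volumeMounts:
    - name: rules
      mountPath: /etc/prometheus/rules
```

For this to take effect, the Prometheus configuration would also need a `rule_files` entry covering `/etc/prometheus/rules/*.yaml`, and the service account would need RBAC permission to read ConfigMaps.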

1. Alerting rules

Node alerts

nodealertrules:
  groups:
  - name: node_monitor
    rules:
    - alert: node_status
      expr: up == 0
      for: 5m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{$labels.instance}}: unavailable"
        description: "{{$labels.instance}}: has been down for more than 5 minutes"
    
    - alert: CPU_critical
      expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(ks_ops_haique_tech_com_nodegroup,instance,job)* 100) > 80 and count(node_cpu_seconds_total{mode='system'}) by (ks_ops_haique_tech_com_nodegroup,instance,job) > 1
      for: 10m
      labels:
        ds: 32
        severity: critical
      annotations:
        summary: "{{$labels.instance}} CPU usage is too high!"
        description: "{{$labels.instance }} CPU usage is above 80% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: CPU_softirq
      expr: avg(irate(node_cpu_seconds_total{mode="softirq"}[5m])) by(ks_ops_haique_tech_com_nodegroup,instance,job)* 100 >40 and count(node_cpu_seconds_total{mode='system'}) by (ks_ops_haique_tech_com_nodegroup,instance,job) > 1
      for: 10m
      labels:
        ds: 32
        severity: error
      annotations: 
        summary: "{{$labels.instance}} CPU softirq (si) usage is too high!"
        description: "{{$labels.instance }} CPU softirq (si) usage is above 40% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: CPU_load5
      expr: avg(node_load5) by (ks_ops_haique_tech_com_nodegroup,instance,job) - count(node_cpu_seconds_total{mode='system'}) by (ks_ops_haique_tech_com_nodegroup,instance,job) *2 > 0 and count(node_cpu_seconds_total{mode='system'}) by (ks_ops_haique_tech_com_nodegroup,instance,job) > 1
      for: 10m
      labels:
        ds: 32
        severity: error
      annotations: 
        summary: "{{$labels.instance}} CPU load is high!"
        description: "{{$labels.instance }} 5-minute load average has stayed above twice the CPU core count for 10 minutes (current excess: {{$value}}), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: HostOutOfMemory
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
      for: 3m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{$labels.instance}} memory usage is too high!"
        description: "{{$labels.instance}} available memory is below 10% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: HostOomKillDetected
      expr: increase(node_vmstat_oom_kill[1m]) > 0
      for: 0m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{ $labels.instance }} OOM kill detected"
        description: "{{ $labels.instance }} OOM kill detected, {{$value}} occurrence(s), {{$labels.ks_ops_haique_tech_com_nodegroup}}"
        
    - alert: IO
      expr: irate(node_disk_io_time_seconds_total{device=~"vd.*"}[1m]) *100 > 40
      for: 5m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{$labels.instance}} disk I/O time is high!"
        description: "{{$labels.instance }} share of time spent on disk I/O is above 40% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: mount_error
      expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs"}) *100/(node_filesystem_avail_bytes {fstype=~"ext.*|xfs"}+(node_filesystem_size_bytes{fstype=~"ext.*|xfs"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs"})) > 80
      for: 1m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{$labels.instance}} {{$labels.mountpoint}} partition usage is too high!"
        description: "{{$labels.instance}} {{$labels.mountpoint }} partition usage is above 80% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"
    
    - alert: mount_critical
      expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs"}) *100/(node_filesystem_avail_bytes {fstype=~"ext.*|xfs"}+(node_filesystem_size_bytes{fstype=~"ext.*|xfs"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs"})) > 95
      for: 1m
      labels:
        ds: 32
        severity: critical
      annotations:
        summary: "{{$labels.instance}} {{$labels.mountpoint}} partition usage is too high!"
        description: "{{$labels.instance}} {{$labels.mountpoint }} partition usage is above 95% (current: {{$value}}%), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

    - alert: HostConntrackLimit
      expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
      for: 5m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{ $labels.instance }} connection count is approaching the limit"
        description: "{{ $labels.instance }} ratio of tracked connections to the conntrack limit is above 0.8 (current: {{$value}}), {{$labels.ks_ops_haique_tech_com_nodegroup}}"

Pod alerts

podalertrules:
  groups:
  - name: pod_monitor
    rules:
    - alert: pod_cpu
      expr: sum(rate(container_cpu_usage_seconds_total{container !="",container!="POD"}[2m])) by (container, pod) / sum(kube_pod_container_resource_limits{resource="cpu",unit="core",container !="",container!="POD"} > 0) by (container, pod) * 100 > 80
      for: 10m
      labels:
        ds: 32
        severity: warning
      annotations:
        summary: "{{$labels.pod}}: CPU usage above 80%"
        description: "{{$labels.pod}}, {{$labels.container}}, CPU usage {{$value}}%"
    - alert: pod_mem
      expr: sum(container_memory_working_set_bytes{container !="",container!="POD"}) by (container, pod) / sum(container_spec_memory_limit_bytes{container !="",container!="POD"} > 0) by (container, pod) * 100 > 80
      for: 10m
      labels:
        ds: 32
        severity: warning
      annotations:
        summary: "{{$labels.pod}}: memory usage above 80%"
        description: "{{$labels.pod}}, {{$labels.container}}, memory usage {{$value}}%"
    - alert: pod_restartNum
      expr: increase(kube_pod_container_status_restarts_total{container!="POD"}[3h])> 3
      for: 0m
      labels:
        ds: 32
        severity: error
      annotations:
        summary: "{{$labels.pod}}: container restarts frequently"
        description: "{{$labels.pod}}, {{$labels.container}}, restarted more than 3 times in the last 3 hours ({{$value}} restarts)"
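
The rules above tag every alert with a `severity` label (`warning`, `error` or `critical`). The article does not say how those alerts are delivered, so the following is only a sketch of one way the label could be used, assuming the alerts are sent through Alertmanager. The receiver names and webhook URLs are hypothetical placeholders.

```yaml
# Alertmanager routing sketch; receivers and URLs are placeholders.
route:
  receiver: default                  # fallback when no child route matches
  group_by: [alertname, instance]
  routes:
  - matchers:
    - severity = "critical"          # e.g. CPU_critical, mount_critical
    receiver: oncall
    repeat_interval: 1h
  - matchers:
    - severity =~ "error|warning"    # everything else defined in the rules
    receiver: chat
receivers:
- name: default
- name: oncall
  webhook_configs:
  - url: http://example.internal/hooks/oncall   # hypothetical endpoint
- name: chat
  webhook_configs:
  - url: http://example.internal/hooks/chat     # hypothetical endpoint
```

Keeping `critical` on its own route makes it possible to notify more aggressively for the two rules that use it without changing the rule files.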

Predictable

See my other article:

Monitoring and alerting with the z-score anomaly detection algorithm (使用z-score异常检测算法进行监控告警) - 沄持的学习记录 - 博客园 (cnblogs.com)
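
For readers who have not seen the technique, a generic z-score rule can be written directly in PromQL: subtract a metric's recent mean and divide by its recent standard deviation. The sketch below is not taken from the linked article; the choice of `node_load5`, the 1-day window, the threshold of 3 and the rule names are all assumptions.

```yaml
# Generic z-score sketch; metric, window, threshold and names are assumptions.
groups:
- name: zscore_sketch
  rules:
  # How many standard deviations the current value is from its 1-day mean.
  - record: instance:node_load5:zscore_1d
    expr: |
      (node_load5 - avg_over_time(node_load5[1d]))
        / stddev_over_time(node_load5[1d])
  - alert: NodeLoadAnomaly
    expr: abs(instance:node_load5:zscore_1d) > 3
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} load5 deviates from its 1-day baseline"
      description: "z-score is {{ $value }}"
```

Unlike the fixed thresholds in the node and pod rules, this kind of rule adapts to each instance's own baseline, at the cost of being noisier on metrics with very low variance.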

Source: https://www.cnblogs.com/maxgongzuo/p/16091365.html
