node_exporter_metrics input plugin producing stale data for network devices that no longer exist #9400

ElectricWeasel opened this issue Sep 18, 2024
Bug Report

The node_exporter_metrics input plugin produces stale data for network devices that no longer exist; this can be observed both in the Mimir logs and in the file output dump.
The problem appears to be triggered by the veth* virtual network devices created for Docker containers: their metrics are repeatedly sent long after the device is gone. We use Docker nodes in Swarm mode to run application builds and tests (Jenkins agents), so the containers are short-lived.
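If I had to guess at the mechanism: it looks as if an internal metrics context keeps one entry per label set (per device), re-emits every cached entry on each flush, and never prunes entries for devices that have disappeared. The following C sketch illustrates that suspected pattern; it is not Fluent Bit's actual code, and all names in it are made up:

/* Illustrative sketch only (NOT Fluent Bit's actual code): a per-label-set
 * metric cache that is filled on every scrape but never pruned. Devices
 * that disappear keep their last sample and timestamp, which is exactly
 * the pattern visible in the dump below. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_DEVICES 64

struct sample {
    char   device[32];   /* label value, e.g. "veth9131746" */
    double value;        /* last observed counter value     */
    time_t timestamp;    /* time of the last real update    */
};

static struct sample cache[MAX_DEVICES];
static int cache_len = 0;

/* Called for every device found during a scrape. */
static void update(const char *device, double value, time_t now)
{
    for (int i = 0; i < cache_len; i++) {
        if (strcmp(cache[i].device, device) == 0) {
            cache[i].value = value;
            cache[i].timestamp = now;
            return;
        }
    }
    if (cache_len < MAX_DEVICES) {
        snprintf(cache[cache_len].device, sizeof(cache[cache_len].device),
                 "%s", device);
        cache[cache_len].value = value;
        cache[cache_len].timestamp = now;
        cache_len++;
    }
}

/* Called on every flush: emits EVERY cached entry, including entries for
 * devices not seen in the current scrape. Nothing ever removes a stale
 * entry, so old veth* samples are re-sent with old timestamps. */
static void flush(void)
{
    for (int i = 0; i < cache_len; i++)
        printf("%ld node_network_transmit_compressed_total{device=\"%s\"} = %g\n",
               (long)cache[i].timestamp, cache[i].device, cache[i].value);
}

int main(void)
{
    time_t t0 = time(NULL);
    update("lo", 0, t0);
    update("veth4f24bf3", 0, t0);   /* short-lived container veth */

    /* Next scrape: the veth device is gone, only "lo" is updated... */
    update("lo", 0, t0 + 15);

    /* ...but the flush still re-emits the stale veth entry with its
     * old timestamp. */
    flush();
    return 0;
}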

To Reproduce

  • Create a Docker container with a network attached, then remove it.
  • Metrics dumped by the file output (the correct date on the host is 2024-09-18T06:23:54; note the stale 2024-09-17 timestamps on veth* devices that no longer exist):
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="lo"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="enp2s0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="wlp3s0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="docker0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="docker_gwbridge"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="veth9131746"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="vethceee685"} = 0
2024-09-17T13:02:54.163178349Z node_network_transmit_compressed_total{device="veth4f24bf3"} = 0
2024-09-17T13:02:54.163178349Z node_network_transmit_compressed_total{device="veth8d92a60"} = 0
2024-09-17T13:04:09.163346349Z node_network_transmit_compressed_total{device="vethb672c5c"} = 0
2024-09-17T13:20:24.163096825Z node_network_transmit_compressed_total{device="veth204c129"} = 0
2024-09-17T13:19:24.320445397Z node_network_transmit_compressed_total{device="veth80e9d0a"} = 0
2024-09-17T13:29:54.162821225Z node_network_transmit_compressed_total{device="veth3dc421c"} = 0
2024-09-17T13:38:09.162996512Z node_network_transmit_compressed_total{device="veth7c25eb8"} = 0
2024-09-17T13:38:09.162996512Z node_network_transmit_compressed_total{device="veth6597216"} = 0
2024-09-17T13:50:24.163177111Z node_network_transmit_compressed_total{device="veth097250a"} = 0
2024-09-17T13:53:54.162955756Z node_network_transmit_compressed_total{device="vethe738f49"} = 0
2024-09-17T13:56:24.162887438Z node_network_transmit_compressed_total{device="vethc13dfc2"} = 0
2024-09-17T13:58:24.162943862Z node_network_transmit_compressed_total{device="vethdb04c37"} = 0
2024-09-17T13:58:39.163101877Z node_network_transmit_compressed_total{device="veth49217c9"} = 0
2024-09-17T14:00:54.163102836Z node_network_transmit_compressed_total{device="vethf93b1c7"} = 0
2024-09-17T14:56:09.163110435Z node_network_transmit_compressed_total{device="veth3f0323e"} = 0
2024-09-17T15:41:39.163064514Z node_network_transmit_compressed_total{device="vetha81d561"} = 0
2024-09-17T15:54:09.163025023Z node_network_transmit_compressed_total{device="vethe85281d"} = 0
2024-09-18T06:23:54.162837303Z node_memory_MemTotal_bytes = 16392421376
  • Example Mimir log entry:
failed pushing to ingester opentelemetry-mimir-3: user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-17T13:04:09.163Z and is from series node_network_transmit_errs_total{device="vethb672c5c", host_name="xxxx.xxx.xxxx", metrics_agent="fluent-bit", metrics_source="host-metrics"}
  • A minimal configuration that should reproduce the problem is sketched below.
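Minimal configuration to trigger the problem (a trimmed-down version of the full configuration in the Environment section; untested in this reduced form):

[SERVICE]
    flush 1

[INPUT]
    Name            node_exporter_metrics
    Tag             node_metrics
    metrics         netdev
    Scrape_interval 15

[OUTPUT]
    Name  file
    Match node_metrics
    Path  /var/log
    File  metrics.log

Start and stop a container attached to a network between scrapes, then watch /var/log/metrics.log for veth* entries whose timestamps stop advancing.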

Expected behavior
No stale metrics are delivered: once a network device disappears from the host, no further samples should be emitted for it.
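For comparison, a correct dump at the same scrape time would be expected to contain only the devices currently present, every line stamped with the scrape time, e.g.:

2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="lo"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="enp2s0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="docker0"} = 0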

Environment

  • Version used: fluent-bit-3.1.7-1.x86_64
  • Configuration:
[SERVICE]
    # Flush
    # =====
    # set an interval of seconds before to flush records to a destination
    flush 1

    # Daemon
    # ======
    # instruct Fluent Bit to run in foreground or background mode.
    daemon Off

    # Log_Level
    # =========
    # Set the verbosity level of the service, values can be:
    #
    # - error
    # - warning
    # - info
    # - debug
    # - trace
    #
    # by default 'info' is set, that means it includes 'error' and 'warning'.
    log_level debug

    # Parsers File
    # ============
    # specify an optional 'Parsers' configuration file
    parsers_file parsers.conf
    parsers_file parsers-custom.conf

    # Plugins File
    # ============
    # specify an optional 'Plugins' configuration file to load external plugins.
    plugins_file plugins.conf

    # HTTP Server
    # ===========
    # Enable/Disable the built-in HTTP Server for metrics
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    # Storage
    # =======
    # Fluent Bit can use memory and filesystem buffering based mechanisms
    #
    # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
    #
    # storage metrics
    # ---------------
    # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
    # exported only if the 'http_server' option is enabled.
    storage.metrics on

    # storage.path
    # ------------
    # absolute file system path to store filesystem data buffers (chunks).
    #
    storage.path /var/lib/fluent-bit/storage

    # storage.sync
    # ------------
    # configure the synchronization mode used to store the data into the
    # filesystem. It can take the values normal or full.
    #
    storage.sync normal

    # storage.checksum
    # ----------------
    # enable the data integrity check when writing and reading data from the
    # filesystem. The storage layer uses the CRC32 algorithm.
    #
    # storage.checksum off

    # storage.backlog.mem_limit
    # -------------------------
    # if storage.path is set, Fluent Bit will look for data chunks that were
    # not delivered and are still in the storage layer, these are called
    # backlog data. This option configure a hint of maximum value of memory
    # to use when processing these records.
    #
    # storage.backlog.mem_limit 5M
    storage.total_limit_size 512M
    storage.max_chunks_up 128

# Systemd services logs (docker)
[INPUT]
    Name systemd
    Tag systemd.*
    Systemd_Filter _SYSTEMD_UNIT=docker.service
    Lowercase on
    Strip_Underscores on
    DB /var/lib/fluent-bit/cursors/systemd.sqlite
    storage.type filesystem

[INPUT]
    Name                 node_exporter_metrics
    Tag                  node_metrics
    metrics "cpu,meminfo,diskstats,filesystem,uname,stat,time,loadavg,vmstat,netdev,filefd"
    Scrape_interval      15

# Forward/fluentd input for docker services logging
[INPUT]
    Name forward
    Unix_Path /run/fluentd-forward.sock
    Unix_Perm 0666
    storage.type filesystem

[OUTPUT]
    Match systemd.*
    Name opensearch
    Host xxxxx.xxx.xxxxxx
    Port 443
    HTTP_User fluentbit
    HTTP_Passwd xxxxxxxx
    Index systemd
    Suppress_Type_Name On
    Tls On

[OUTPUT]
    Name opentelemetry
    Match node_metrics
    Host xxx.xxx.xxx
    Port 443
    Log_response_payload False
    Tls                  On
    logs_body_key $message
    logs_span_id_message_key span_id
    logs_trace_id_message_key trace_id
    logs_severity_text_message_key loglevel
    logs_severity_number_message_key lognum
    # add user-defined labels
    add_label metrics_agent fluent-bit
    add_label metrics_source host-metrics
    add_label host_name xxxx.xxx.xxx

[OUTPUT]
    Name file
    Match node_metrics
    Path /var/log
    File metrics.log
  • Environment name and version: Docker CE docker-ce-25.0.3-1.el9.x86_64
  • Server type and version: Dell Inspiron 5577
  • Operating System and version: AlmaLinux 9.3
  • Filters and plugins:
    • input node_exporter_metrics
    • output opentelemetry
    • output file (for debugging)