Chunk files in error are deleted #9363

Open
pecastro opened this issue Sep 6, 2024 · 0 comments

Bug Report

Describe the bug
Commit 4dfc256 appears to change how Fluent Bit handles data that an upstream cannot process.
Before this commit, the data stayed in its chunk file in the tail.0 directory, where it could be inspected. Now the chunk file is deleted, which leaves little room for debugging the underlying issue.
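One way to observe the difference is to list the chunk directory before and after the failing flush. The sketch below assumes the chunk files live under `<storage.path>/tail.0` (here `/var/log/flb-storage/tail.0`, taken from the configuration in this report); `list_chunks` is a hypothetical helper name, not part of Fluent Bit.

```python
from pathlib import Path


def list_chunks(storage_path, stream="tail.0"):
    """Return sorted (file name, size in bytes) pairs found in <storage_path>/<stream>.

    Assumption: chunk files for the tail input are stored in a "tail.0"
    subdirectory of storage.path, as described in this report.
    """
    stream_dir = Path(storage_path) / stream
    if not stream_dir.is_dir():
        return []
    return sorted(
        (p.name, p.stat().st_size) for p in stream_dir.iterdir() if p.is_file()
    )


if __name__ == "__main__":
    # Path taken from storage.path in the configuration of this report.
    for name, size in list_chunks("/var/log/flb-storage/"):
        print(f"{size:>10}  {name}")
```

Running it once while the upstream is rejecting records and again a few seconds later should show whether the chunk file is still present.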

To Reproduce

  • Send any log message that makes the upstream respond with a 4xx status (>= 400 and < 500).
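For a reproduction that does not depend on a particular payload, a stub upstream that rejects every request can stand in for the real one. This is a minimal sketch using only the Python standard library; the class name, `make_server`, and the default port 9880 (borrowed from the logs in this report) are illustrative assumptions. The stub speaks plain HTTP, so the http output would need `tls off` when pointed at it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysBadRequest(BaseHTTPRequestHandler):
    """Answer every POST with 400 so the client sees a 4xx error."""

    def do_POST(self):
        # Drain the request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = b"400 Bad Request\n"
        self.send_response(400)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Silence the default per-request logging on stderr.
        pass


def make_server(host="127.0.0.1", port=9880):
    """Build the stub server; the caller decides when to start serving."""
    return HTTPServer((host, port), AlwaysBadRequest)
```

Start it with `make_server().serve_forever()` and set the output's `Host` and `Port` to the stub's address.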

Expected behavior
At the very least, an option to keep those chunk files in the tail.0 directory.

Screenshots
N/A

Your Environment

  • Version used: 3.1.6-debug
  • Configuration:
  custom_parsers.conf: |
    [PARSER]
        Name docker_no_time
        Format json
        Time_Keep Off
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
  fluent-bit.conf: |
    [SERVICE]
        Daemon                              Off
        Flush                               1
        Log_Level                           error
        Parsers_File                        /fluent-bit/etc/parsers.conf
        Parsers_File                        /fluent-bit/etc/conf/custom_parsers.conf
        HTTP_Server                         On
        HTTP_Listen                         0.0.0.0
        HTTP_Port                           2020
        Health_Check                        On
        scheduler.cap                       300
        storage.path                        /var/log/flb-storage/
        storage.max_chunks_up               128
        storage.sync                        full
        storage.backlog.mem_limit           5M
        storage.delete_irrecoverable_chunks on

    [INPUT]
        Name                              tail
        Path                              /var/log/containers/*.log
        multiline.parser                  cri
        Tag                               kube.*
        Skip_Long_Lines                   On
        Skip_Empty_Lines                  On
        Buffer_Chunk_Size                 64KB
        Buffer_Max_Size                   128KB
        DB                                /var/log/flb-storage/containers.db
        storage.type                      filesystem
        storage.pause_on_chunks_overlimit on

    [INPUT]
        Name                              systemd
        Tag                               host.*
        Systemd_Filter                    _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter                    _SYSTEMD_UNIT=docker.service
        Systemd_Filter                    _SYSTEMD_UNIT=containerd.service
        DB                                /var/log/flb-storage/systemd.db
        Read_From_Tail                    On
        storage.type                      filesystem
        storage.pause_on_chunks_overlimit on

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Labels              On
        Annotations         On
        Buffer_Size         1MB
        Use_Kubelet         On
        namespace_labels    On

    [FILTER]
        Name         modify
        Match        host.*
        Rename       _HOSTNAME        hostname
        Rename       _SYSTEMD_UNIT    systemd_unit
        Rename       MESSAGE          log
        Remove_regex ^((?!hostname|systemd_unit|log).)*$

    [FILTER]
        Name         aws
        Match        host.*
        imds_version v2

    [FILTER]
        Name  modify
        Match *
        Add   environment_name   env-name
        Add   cluster_name       cluster-name

    [FILTER]
        Name   lua
        Match  *
        script /fluent-bit/scripts/index_name_filter.lua
        call   index_name

    [OUTPUT]
        Name                      http
        Alias                     an-alias-name
        Match                     *
        Host                      a-host-name.com
        Port                      443
        http_User                 ${FLUENTD_USER}
        http_Passwd               ${FLUENTD_PASSWORD}
        URI                       /a-given-tag
        Format                    json
        header                    User-Agent a-user-agent
        header_tag                FLUENT-TAG
        json_date_format          iso8601
        tls                       on
        tls.verify                off
        compress                  gzip
        Retry_Limit               no_limits
        net.dns.resolver          async
        log_suppress_interval     10s
        storage.total_limit_size  500M
        Log_Level                 error
  • Environment name and version: Kubernetes v1.28.12-eks-2f46c53
  • Server type and version: fluent-bit:3.1.6-debug
  • Operating System and version: Linux
  • Filters and plugins: tail, systemd, kubernetes, modify, http

Additional context

This removes the ability to understand why records are not being processed. Furthermore, the drop only becomes visible when the output's log level is set to warn (or more verbose); at error level the chunk files disappear with no further warning.

Logs from v3.0.3-debug

[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:49.448184964+01:00 [2024/09/06 12:06:49] [error] [output:http:http.0] fluentd-nginx.fluentd.svc.cluster.local:9880, HTTP status=400
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:49.448217822+01:00 400 Bad Request
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:49.448222868+01:00 invalid time format: value = 2024-09-03 14:51:05.064735+00:00, error_class = ArgumentError, error = invalid xmlschema format: "2024-09-03 14:51:05.064735+00:00"
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:49.448225931+01:00
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:57.421625012+01:00 [2024/09/06 12:06:57] [error] [output:http:http.0] fluentd-nginx.fluentd.svc.cluster.local:9880, HTTP status=400
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:57.421695766+01:00 400 Bad Request
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:57.421700133+01:00 invalid time format: value = 2024-09-03 14:51:05.064735+00:00, error_class = ArgumentError, error = invalid xmlschema format: "2024-09-03 14:51:05.064735+00:00"
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:06:57.421702195+01:00
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:07:03.427607796+01:00 [2024/09/06 12:07:03] [error] [output:http:http.0] fluentd-nginx.fluentd.svc.cluster.local:9880, HTTP status=400
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:07:03.427629077+01:00 400 Bad Request
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:07:03.427635573+01:00 invalid time format: value = 2024-09-03 14:51:05.064735+00:00, error_class = ArgumentError, error = invalid xmlschema format: "2024-09-03 14:51:05.064735+00:00"
[pod/fluent-bit-w5pl9/fluent-bit] 2024-09-06T13:07:03.427638355+01:00
... (the same error block keeps repeating)

Logs from v3.1.6-debug at error level

[pod/fluent-bit-9sk5h/fluent-bit] 2024-09-06T13:08:56.469635169+01:00 [2024/09/06 12:08:56] [error] [output:http:http.0] fluentd-nginx.fluentd.svc.cluster.local:9880, HTTP status=400
[pod/fluent-bit-9sk5h/fluent-bit] 2024-09-06T13:08:56.469723692+01:00 400 Bad Request
[pod/fluent-bit-9sk5h/fluent-bit] 2024-09-06T13:08:56.469735103+01:00 invalid time format: value = 2024-09-03 14:51:05.064735+00:00, error_class = ArgumentError, error = invalid xmlschema format: "2024-09-03 14:51:05.064735+00:00"
[pod/fluent-bit-9sk5h/fluent-bit] 2024-09-06T13:08:56.469738873+01:00

Logs from v3.1.6-debug at warn level

[pod/fluent-bit-zcpxc/fluent-bit] 2024-09-06T13:11:41.456623632+01:00 [2024/09/06 12:11:41] [error] [output:http:http.0] fluentd-nginx.fluentd.svc.cluster.local:9880, HTTP status=400
[pod/fluent-bit-zcpxc/fluent-bit] 2024-09-06T13:11:41.456850559+01:00 400 Bad Request
[pod/fluent-bit-zcpxc/fluent-bit] 2024-09-06T13:11:41.456857691+01:00 invalid time format: value = 2024-09-03 14:51:05.064735+00:00, error_class = ArgumentError, error = invalid xmlschema format: "2024-09-03 14:51:05.064735+00:00"
[pod/fluent-bit-zcpxc/fluent-bit] 2024-09-06T13:11:41.456860695+01:00
[pod/fluent-bit-zcpxc/fluent-bit] 2024-09-06T13:11:41.456864332+01:00 [2024/09/06 12:11:41] [ warn] [output:http:http.0] could not flush records to fluentd-nginx.fluentd.svc.cluster.local:9880 (http_do=0), chunk will not be retried
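Since the warn line above is the only sign that a chunk was dropped, a small log scan can make it easier to spot. The sketch below matches the message format seen in the v3.1.6 warn-level logs in this report; the regular expression and the helper name are assumptions based on that single sample, not on a documented log format.

```python
import re

# Pattern derived from the one warn line quoted in this report.
DROPPED = re.compile(
    r"\[\s*warn\]\s+\[output:(?P<plugin>[^:\]]+):(?P<instance>[^\]]+)\]"
    r".*chunk will not be retried"
)


def dropped_chunk_outputs(lines):
    """Return the output instance name for every 'chunk will not be retried' line."""
    hits = []
    for line in lines:
        match = DROPPED.search(line)
        if match:
            hits.append(match.group("instance"))
    return hits
```

Feeding it the pod's log lines returns one entry per dropped chunk, for example `http.0` for the line above. At error level the list stays empty because the warn line is never emitted.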