Not able to delete the chunk and re-schedule retry #9362

Open

ankitpanwar174 opened this issue Sep 6, 2024 · 0 comments

Bug Report

Describe the bug

  1. Initially, I configured Fluent Bit to retry the same chunk up to 100 times, with retry intervals of 60 to 65 seconds. However, Fluent Bit was disconnected from the watch due to a TLS certificate issue. After running with this configuration for nearly five hours, I changed the retry settings to 3 attempts with intervals of 1800 to 1805 seconds (see the timing sketch after this list). Following this adjustment, Fluent Bit neither retried any chunk three times within the specified interval nor dropped any chunks.

  2. In the second instance, after running Fluent Bit with the aforementioned configuration for one day, a new error appeared:
    2024-09-06T12:27:34.326321607Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] db: could not create 'in_tail_files' table
    2024-09-06T12:27:34.326324155Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] could not open/create database.
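
For reference, a minimal sketch of how I understand the retry timing to work, assuming the scheduler picks a random wait bounded by scheduler.base and scheduler.cap and gives up on a chunk once Retry_Limit is exceeded (the exponential-backoff formula here is my assumption for illustration, not taken from the Fluent Bit source):

    import random

    def retry_wait(base: int, cap: int, attempt: int) -> int:
        # Assumed backoff: a random wait between base and min(base * 2**attempt, cap) seconds.
        upper = min(base * (2 ** attempt), cap)
        return random.randint(base, max(base, upper))

    # With the second configuration (scheduler.base 1800, scheduler.cap 1805, Retry_Limit 3),
    # each retry should be re-scheduled roughly 1800-1805 seconds out, and the chunk should
    # be dropped after the third failed attempt.
    for attempt in range(3):
        print(f"retry {attempt + 1} scheduled in ~{retry_wait(1800, 1805, attempt)} seconds")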

To Reproduce
1- Set the scheduler interval (scheduler.base / scheduler.cap)
2- Set the retry option (Retry_Limit)
service: |
    [SERVICE]
    Daemon Off
    Flush {{ .Values.flush }}
    Log_Level {{ .Values.logLevel }}
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port {{ .Values.metricsPort }}
    Health_Check On
    scheduler.base 60
    scheduler.cap 65
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum Off
    storage.backlog.mem_limit {{ .Values.storageBacklogMemlimit }}

outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days 30
    auto_create_group true
    Retry_Limit 100
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}

3- Make sure the Fluent Bit pod is up and running
4- Make sure Fluent Bit is not able to send data to the cloud by doing one of the steps below
4.a- Take the proxy server down
4.b- Disconnect the internet connection where Fluent Bit is running
4.c- Change any secret used in the Fluent Bit configuration
5- Run Fluent Bit for 3 to 4 hours
6- Change the configuration as below
service: |
    [SERVICE]
    Daemon Off
    Flush {{ .Values.flush }}
    Log_Level {{ .Values.logLevel }}
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port {{ .Values.metricsPort }}
    Health_Check On
    scheduler.base 1800
    scheduler.cap 1805
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum Off
    storage.backlog.mem_limit {{ .Values.storageBacklogMemlimit }}
    storage.max_chunks_up 28
Output settings
outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days 30
    auto_create_group true
    Retry_Limit 3
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}
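
To check whether retries are actually being exhausted and chunks dropped, the built-in monitoring endpoint can be queried, since HTTP_Server is already On in the service section above. A minimal sketch, assuming the HTTP server is reachable on localhost port 2020 (the actual port comes from HTTP_Port / .Values.metricsPort) and exposes per-output retries, retries_failed and dropped_records counters under /api/v1/metrics:

    import json
    import urllib.request

    # Assumption: Fluent Bit's monitoring API is reachable on localhost:2020;
    # adjust the port to match HTTP_Port in the service configuration.
    with urllib.request.urlopen("http://localhost:2020/api/v1/metrics") as resp:
        metrics = json.load(resp)

    # Print the retry/drop counters for every configured output plugin.
    for name, stats in metrics.get("output", {}).items():
        print(name,
              "retries:", stats.get("retries"),
              "retries_failed:", stats.get("retries_failed"),
              "dropped_records:", stats.get("dropped_records"))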

Error logs
*********************** Re-scheduling the retry itself *********************
[2024/09/04 11:57:29] [ info] [task] re-schedule retry=0x7fa0ce47c308 754 in the next 1803 seconds
[2024/09/04 11:57:30] [ info] [task] re-schedule retry=0x7fa0ce47ce70 748 in the next 1801 seconds
[2024/09/04 11:57:30] [ info] [task] re-schedule retry=0x7fa0ce47cf38 747 in the next 1802 seconds
[2024/09/04 11:57:32] [ info] [task] re-schedule retry=0x7fa0ce479bf8 756 in the next 1803 seconds
[2024/09/04 11:57:33] [ info] [task] re-schedule retry=0x7fa0ce47c4c0 755 in the next 1806 seconds
[2024/09/04 11:57:34] [ info] [task] re-schedule retry=0x7fa0ce5e2000 757 in the next 1804 seconds
[2024/09/04 11:57:38] [ info] [task] re-schedule retry=0x7fa0ce479fe0 761 in the next 1801 seconds
[2024/09/04 11:57:39] [ info] [task] re-schedule retry=0x7fa0ce479d60 760 in the next 1804 seconds
[2024/09/04 11:57:41] [ info] [task] re-schedule retry=0x7fa0ce5e2230 767 in the next 1806 seconds

**************************** DB Issue **********************************
2024-09-06T12:27:34.325971968Z [2024/09/06 12:27:34] [ info] [input:tail:audit_logs] initializing
2024-09-06T12:27:34.325981276Z [2024/09/06 12:27:34] [ info] [input:tail:audit_logs] storage_strategy='filesystem' (memory + filesystem)
2024-09-06T12:27:34.326316807Z [2024/09/06 12:27:34] [error] [sqldb] error=disk I/O error
2024-09-06T12:27:34.326321607Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] db: could not create 'in_tail_files' table
2024-09-06T12:27:34.326324155Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] could not open/create database
2024-09-06T12:27:34.326367790Z [2024/09/06 12:27:34] [error] failed initialize input tail.1
2024-09-06T12:27:34.326371266Z [2024/09/06 12:27:34] [error] [engine] input initialization failed
2024-09-06T12:27:34.326477580Z [2024/09/06 12:27:34] [error] [lib] backend failed


Expected behavior
Fluent Bit should delete the older chunk after 3 retries and should not crash

Your Environment

  • Version used: 2.2.2

  • Configuration:

  • Environment name and version (e.g. Kubernetes? What version?): Kube-1.28

  • Server type and version:

  • Operating System and version: SUSE 5.5:2.0.4

• Filters and plugins:

    filters: |

    [FILTER]
    Name parser
    Match application.*
    Key_name log
    Parser crio

    [FILTER]
    Name grep
    Match sysd.generic
    Exclude SYSLOG_FACILITY (4|10)$
    Regex PRIORITY [0-4]$

    [FILTER]
    Name kubernetes
    Match application.*
    Kube_URL https://kubernetes.default.svc:443
    Merge_Log On
    Merge_Log_Key log_processed
    Keep_Log false
    K8S-Logging.Parser On
    K8S-Logging.Exclude false
    Buffer_Size 0
    Kube_Tag_Prefix application.var.log.containers.
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250

    [FILTER]
    Name kubernetes
    Match kubernetes.components.core.*
    Kube_URL https://kubernetes.default.svc:443
    Merge_Log On
    Merge_Log_Key log_processed
    Keep_Log false
    K8S-Logging.Parser On
    K8S-Logging.Exclude false
    Buffer_Size 0
    Kube_Tag_Prefix kubernetes.components.core.var.log.containers.
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250

    [FILTER]
    Name modify
    Match *
    Add cluster_id ${CLUSTER_ID}

    [FILTER]
    Name modify
    Match kubeaudit.*
    Add host_name ${HOST_NAME}

    [FILTER]
    Name modify
    Match kubernetes.components.kubelet.*
    Add host_name ${HOST_NAME}

    -- https://docs.fluentbit.io/manual/pipeline/outputs

    outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match kubeaudit.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/kubernetes_audit_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.auditLogRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.auditLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match kubernetes.components.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/core_kubernetes_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.coreKubernetesLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match sysd.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/operating_system_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.osLogsFileBufferLimit }}

    [OUTPUT]
    Name prometheus_exporter
    Match *_metrics
