Not able to delete the chunk and re-schedule retry #9362

Open

ankitpanwar174 opened this issue Sep 6, 2024 · 0 comments

Bug Report

Describe the bug

  1. Initially, I configured Fluent Bit to retry the same chunk up to 100 times, with retry intervals of 60 to 65 seconds. However, Fluent Bit was disconnected from the watch due to a TLS certificate issue. After running with this configuration for nearly five hours, I changed the retry settings to 3 attempts with intervals of 1800 to 1805 seconds (see the timing sketch after this list). Following this adjustment, Fluent Bit neither retried any chunk three times within the specified interval nor dropped any chunks.

  2. In the second instance, after running Fluent Bit with the aforementioned configuration for one day, a new error appeared:
    2024-09-06T12:27:34.326321607Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] db: could not create 'in_tail_files' table
    2024-09-06T12:27:34.326324155Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] could not open/create database.
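
For reference, a minimal sketch of how I understand the retry timing to work, assuming the scheduler picks a random wait bounded by scheduler.base and scheduler.cap and gives up on a chunk once Retry_Limit is exceeded (the exponential-backoff formula here is my assumption for illustration, not taken from the Fluent Bit source):

    import random

    def retry_wait(base: int, cap: int, attempt: int) -> int:
        # Assumed backoff: a random wait between base and min(base * 2**attempt, cap) seconds.
        upper = min(base * (2 ** attempt), cap)
        return random.randint(base, max(base, upper))

    # With the second configuration (scheduler.base 1800, scheduler.cap 1805, Retry_Limit 3),
    # each retry should be re-scheduled roughly 1800-1805 seconds out, and the chunk should
    # be dropped after the third failed attempt.
    for attempt in range(3):
        print(f"retry {attempt + 1} scheduled in ~{retry_wait(1800, 1805, attempt)} seconds")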

To Reproduce
1- Set the scheduler interval (scheduler.base / scheduler.cap)
2- Set the retry option (Retry_Limit)
service: |
    [SERVICE]
    Daemon Off
    Flush {{ .Values.flush }}
    Log_Level {{ .Values.logLevel }}
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port {{ .Values.metricsPort }}
    Health_Check On
    scheduler.base 60
    scheduler.cap 65
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum Off
    storage.backlog.mem_limit {{ .Values.storageBacklogMemlimit }}

outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days 30
    auto_create_group true
    Retry_Limit 100
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}

3- Make sure the Fluent Bit pod is up and running
4- Make sure Fluent Bit is not able to send data to the cloud by doing one of the steps below
4.a- Take the proxy server down
4.b- Disconnect the internet connection where Fluent Bit is running
4.c- Change any secret used in the Fluent Bit configuration
5- Run Fluent Bit for 3 to 4 hours
6- Change the configuration as below
service: |
    [SERVICE]
    Daemon Off
    Flush {{ .Values.flush }}
    Log_Level {{ .Values.logLevel }}
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port {{ .Values.metricsPort }}
    Health_Check On
    scheduler.base 1800
    scheduler.cap 1805
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum Off
    storage.backlog.mem_limit {{ .Values.storageBacklogMemlimit }}
    storage.max_chunks_up 28
Output settings
outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days 30
    auto_create_group true
    Retry_Limit 3
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}
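
To check whether retries are actually being exhausted and chunks dropped, the built-in monitoring endpoint can be queried, since HTTP_Server is already On in the service section above. A minimal sketch, assuming the HTTP server is reachable on localhost port 2020 (the actual port comes from HTTP_Port / .Values.metricsPort) and exposes per-output retries, retries_failed and dropped_records counters under /api/v1/metrics:

    import json
    import urllib.request

    # Assumption: Fluent Bit's monitoring API is reachable on localhost:2020;
    # adjust the port to match HTTP_Port in the service configuration.
    with urllib.request.urlopen("http://localhost:2020/api/v1/metrics") as resp:
        metrics = json.load(resp)

    # Print the retry/drop counters for every configured output plugin.
    for name, stats in metrics.get("output", {}).items():
        print(name,
              "retries:", stats.get("retries"),
              "retries_failed:", stats.get("retries_failed"),
              "dropped_records:", stats.get("dropped_records"))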

Error logs
*********************** Re-scheduling the retry itself *********************
[2024/09/04 11:57:29] [ info] [task] re-schedule retry=0x7fa0ce47c308 754 in the next 1803 seconds
[2024/09/04 11:57:30] [ info] [task] re-schedule retry=0x7fa0ce47ce70 748 in the next 1801 seconds
[2024/09/04 11:57:30] [ info] [task] re-schedule retry=0x7fa0ce47cf38 747 in the next 1802 seconds
[2024/09/04 11:57:32] [ info] [task] re-schedule retry=0x7fa0ce479bf8 756 in the next 1803 seconds
[2024/09/04 11:57:33] [ info] [task] re-schedule retry=0x7fa0ce47c4c0 755 in the next 1806 seconds
[2024/09/04 11:57:34] [ info] [task] re-schedule retry=0x7fa0ce5e2000 757 in the next 1804 seconds
[2024/09/04 11:57:38] [ info] [task] re-schedule retry=0x7fa0ce479fe0 761 in the next 1801 seconds
[2024/09/04 11:57:39] [ info] [task] re-schedule retry=0x7fa0ce479d60 760 in the next 1804 seconds
[2024/09/04 11:57:41] [ info] [task] re-schedule retry=0x7fa0ce5e2230 767 in the next 1806 seconds

**************************** DB Issue **********************************
2024-09-06T12:27:34.325971968Z [2024/09/06 12:27:34] [ info] [input:tail:audit_logs] initializing
2024-09-06T12:27:34.325981276Z [2024/09/06 12:27:34] [ info] [input:tail:audit_logs] storage_strategy='filesystem' (memory + filesystem)
2024-09-06T12:27:34.326316807Z [2024/09/06 12:27:34] [error] [sqldb] error=disk I/O error
2024-09-06T12:27:34.326321607Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] db: could not create 'in_tail_files' table
2024-09-06T12:27:34.326324155Z [2024/09/06 12:27:34] [error] [input:tail:audit_logs] could not open/create database
2024-09-06T12:27:34.326367790Z [2024/09/06 12:27:34] [error] failed initialize input tail.1
2024-09-06T12:27:34.326371266Z [2024/09/06 12:27:34] [error] [engine] input initialization failed
2024-09-06T12:27:34.326477580Z [2024/09/06 12:27:34] [error] [lib] backend failed


Expected behavior
Fluent Bit should delete the older chunk after 3 retries and should not crash

Your Environment

  • Version used: 2.2.2

  • Configuration:

  • Environment name and version (e.g. Kubernetes? What version?): Kube-1.28

  • Server type and version:

  • Operating System and version: SUSE 5.5:2.0.4

• Filters and plugins:

    filters: |

    [FILTER]
    Name parser
    Match application.*
    Key_name log
    Parser crio

    [FILTER]
    Name grep
    Match sysd.generic
    Exclude SYSLOG_FACILITY (4|10)$
    Regex PRIORITY [0-4]$

    [FILTER]
    Name kubernetes
    Match application.*
    Kube_URL https://kubernetes.default.svc:443
    Merge_Log On
    Merge_Log_Key log_processed
    Keep_Log false
    K8S-Logging.Parser On
    K8S-Logging.Exclude false
    Buffer_Size 0
    Kube_Tag_Prefix application.var.log.containers.
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250

    [FILTER]
    Name kubernetes
    Match kubernetes.components.core.*
    Kube_URL https://kubernetes.default.svc:443
    Merge_Log On
    Merge_Log_Key log_processed
    Keep_Log false
    K8S-Logging.Parser On
    K8S-Logging.Exclude false
    Buffer_Size 0
    Kube_Tag_Prefix kubernetes.components.core.var.log.containers.
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250

    [FILTER]
    Name modify
    Match *
    Add cluster_id ${CLUSTER_ID}

    [FILTER]
    Name modify
    Match kubeaudit.*
    Add host_name ${HOST_NAME}

    [FILTER]
    Name modify
    Match kubernetes.components.kubelet.*
    Add host_name ${HOST_NAME}

    -- https://docs.fluentbit.io/manual/pipeline/outputs

    outputs: |
    [OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/application_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.containerLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match kubeaudit.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/kubernetes_audit_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.auditLogRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.auditLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match kubernetes.components.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/core_kubernetes_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.coreKubernetesLogsFileBufferLimit }}

    [OUTPUT]
    Name cloudwatch_logs
    Match sysd.*
    region {{ .Values.awsRegion }}
    log_group_name /aws/containerinsights/${CLUSTER_ID}/operating_system_logs
    log_stream_prefix ${HOST_NAME}-
    log_retention_days {{ .Values.logRetentionDays }}
    auto_create_group true
    Retry_Limit {{ .Values.retryLimit }}
    storage.total_limit_size {{ .Values.osLogsFileBufferLimit }}

    [OUTPUT]
    Name prometheus_exporter
    Match *_metrics
