Add Go error tracking #1004

marctc · 2024-07-09T15:50:21Z

This PR adds the ability to automatically correct Go errors happening during an HTTP request.
The BPF probe only collects the last error, which is reported using fmt.Errorf, happened during
the request.

The stack trace and error message are used to add an exception event to the Span of the request:
https://opentelemetry.io/docs/specs/semconv/exceptions/exceptions-spans/

The env var BEYLA_TRACES_REPORT_EXCEPTION_EVENTS allows to enable this feature, which is
disabled by default. In a follow up PR i will add docs for config option and details of the feature.

Special thanks to @tpaschalis @inkel @rlankfo which contributed to this project.

Fixes #874

grcevski

Very cool! I think this will generally work! I left a few comments.

bpf/headers/profile.h

go.mod

codecov-commenter · 2024-07-10T10:48:13Z

Codecov Report

Attention: Patch coverage is 88.29787% with 11 lines in your changes missing coverage. Please review.

Project coverage is 81.73%. Comparing base (27dc018) to head (b4f9c50).

Files	Patch %	Lines
pkg/internal/ebpf/common/spanner.go	70.58%	4 Missing and 1 partial ⚠️
pkg/internal/goexec/instructions.go	33.33%	4 Missing ⚠️
pkg/internal/ebpf/common/pids.go	86.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1004      +/-   ##
==========================================
+ Coverage   81.68%   81.73%   +0.05%     
==========================================
  Files         137      137              
  Lines       10923    10970      +47     
==========================================
+ Hits         8922     8966      +44     
- Misses       1494     1497       +3     
  Partials      507      507

Flag	Coverage Δ
integration-test	`56.47% <72.34%> (+0.09%)`	⬆️
k8s-integration-test	`58.89% <57.44%> (-0.18%)`	⬇️
oats-test	`37.36% <60.63%> (+0.03%)`	⬆️
unittests	`50.99% <31.91%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

grcevski · 2024-07-10T13:04:19Z

There's one concern I came with thinking more about the symbol table. We essentially need to keep the symbol table live for the duration of the executable being instrumented. We should figure out how much memory is this for something reasonably large, like the OTel Collector, Alloy or Prometheus. If we were to instrument 10 applications, if this symbol table memory footprint is very large, we'll easily balloon the overall Beyla memory usage.

I'm not sure if there's a solution, but perhaps this can be optionally enabled for certain applications, if it's large memory footprint.

mariomac · 2024-07-17T09:11:23Z

It mostly LGTM, thanks for the great work!

I have a proposal for some improvements with respect to memory (as previously stated by @grcevski ). They don't have to be done in this PR but I'd suggest to do them in another PR.

Instead of directly storing the symtabs map in the PIDs filter, create an interface like:

type symtabs struct {
   Get(pid) *Symtab
   Put(pid, execPath, *Symtab)
   Delete(pid)
}

Then, provide two implementations, one using a hash map (to be used when BEYLA_TRACES_REPORT_EXCEPTION_EVENTS=true) and another that does nothing (to be used when BEYLA_TRACES_REPORT_EXCEPTION_EVENTS=false). That would prevent unnecessarily storing the symtabs when they we don't need them.

POST-EDIT: ah, I just saw that you already nil the symtabs if error reporting is disabled!

In the previous definition of Put, I added an executable Path, or something that uniquely identifies each symtab. That would allow to store only one symtab per executable. If I understood the code correctly, if Beyla instruments 4 gunicorn processes, it will store 4 copies of the gunicorn symtab, right? Then the Delete method would need to have a reference count for each symtab, to remove it only when all the pids pointing to it have been removed.

If I misunderstood the way the code works, please forget point 2.

grcevski · 2024-09-09T13:57:45Z

pkg/internal/ebpf/common/spanner.go

+	var stacktrace strings.Builder
+	if symTab != nil && trace.Error.UstackSz > 0 {
+		for _, pc := range trace.Error.Ustack {
+			f := symTab.PCToFunc(pc)


I think we need to check how this works with multiple processes launched from the same executable. I'm wondering if the symbol table will be meant for the address range of the initial executable which launched. What I mean by this is that I think you must know the executable base memory address to be able to do the PC to symbol translation, because technically the symbol table has only offsets from the executable start, but the PC that comes from eBPF is the absolute memory address of the instruction pointer.

Let's take the following example:

Imagine we have a symbol table with one function foo() at offset 0x100.

We launch the executable once and its base code virtual address is 0x20000.

We launch the executable one more time and this one starts at virtual address 0x30000.

We get exception at 0x200105 and then at 0x300105. Technically that's exception at two different addresses, but it's the same exception in two different processes.

We'll only parse the symbol table once, since we detect that the executable is already instrumented.

I think this code translation must take into account the base memory address of the executable, I think the symbol table is not sufficient.

grcevski reviewed Jul 9, 2024

View reviewed changes

bpf/headers/profile.h Outdated Show resolved Hide resolved

go.mod Outdated Show resolved Hide resolved

marctc force-pushed the add_error_tracking branch 2 times, most recently from 04f16b5 to 2513121 Compare July 11, 2024 14:28

marctc marked this pull request as ready for review July 12, 2024 14:12

marctc requested a review from mariomac as a code owner July 12, 2024 14:12

marctc requested a review from grcevski July 12, 2024 14:12

marctc changed the title ~~Add error tracking~~ Add Go error tracking Jul 12, 2024

marctc marked this pull request as draft July 12, 2024 15:51

marctc marked this pull request as ready for review July 15, 2024 13:57

marctc force-pushed the add_error_tracking branch from dd270be to 12a90af Compare July 16, 2024 16:18

Capture Go errors automatically

b4f9c50

marctc force-pushed the add_error_tracking branch from 12a90af to b4f9c50 Compare July 16, 2024 17:10

grcevski added the 1.9.0 label Jul 17, 2024

marctc added 2.0 Needed for Beyla 2.0 1.9.0 and removed 1.9.0 2.0 Needed for Beyla 2.0 labels Jul 18, 2024

grcevski reviewed Sep 9, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Go error tracking #1004

Add Go error tracking #1004

marctc commented Jul 9, 2024 •

edited

Loading

grcevski left a comment

codecov-commenter commented Jul 10, 2024 •

edited

Loading

grcevski commented Jul 10, 2024

mariomac commented Jul 17, 2024 •

edited

Loading

grcevski Sep 9, 2024

Add Go error tracking #1004

Are you sure you want to change the base?

Add Go error tracking #1004

Conversation

marctc commented Jul 9, 2024 • edited Loading

grcevski left a comment

Choose a reason for hiding this comment

codecov-commenter commented Jul 10, 2024 • edited Loading

Codecov Report

grcevski commented Jul 10, 2024

mariomac commented Jul 17, 2024 • edited Loading

grcevski Sep 9, 2024

Choose a reason for hiding this comment

marctc commented Jul 9, 2024 •

edited

Loading

codecov-commenter commented Jul 10, 2024 •

edited

Loading

mariomac commented Jul 17, 2024 •

edited

Loading