Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Go error tracking #1004

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add Go error tracking #1004

wants to merge 1 commit into from

Conversation

marctc
Copy link
Contributor

@marctc marctc commented Jul 9, 2024

This PR adds the ability to automatically correct Go errors happening during an HTTP request.
The BPF probe only collects the last error, which is reported using fmt.Errorf, happened during
the request.

The stack trace and error message are used to add an exception event to the Span of the request:
https://opentelemetry.io/docs/specs/semconv/exceptions/exceptions-spans/

The env var BEYLA_TRACES_REPORT_EXCEPTION_EVENTS allows to enable this feature, which is
disabled by default. In a follow up PR i will add docs for config option and details of the feature.

Special thanks to @tpaschalis @inkel @rlankfo which contributed to this project.

Fixes #874

Copy link
Contributor

@grcevski grcevski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I think this will generally work! I left a few comments.

bpf/headers/profile.h Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Jul 10, 2024

Codecov Report

Attention: Patch coverage is 88.29787% with 11 lines in your changes missing coverage. Please review.

Project coverage is 81.73%. Comparing base (27dc018) to head (b4f9c50).

Files Patch % Lines
pkg/internal/ebpf/common/spanner.go 70.58% 4 Missing and 1 partial ⚠️
pkg/internal/goexec/instructions.go 33.33% 4 Missing ⚠️
pkg/internal/ebpf/common/pids.go 86.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1004      +/-   ##
==========================================
+ Coverage   81.68%   81.73%   +0.05%     
==========================================
  Files         137      137              
  Lines       10923    10970      +47     
==========================================
+ Hits         8922     8966      +44     
- Misses       1494     1497       +3     
  Partials      507      507              
Flag Coverage Δ
integration-test 56.47% <72.34%> (+0.09%) ⬆️
k8s-integration-test 58.89% <57.44%> (-0.18%) ⬇️
oats-test 37.36% <60.63%> (+0.03%) ⬆️
unittests 50.99% <31.91%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@grcevski
Copy link
Contributor

There's one concern I came with thinking more about the symbol table. We essentially need to keep the symbol table live for the duration of the executable being instrumented. We should figure out how much memory is this for something reasonably large, like the OTel Collector, Alloy or Prometheus. If we were to instrument 10 applications, if this symbol table memory footprint is very large, we'll easily balloon the overall Beyla memory usage.

I'm not sure if there's a solution, but perhaps this can be optionally enabled for certain applications, if it's large memory footprint.

@marctc marctc force-pushed the add_error_tracking branch 2 times, most recently from 04f16b5 to 2513121 Compare July 11, 2024 14:28
@marctc marctc marked this pull request as ready for review July 12, 2024 14:12
@marctc marctc requested a review from mariomac as a code owner July 12, 2024 14:12
@marctc marctc requested a review from grcevski July 12, 2024 14:12
@marctc marctc changed the title Add error tracking Add Go error tracking Jul 12, 2024
@marctc marctc marked this pull request as draft July 12, 2024 15:51
@marctc marctc marked this pull request as ready for review July 15, 2024 13:57
@mariomac
Copy link
Contributor

mariomac commented Jul 17, 2024

It mostly LGTM, thanks for the great work!

I have a proposal for some improvements with respect to memory (as previously stated by @grcevski ). They don't have to be done in this PR but I'd suggest to do them in another PR.

  1. Instead of directly storing the symtabs map in the PIDs filter, create an interface like:
type symtabs struct {
   Get(pid) *Symtab
   Put(pid, execPath, *Symtab)
   Delete(pid)
}

Then, provide two implementations, one using a hash map (to be used when BEYLA_TRACES_REPORT_EXCEPTION_EVENTS=true) and another that does nothing (to be used when BEYLA_TRACES_REPORT_EXCEPTION_EVENTS=false). That would prevent unnecessarily storing the symtabs when they we don't need them.

POST-EDIT: ah, I just saw that you already nil the symtabs if error reporting is disabled!

  1. In the previous definition of Put, I added an executable Path, or something that uniquely identifies each symtab. That would allow to store only one symtab per executable. If I understood the code correctly, if Beyla instruments 4 gunicorn processes, it will store 4 copies of the gunicorn symtab, right? Then the Delete method would need to have a reference count for each symtab, to remove it only when all the pids pointing to it have been removed.

If I misunderstood the way the code works, please forget point 2.

@grcevski grcevski added the 1.9.0 label Jul 17, 2024
@marctc marctc added 2.0 Needed for Beyla 2.0 1.9.0 and removed 1.9.0 2.0 Needed for Beyla 2.0 labels Jul 18, 2024
var stacktrace strings.Builder
if symTab != nil && trace.Error.UstackSz > 0 {
for _, pc := range trace.Error.Ustack {
f := symTab.PCToFunc(pc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check how this works with multiple processes launched from the same executable. I'm wondering if the symbol table will be meant for the address range of the initial executable which launched. What I mean by this is that I think you must know the executable base memory address to be able to do the PC to symbol translation, because technically the symbol table has only offsets from the executable start, but the PC that comes from eBPF is the absolute memory address of the instruction pointer.

Let's take the following example:

  1. Imagine we have a symbol table with one function foo() at offset 0x100.
  2. We launch the executable once and its base code virtual address is 0x20000.
  3. We launch the executable one more time and this one starts at virtual address 0x30000.
  4. We get exception at 0x200105 and then at 0x300105. Technically that's exception at two different addresses, but it's the same exception in two different processes.

We'll only parse the symbol table once, since we detect that the executable is already instrumented.

I think this code translation must take into account the base memory address of the executable, I think the symbol table is not sufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Capture exceptions and errors in Go
4 participants