RFC: eBPF architecture #394

Closed
JeroenSoeters opened this issue Feb 12, 2023 · 5 comments
Labels: eBPF · Subsystem: Observe (Logging. Monitoring. Streams. Metrics. Tracing.)

Comments

@JeroenSoeters (Contributor) commented Feb 12, 2023

Background

In Aurae we leverage eBPF to surface kernel-level information. Today we expose POSIX signals as they are generated by the kernel, but many more use cases will likely be built on the eBPF subsystem in the future: syscall tracing, tracing the OOM killer, etc.

Problem

The eBPF probe we have traces every process on the host. We want to be able to narrow the scope of the eBPF instrumentation to one or more user-specified Aurae workloads. The different workload types Aurae is planning to support are executables, cells, pods, virtual machines, and spawned Aurae instances [1]. We also have to consider that users could spawn (fork) other processes from executables that they schedule via the Aurae API. The Aurae daemon, as it stands now, doesn't have knowledge of these processes. However, when we instrument a workload that is running such a forked process, we should surface instrumentation for this process and thus be able to associate it with the Aurae workload it is running in.

There are two main problems:

  1. We need a way to associate the instrumentation with an Aurae workload.
  2. In cases where the workload has unshared one or more namespaces, and the instrumentation contains information from such a namespace, we need to map the host-level information from the eBPF probes to the namespaced version of that information.

There is additional complexity with (1), as there might be no meaningful way to create this association from kernel facilities when processing that instrumentation. Typically we should be able to associate instrumentation with a workload via the cgroup: the instrumentation will be associated with a process (PID), and we could look up the cgroup of that process in procfs to create the association with an Aurae workload. However, when we receive an event pertaining to the exit of a process (signal/signal_generate with signr 9, sched/sched_process_exit, or maybe a kprobe that traces the oom_kill_process kernel function), that process won't be registered in procfs anymore and we won't be able to do the lookup to determine which workload the instrumentation is associated with.

This complexity exists for (2) as well in cases where the PID namespace is unshared as we won't be able to look up the NSPid anymore from procfs after a process has exited.
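
For reference, the procfs lookups described above are roughly the following. This is a minimal sketch (the helper names are mine), and both lookups only work while the process still exists in /proc, which is exactly the limitation described:

```rust
use std::fs;
use std::io;

// Sketch only: resolve the cgroup path of a live process from /proc/<pid>/cgroup.
// On cgroup v2 the file contains a single "0::<path>" line.
fn cgroup_of(pid: u32) -> io::Result<Option<String>> {
    let contents = fs::read_to_string(format!("/proc/{pid}/cgroup"))?;
    Ok(contents
        .lines()
        .find_map(|line| line.strip_prefix("0::").map(|path| path.to_string())))
}

// Sketch only: read the namespaced PIDs of a live process from the "NSpid:"
// field in /proc/<pid>/status (host PID first, innermost namespace last).
fn nspids_of(pid: u32) -> io::Result<Vec<u32>> {
    let status = fs::read_to_string(format!("/proc/{pid}/status"))?;
    Ok(status
        .lines()
        .find(|line| line.starts_with("NSpid:"))
        .map(|line| {
            line.split_whitespace()
                .skip(1)
                .filter_map(|field| field.parse().ok())
                .collect()
        })
        .unwrap_or_default())
}
```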

Proposal 1: Enrich every instrumentation with cgroup and nspid information

We could augment every kernel-level event with cgroup information. There is a BPF helper u64 bpf_get_current_cgroup_id(void) [2] that returns a cgroup id, and we should be able to map this to a cgroup path and thus associate it with a workload. There is a similar helper for getting the nspid from the kernel: long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size).
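
To sketch what the userspace side of that could look like: assuming cgroup v2 is mounted at /sys/fs/cgroup and a recent kernel where the id returned by bpf_get_current_cgroup_id() matches the inode number of the cgroup directory, the daemon could resolve the id to a path (and from there to the workload whose cgroup it created) roughly like this. Illustrative only, not the final design:

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

// Sketch: walk the cgroup v2 hierarchy and find the directory whose inode
// number matches the cgroup id reported by the eBPF probe. Assumes cgroup v2
// at /sys/fs/cgroup and that the cgroup id equals the cgroupfs inode number.
fn cgroup_path_for_id(root: &Path, cgroup_id: u64) -> Option<PathBuf> {
    if fs::metadata(root).ok()?.ino() == cgroup_id {
        return Some(root.to_path_buf());
    }
    for entry in fs::read_dir(root).ok()?.flatten() {
        if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
            if let Some(found) = cgroup_path_for_id(&entry.path(), cgroup_id) {
                return Some(found);
            }
        }
    }
    None
}

// Usage: map the id carried by an event back to an Aurae cell/pod cgroup path.
// let path = cgroup_path_for_id(Path::new("/sys/fs/cgroup"), event_cgroup_id);
```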

Drawbacks/constraints for proposal 1

  • This doesn't actually solve the namespace problem (2). (Edit: found a helper for getting the nspid and updated the proposal ☝️.)
  • I'm not entirely sure whether this works with cgroups v1, as the cgroup id could theoretically map to multiple cgroups across the v1 hierarchies.
  • We need a 5.7+ kernel for those BPF helpers.

Proposal 2: Leverage our cache and use eBPF to keep the cache up-to-date

We could start registering executables in a cache in the daemon. We could then leverage eBPF and attach a kprobe to syscall__execve to reverse-register, in that cache, the processes forked by Aurae-scheduled executables. Once all the executables are in the cache, we can do the workload association via the cache. A rough sketch of such a cache is shown below.
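
The sketch (names and shapes are mine, purely illustrative of the idea):

```rust
use std::collections::HashMap;

// Hypothetical identifier for an Aurae workload (cell, pod, executable, ...).
#[derive(Clone, Debug)]
struct WorkloadId(String);

// Hypothetical in-daemon cache mapping host PIDs to the workload they belong to.
#[derive(Default)]
struct ProcessCache {
    by_pid: HashMap<u32, WorkloadId>,
}

impl ProcessCache {
    // Called when auraed itself schedules an executable.
    fn register(&mut self, pid: u32, workload: WorkloadId) {
        self.by_pid.insert(pid, workload);
    }

    // Called from the execve/fork probe: if the parent is a known workload
    // process, "reverse-register" the child under the same workload.
    fn register_child(&mut self, parent_pid: u32, child_pid: u32) {
        if let Some(workload) = self.by_pid.get(&parent_pid).cloned() {
            self.by_pid.insert(child_pid, workload);
        }
    }

    // Association used when instrumentation arrives for a PID.
    fn workload_of(&self, pid: u32) -> Option<&WorkloadId> {
        self.by_pid.get(&pid)
    }

    // Called from an exit probe so the cache doesn't grow without bound.
    fn remove(&mut self, pid: u32) {
        self.by_pid.remove(&pid);
    }
}
```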

Drawbacks/constraints for proposal 2

  • This will almost certainly introduce a dependency on CO-RE and BTF debug symbols [3], as we are going to have to read from the task struct to create those executables in auraed.

References

[1] https://aurae.io/
[2] https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
[3] https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html

@JeroenSoeters added the eBPF and Subsystem: Observe labels on Feb 12, 2023
@krisnova (Contributor)

In my opinion we should begin with proposal 1: enrich every instrumentation with cgroup information.

Additionally, I believe that the auraed daemon should try to intercept every syscall__execve function with a kprobe; however, we should not depend on this mechanism if at all possible.

The guiding principle I would like to push the project towards is:

All workloads are cgroups.

Essentially, what I am saying is that any workload on a host running auraed as pid 1 should have a strong guarantee that every process (and subsequent nested processes) will be instrumented. If a process cannot be instrumented, we shouldn't schedule it.

As we manage executables, cells, pods, VMs, etc., we should always have a cgroup associated with the workload, even if just to manage an empty auraed with meta information and a socket connection.

Reminding ourselves that Aurae intends to manage every process on a host, we have unveiled another guiding principle:

All processes belong to Aurae.

I believe that with these two guiding principles we can clearly see that the safest way to manage a host is to surface the cgroup information and nspid detail with every instrumentation.

If we can guarantee that every workload has a cgroup, we can guarantee that we can map back to the original workload.

Decision

Go with proposal 1, and let's start building the instrumentation metadata that is common to all instrumentation, as well as a system in Rust to do the association at runtime.

Cgroups are the parent feature we want to be able to trust to tie everything together.
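
To make the decision concrete, the common metadata could look something like the following. Field names are illustrative assumptions, not the final API:

```rust
// Sketch of the metadata we would attach to every instrumentation event.
#[derive(Clone, Debug)]
pub struct InstrumentationMeta {
    /// From bpf_get_current_cgroup_id(); resolved to a cgroup path (and thus a
    /// workload) in userspace.
    pub cgroup_id: u64,
    /// Host-level PID/TGID, e.g. from bpf_get_current_pid_tgid().
    pub host_pid: u32,
    pub host_tgid: u32,
    /// Namespaced PID, populated when the workload has unshared its PID namespace.
    pub nspid: Option<u32>,
}
```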

@JeroenSoeters (Contributor, Author)

Update 2/13/2023

The bad news:

I have been looking into the PID mapping and unfortunately we won't be able to leverage the long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size) helper function. The only way to get dev and ino is by querying /proc. It doesn't make sense to query /proc in userspace to look up dev and ino just to pass them to the kernel (via a BPF map or something) so the eBPF program can then look up the nspid. At that point we can just get the nspid from /proc ourselves.
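
For completeness, this is the circularity: the dev/ino pair the helper wants can only be obtained by stat'ing the PID namespace link in /proc, at which point we could read NSpid directly. A sketch, not auraed code:

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;

// Sketch: the (dev, ino) pair that bpf_get_ns_current_pid_tgid() expects comes
// from stat'ing /proc/<pid>/ns/pid -- i.e. we are already querying procfs anyway.
fn pidns_dev_ino(pid: u32) -> io::Result<(u64, u64)> {
    let meta = fs::metadata(format!("/proc/{pid}/ns/pid"))?;
    Ok((meta.dev(), meta.ino()))
}
```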

The good news:

We will likely be able to do this fairly trivially by adding a pair of tracepoint probes to monitor process creation and exit: task/new_task and sched/sched_process_exit. I will create a separate issue to track this work.
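
For illustration, attaching those two tracepoints from userspace with aya might look roughly like this. Program names and error handling are assumptions, not the actual auraed layout; note the kernel spells the creation tracepoint task:task_newtask:

```rust
use anyhow::Result;
use aya::programs::TracePoint;
use aya::Bpf;

// Sketch: attach the process-lifecycle tracepoints from the daemon side.
// Program names are illustrative; they must match the names in the eBPF object.
fn attach_lifecycle_probes(bpf: &mut Bpf) -> Result<()> {
    // Process creation (referred to as task/new_task above).
    let new_task: &mut TracePoint = bpf.program_mut("task_newtask").unwrap().try_into()?;
    new_task.load()?;
    new_task.attach("task", "task_newtask")?;

    // Process exit.
    let exited: &mut TracePoint = bpf.program_mut("sched_process_exit").unwrap().try_into()?;
    exited.load()?;
    exited.attach("sched", "sched_process_exit")?;

    Ok(())
}
```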

@krisnova (Contributor)

Can you please reference the authentication work if we are hooking into task/new_task? If we are instrumenting all new tasks, we should consider authenticating them as well.

@JeroenSoeters (Contributor, Author)

Yep, will make sure to mention this in the new issue. Will share the link in the #ebpf-kernel channel when I've typed it up.

@JeroenSoeters (Contributor, Author)

Closing this one as this is all implemented now. As we didn't end up hooking task/new_task, I suggest we track the authentication work in the other issue: #401 @krisnova
