BPF Iterator: Retrieving Kernel Data With Flexibility and Efficiency

The BPF iterator enables high-performance, in-kernel data retrieval and aggregation. In this blog post, we talk about the motivation behind developing the bpf iterator tool and using it to retrieve kernel data into user space flexibly and efficiently.

Why BPF Iterator

There are a few existing ways to dump kernel data into user space. The most popular one is the /proc file system. For example, 'cat /proc/net/tcp6' dumps all tcp6 sockets in the system, and 'cat /proc/net/netlink' dumps all netlink sockets in the system. However, the output format tends to be fixed, and users who want more information about these sockets have to patch the kernel, which often takes a long time to land upstream and reach a release. The same is true for popular tools like ss, where any additional information needs a kernel patch.

To work around this, the drgn tool is often used to dig out kernel data without any kernel change. However, the main drawback of drgn is performance, as it cannot do the pointer chasing inside the kernel. In addition, drgn may produce wrong results if a pointer becomes invalid inside the kernel.

The BPF iterator solves these problems: it provides flexibility in what to collect, requires only a one-time kernel change for a particular data structure, and does all the pointer chasing inside the kernel. The flexibility comes from using bpf programs. Correctness is ensured by doing the pointer chasing inside the kernel under proper reference counting or locking protection. In its current state, the iterator covers only a small portion of the data structures in the kernel.

How to Use BPF Iterator

Kernel bpf selftests is a great place to illustrate how to use the BPF iterator in user space. Typically you need to implement a bpf program first.

The selftests include several example bpf iterator programs. Let us take a look at one of them, bpf_iter_task_file.c.

SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        __u32 fd = ctx->fd;
        struct file *file = ctx->file;
        ...
}
        

In the above example, the section name (SEC), iter/task_file, indicates that the program is a BPF iterator program to iterate all files from all tasks. The context to the program is bpf_iter__task_file. You can find the definition of bpf_iter__task_file in vmlinux.h.

struct bpf_iter__task_file {
        union {
                struct bpf_iter_meta *meta;
        };
        union {
                struct task_struct *task;
        };
        u32 fd;
        union {
                struct file *file;
        };
};
        

In the above code, the 'meta' field contains the metadata, which is the same for all bpf iterator programs. The rest of the fields are specific to each iterator. For example, for task_file iterators, the kernel provides the 'task', 'fd' and 'file' field values. The 'task' and 'file' objects are reference counted, so they won't go away while the bpf program runs.

In addition to the bpf program, a user space part is needed to trigger the bpf program and collect the data. The selftest bpf_iter.c provides a sample of the user space part of a bpf iterator. The following is a typical sequence:

  • load the bpf program into kernel
  • create a bpf_link with the bpf program
  • get a bpf_iter_fd from bpf_link
  • read(bpf_iter_fd) until no data is available
  • close(bpf_iter_fd)
  • if the data needs to be read again, get a new bpf_iter_fd and repeat the read
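
The read step in the sequence above can be sketched in plain C. This is a minimal sketch, not the selftest code: the helper name drain_iter_fd is hypothetical, and it assumes the caller has already obtained bpf_iter_fd (with libbpf this is typically done with bpf_program__attach_iter(), bpf_link__fd() and bpf_iter_create()). The helper only relies on read() returning 0 once no more data is available.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read from iter_fd until read() reports that no
 * more data is available (returns 0), storing the bytes in out.
 *
 * Returns the number of bytes stored, or -1 on a read error. Reading
 * stops early if out fills up. out is always NUL-terminated on success.
 */
static ssize_t drain_iter_fd(int iter_fd, char *out, size_t cap)
{
    size_t total = 0;

    assert(out != NULL && cap > 0);

    while (total < cap - 1) {
        ssize_t n = read(iter_fd, out + total, cap - 1 - total);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            return -1;      /* real error, let the caller inspect errno */
        }
        if (n == 0)
            break;          /* no more data from the iterator */
        total += (size_t)n;
    }

    out[total] = '\0';
    return (ssize_t)total;
}
```

After the helper returns, the caller closes bpf_iter_fd. Because an iterator fd is consumed by reading it, a second pass over the data needs a fresh fd created from the same bpf_link.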

The BPF iterator uses the kernel seq_file to pass data to user space. The data can be a formatted string or raw data. In the case of a formatted string, you can use the bpftool iter subcommand to create and pin a bpf iterator through bpf_link to a path in the BPF File System (bpffs). You can then do a 'cat <path>' to print the results, similar to 'cat /proc/net/netlink'.

For example, you can use the following command to pin the bpf program in the bpf_iter_ipv6_route.o object file to the /sys/fs/bpf/my_route path:

$ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route

And then print out the results using the following command:

$ cat /sys/fs/bpf/my_route

How to Implement a BPF Iterator in the Kernel

To implement a bpf iterator in the kernel, the developer must fill in the following key data structure, which is defined in the bpf.h file.

struct bpf_iter_reg {
          const char *target;
          bpf_iter_attach_target_t attach_target;
          bpf_iter_detach_target_t detach_target;
          bpf_iter_show_fdinfo_t show_fdinfo;
          bpf_iter_fill_link_info_t fill_link_info;
          bpf_iter_get_func_proto_t get_func_proto;
          u32 ctx_arg_info_size;
          u32 feature;
          struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
          const struct bpf_iter_seq_info *seq_info;
};
        

After filling the data structure fields, call 'bpf_iter_reg_target()' to register the iterator to the main bpf iterator subsystem.

The following is the breakdown for each field in struct bpf_iter_reg.

| Fields | Description |
| --- | --- |
| target | Specifies the name of the bpf iterator, for example 'bpf_map' or 'bpf_map_elem'. The name must be different from the other bpf_iter target names in the kernel. |
| attach_target and detach_target | Allow for target specific link_create actions, since some targets may need special processing. Called during the user space link_create stage. |
| show_fdinfo and fill_link_info | Called to fill in target specific information when a user tries to get the link info associated with the iterator. |
| get_func_proto | Permits a bpf iterator to access bpf helpers specific to the iterator. |
| ctx_arg_info_size and ctx_arg_info | Specify the verifier states for the bpf program arguments associated with the bpf iterator. |
| feature | Specifies certain action requests in the kernel bpf iterator infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means that the kernel function cond_resched() is called to avoid other kernel subsystems (e.g., rcu) misbehaving. |
| seq_info | Specifies the set of seq operations for the bpf iterator and the helpers to initialize and free the private data for the corresponding seq_file. |

For a complete example, see the implementation of the task_vma bpf iterator in the kernel.
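
Since the BPF iterator passes data through the kernel seq_file, the seq operations an iterator supplies follow the usual seq_file callback shape: start, next, stop, and show. The following is a user space model of that callback order, not kernel code. Every name in it (model_seq_ops, model_seq_walk, int_array and the arr_* callbacks) is hypothetical, and it ignores details such as a walk being split across several read() calls. In a real iterator, the show step is where the bpf program would run on one element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user space stand-in for a set of seq operations. */
struct model_seq_ops {
    void *(*start)(void *priv, size_t *pos);
    void *(*next)(void *priv, void *v, size_t *pos);
    void (*stop)(void *priv, void *v);
    int (*show)(void *priv, void *v);
};

/* Drive the callbacks in the order start, show, next, ..., stop.
 * Returns how many elements were shown. */
static int model_seq_walk(const struct model_seq_ops *ops, void *priv)
{
    size_t pos = 0;
    int shown = 0;
    void *v;

    assert(ops && ops->start && ops->next && ops->stop && ops->show);

    v = ops->start(priv, &pos);
    while (v) {
        if (ops->show(priv, v) != 0)
            break;                      /* show reported an error */
        shown++;
        v = ops->next(priv, v, &pos);   /* NULL ends the walk */
    }
    ops->stop(priv, v);
    return shown;
}

/* Example target: an array of ints plus a running sum. */
struct int_array {
    int *items;
    size_t len;
    long sum;
};

static void *arr_start(void *priv, size_t *pos)
{
    struct int_array *a = priv;

    return *pos < a->len ? &a->items[*pos] : NULL;
}

static void *arr_next(void *priv, void *v, size_t *pos)
{
    struct int_array *a = priv;

    (void)v;
    ++*pos;
    return *pos < a->len ? &a->items[*pos] : NULL;
}

static void arr_stop(void *priv, void *v)
{
    (void)priv;     /* nothing to release in this model */
    (void)v;
}

static int arr_show(void *priv, void *v)
{
    struct int_array *a = priv;

    a->sum += *(int *)v;    /* stands in for running the bpf program */
    return 0;
}

static const struct model_seq_ops arr_ops = {
    .start = arr_start,
    .next = arr_next,
    .stop = arr_stop,
    .show = arr_show,
};
```

Calling model_seq_walk(&arr_ops, &a) on an int_array visits each element once and accumulates the sum in a.sum. In the kernel, the start and stop callbacks are also where a real iterator would take and release the locks or references that keep each element valid.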

Existing Use Cases for BPF Iterator

The following lists the bpf iterators available in the latest upstream kernel grouped by bpf program section name:

  • iter/bpf_prog
  • iter/bpf_map
  • iter/bpf_map_elem (covering hash, percpu hash, lru hash, percpu lru hash, array, percpu array)
  • iter/task
  • iter/task_file
  • iter/task_vma
  • iter/bpf_sk_storage
  • iter/sock_map
  • iter/tcp (both tcp4 and tcp6)
  • iter/udp (both udp4 and udp6)
  • iter/ipv6_route
  • iter/netlink
  • iter/unix

At Meta, we use bpftool, which relies on the bpf task_file iterator to display the pids that reference a particular bpf program, map, or link.

For example, the 'sudo bpftool prog' command displays the following output:

  1254794: kprobe  name trace_connect_v  tag b81e89cf4f522e62  gpl run_time_ns 27119 run_cnt 30
          loaded_at 2022-02-13T10:54:46-0800  uid 0
          xlated 640B  jited 374B  memlock 4096B  map_ids 732740,732739
          btf_id 1163033
          pids python3.8(443701)
  1254795: kprobe  name trace_connect_v  tag a12d26e14608b148  gpl run_time_ns 1662739 run_cnt 2552
          loaded_at 2022-02-13T10:54:46-0800  uid 0
          xlated 648B  jited 382B  memlock 4096B  map_ids 732740,732738
          btf_id 1163033
          pids python3.8(443701)
        

Other internal tools use bpf iterators as well: fbflow uses the bpf_sk_storage iterator, and dyno uses the task iterator (task_iter). In dyno, the task iterator performs significantly better than the old approach, which collected taskstats for all tasks over netlink.

Next Steps

There are upstream discussions about implementing a bpf iterator for bpf_links. We have also seen people implement a bpf iterator for mounts (not upstreamed yet). As people discover more use cases, we expect more users to implement bpf iterators in the kernel.
