Debugging the DDlog version of ovn-northd

This document gives some tips for debugging correctness issues in the DDlog implementation of ovn-northd. To keep things conrete, we assume here that a failure occurred in one of the test cases in ovn-e2e.at, but the same methodology applies in any other environment. If none of these methods helps, ask for assistance or submit a bug report.

Before trying these methods, you may want to check the northd log file, tests/testsuite.dir/<test_number>/northd/ovn-northd.log for error messages that might explain the failure.

Compare OVSDB tables generated by DDlog vs C

The first thing I typically want to check when ovn-northd-ddlog does not behave as expected is how the OVSDB tables computed by DDlog differ from what the C implementation produces. Fortunately, all the infrastructure needed to do this already exists in OVN.

First, let’s modify the test script, e.g., ovn.at to dump the contents of OVSDB right before the failure. The most common issue is a difference between the logical flows generated by the two implementations. To make it easy to compare the generated flows, make sure that the test contains something like this in the right place:

ovn-sbctl dump-flows > sbflows
AT_CAPTURE_FILE([sbflows])

The first line above dumps the OVN logical flow table to a file named sbflows. The second line ensures that, if the test fails, sbflows get logged to testsuite.log. That is not particularly useful for us right now, but it means that if someone later submits a bug report, that’s one more piece of data that we don’t have to ask for them to submit along with it.

Next, we want to run the test twice, with the C and DDlog versions of northd, e.g., make check -j6 TESTSUITEFLAGS="-d 111 112" if 111 and 112 are the C and DDlog versions of the same test. The -d in this command line makes the test driver keep test directories around even for tests that succeed, since by default it deletes them.

Now you can look at sbflows in each test log directory. The ovn-northd-ddlog developers have gone to some trouble to make the DDlog flows as similar as possible to the C ones, right down to white space and other formatting. Thus, the DDlog output is often identical to C aside from logical datapath UUIDs.

Usually, this means that one can get informative results by running diff, e.g.:

diff -u tests/testsuite.dir/111/sbflows tests/testsuite.dir/111/sbflows

Running the input through the uuidfilt utility from OVS will generally get rid of the logical datapath UUID differences as well:

diff -u <(uuidfilt tests/testsuite.dir/111/sbflows) <(uuidfilt tests/testsuite.dir/111/sbflows)

If there are nontrivial differences, this often identifies your bug.

Often, once you have identified the difference between the two OVSDB dumps, this will immediately lead you to the root cause of the bug, but if you are not this lucky then the next method may help.

Record and replay DDlog execution

DDlog offers a way to record all input table updates throughout the execution of northd and replay them against DDlog running as a standalone executable without all other OVN components. This has two advantages. First, this allows one to easily tweak the inputs, e.g. to simplify the test scenario. Second, the recorded execution can be easily replayed anywhere without having to reproduce your OVN setup.

Use the --ddlog-record option to record updates, e.g. --ddlog-record=replay.dat to record to replay.dat. (OVN’s built-in tests automatically do this.) The file contains the log of transactions in the DDlog command format (see https://github.com/vmware/differential-datalog/blob/master/doc/command_reference/command_reference.md).

To replay the log, you will need the standalone DDlog executable. By default, the build system does not compile this program, because it increases the already long Rust compilation time. To build it, add NORTHD_CLI=1 to the make command line, e.g. make NORTHD_CLI=1.

You can modify the log before replaying it, e.g., adding dump <table> commands to dump the contents of relations at various points during execution. The <table> name must be fully qualified based on the file in which it is declared, e.g. OVN_Southbound::<table> for southbound tables or lrouter::<table>. for lrouter.dl. You can also use dump without an argument to dump the contents of all tables.

The following command replays the log generated by OVN test number 112 and dumps the output of DDlog to replay.dump:

northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/112/northd/replay.dat > replay.dump

Or, to dump just the table contents following the run, without having to edit replay.dat:

(cat tests/testsuite.dir/112/northd/replay.dat; echo 'dump;') | northd/ovn_northd_ddlog/target/release/ovn_northd_cli --no-delta --no-init-snapshot > replay.dump

Depending on whether and how you installed OVS and OVN, you might need to point LD_LIBRARY_PATH to library build directories to get the CLI to run, e.g.:

export LD_LIBRARY_PATH=$HOME/ovn/_build/lib/.libs:$HOME/ovs/_build/lib/.libs

Note

The replay output may be less informative than you expect because DDlog does not, by default, keep around enough information to include input relation and intermediate relations in the output. These relations are often critical to understanding what is going on. To include them, add the options --output-internal-relations --output-input-relations=In_ to DDLOG_EXTRA_FLAGS for building ovn-northd-ddlog. For example, configure as:

./configure DDLOG_EXTRA_FLAGS='--output-internal-relations --output-input-relations=In_'

Debugging by Logging

One limitation of the previous method is that it allows one to inspect inputs and outputs of a rule, but not the (sometimes fairly complicated) computation that goes on inside the rule. You can of course break up the rule into several rules and dump the intermediate outputs.

There are at least two alternatives for generating log messages. First, you can write rules to add strings to the Warning relation declared in ovn_north.dl. Code in ovn-northd-ddlog.c will log any given string in this relation just once, when it is first added to the relation. (If it is removed from the relation and then added back later, it will be logged again.)

Second, you can call using the warn() function declared in ovn.dl from a DDlog rule. It’s not straightforward to know exactly when this function will be called, like it would be in an imperative language like C, since DDlog is a declarative language where the user doesn’t directly control when rules are triggered. You might, for example, see the rule being triggered multiple times with the same input. Nevertheless, this debugging technique is useful in practice.

You will find many examples of the use of Warning and warn in ovn_northd.dl, where it is frequently used to report non-critical errors.

Debugging panics

TODO: update these instructions as DDlog’s internal handling of panic’s is improved.

DDlog is a safe language, so DDlog programs normally do not crash, except for the following three cases:

  • A panic in a Rust function imported to DDlog as extern function.

  • A panic in a C function imported to DDlog as extern function.

  • A bug in the DDlog runtime or libraries.

Below we walk through the steps involved in debugging such failures. In this scenario, there is an array-index-out-of-bounds error in the ovn_scan_static_dynamic_ip6() function, which is written in Rust and imported to DDlog as an extern function. When invoked from a DDlog rule, this function causes a panic in one of DDlog worker threads.

Step 1: Check for error messages in the northd log. A panic can generally lead to unpredictable outcomes, so one cannot count on a clean error message showing up in the log (Other outcomes include crashing the entire process and even deadlocks. We are working to eliminate the latter possibility). In this case we are lucky to observe a bunch of error messages like the following in the northd log:

2019-09-23T16:23:24.549Z|00011|ovn_northd|ERR|ddlog_transaction_commit(): error: failed to receive flush ack message from timely dataflow thread

These messages are telling us that something is broken inside the DDlog runtime.

Step 2: Record and replay the failing scenario. We use DDlog’s record/replay capabilities (see above) to capture the faulty scenario. We replay the recorded trace:

northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat

This generates a bunch of output ending with:

thread 'worker thread 2' panicked at 'index out of bounds: the len is 1 but the index is 1', /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/slice/mod.rs:2681:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.

We re-run the CLI again with backtrace enabled (as suggested by the error message):

RUST_BACKTRACE=1 northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat

This finally yields the following stack trace, which suggests array bound violation in ovn_scan_static_dynamic_ip6:

0: backtrace::backtrace::libunwind::trace
          at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29  10: core::panicking::panic_bounds_check
          at src/libcore/panicking.rs:61
[SKIPPED]
11: ovn_northd_ddlog::__ovn::ovn_scan_static_dynamic_ip6
12: ovn_northd_ddlog::prog::__f
[SKIPPED]

Finally, looking at the source code of ovn_scan_static_dynamic_ip6, we identify the following line, containing an unsafe array indexing operator, as the culprit:

ovn_ipv6_parse(&f[1].to_string())

Clean build

Occasionally it’s desirable to a full and complete build of the DDlog-generated code. To trigger that, delete the generated ovn_northd_ddlog directory and the ddlog.stamp witness file, like this:

rm -rf northd/ovn_northd_ddlog northd/ddlog.stamp

or:

make clean-ddlog

Submitting a bug report

If you are having trouble with DDlog and the above methods do not help, please submit a bug report to bugs@openvswitch.org, CC ryzhyk@gmail.com.

In addition to problem description, please provide as many of the following as possible:

  • Are you running with the right DDlog for the version of OVN? OVN and DDlog are both evolving and OVN needs to build against a specific version of DDlog.

  • replay.dat file generated as described above

  • Logs: ovn-northd.log and testsuite.log, if you are running the OVN test suite