ESP-Now BATMAN/OGM progress

More dull text to look at, but this is the output from one of my mesh nodes now that I have the OGM routing protocol working.

I was right in thinking that getting ELP working nicely would let me implement an interpretation of OGM pretty smoothly. Again, this is not an exact implementation as my use case is different, but it stays quite close to the documentation. I'm using the TQ (transmission quality) measure from BATMAN IV, not the throughput model from BATMAN V.

Now, even with sharing of neighbours under ELP disabled, the whole network knows about all the nodes, whether they're reachable in one hop or not. If I turn neighbour sharing on again in ELP, the mesh should build more quickly and with more redundant links.

The routing table appears to pick a good route to each node, including choosing a two-hop route where the one-hop route is unreliable.
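To illustrate how a two-hop route can beat a poor one-hop route, here is a minimal Python sketch. It is not the code running on the nodes. It assumes TQ is a 0-255 value, as in BATMAN IV, and that the TQ of a whole path is the product of the per-hop TQs rescaled back into that range. The function names, node names and sample values are all made up for the example.

```python
TQ_MAX = 255  # assumed full-scale TQ value, as in BATMAN IV


def path_tq(hop_tqs):
    """Combine per-hop TQ values (0-255) into one composite TQ for the path."""
    tq = TQ_MAX
    for hop in hop_tqs:
        tq = tq * hop // TQ_MAX  # integer rescale keeps the result in 0-255
    return tq


def best_route(candidates):
    """Return the (next_hop, composite_tq) pair with the highest composite TQ.

    candidates maps a hypothetical next-hop name to the list of per-hop TQs
    along the path through that next hop.
    """
    scored = {hop: path_tq(tqs) for hop, tqs in candidates.items()}
    next_hop = max(scored, key=scored.get)
    return next_hop, scored[next_hop]


# Made-up numbers: a weak direct link to C versus two strong hops via B.
routes_to_c = {
    "C (direct)": [60],
    "B": [240, 230],
}
print(best_route(routes_to_c))
```

With these invented values the route via B wins, because two good hops still multiply out to a much higher composite TQ than one bad hop.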

The next step is implementing my own 'ping' and 'traceroute' style functions so I can verify this, but because I have already implemented discovery of routes and packet forwarding this should not be overly painful. BATMAN routing is intended to be simple.

Once that's done I'll make as big a network of real nodes as I can manage, which I hope to get up to about twenty-five now that some more Wemos D1 Mini Pro boards have arrived, and see what happens.

ESP-Now BATMAN/ELP progress

Over the last few weeks I've been tinkering with ESP-Now and have implemented an interpretation of BATMAN's 'echo location protocol' ELP, as set out here.

As their documents say at the top, this is an old version, but for my purposes it will suffice and I'm not looking to implement something compatible with BATMAN as deployed on other platforms.

I'd definitely class what I've done as an interpretation rather than an implementation. While the documentation suggests ESP-Now can generate broadcasts, which BATMAN specifies for ELP, I missed this on first reading (it's mentioned once in passing) and worked around this. The model for ESP-Now communication is between pre-defined peers so baking in management of these peers as part of my take on ELP is not lost effort. Yet.

In BATMAN each thing on the network is a node. If it's actively routing traffic with BATMAN it is an originator and if it is reachable in one hop it is a neighbour.

ESP-Now's concept of peers maps fairly directly to BATMAN's concept of neighbours. However, you can't send data to an arbitrary node without adding it as a peer first. Peers are referred to by their primary MAC address, much like in BATMAN. You can have a maximum of 20 peers, reduced to 10 if you use encryption. That limit is why making management of ESP-Now peers part of my interpretation of ELP, rather than just broadcasting, isn't a bad thing. You can trivially send packets to every peer and have the ESP-Now library manage that process for you, but it's not a broadcast and won't be sent to non-peers.

Something ESP-Now adds that isn't in BATMAN's broadcast model for ELP is that you get delivery confirmations from peers. Normally BATMAN measures the quality of transmission to first hops on the network with its routing protocol OGM, as these packets are echoed back to the sender. With ESP-Now unicast I have been able to get a measure of transmission quality (TQ) from ELP alone.
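One way to turn delivery confirmations into a TQ figure is a sliding window of recent results per peer. The Python sketch below shows that idea only and is not necessarily how the nodes calculate it. The class name, the window size and the 0-255 scaling are assumptions for the example.

```python
from collections import deque


class LinkQuality:
    """Track delivery confirmations for one peer over a sliding window."""

    def __init__(self, window=16):
        # True = delivery confirmed, False = delivery failed.
        # Old results fall off the front once the window is full.
        self.results = deque(maxlen=window)

    def record(self, delivered):
        self.results.append(bool(delivered))

    def tq(self):
        """Return TQ scaled to 0-255; 0 until at least one result is recorded."""
        if not self.results:
            return 0
        return 255 * sum(self.results) // len(self.results)


# Made-up run of eight sends to one peer, two of which were not confirmed.
link = LinkQuality(window=8)
for delivered in [True, True, False, True, True, True, False, True]:
    link.record(delivered)
print(link.tq())
```

A short window reacts quickly when a peer disappears, while a longer one smooths out the odd lost packet, so the window size is a trade-off to tune.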

What ESP-Now doesn't handle, and BATMAN achieves with broadcasts, is initial discovery of peers. There is an expectation that this is hardcoded or done via other means of pairing. So I'm using the standard Wi-Fi SoftAP and AP scanning methods from the ESP8266 Wi-Fi libraries. The scan looks for any SSIDs that match what it's looking for and attempts to add the device as a peer if it isn't one already. Once a node has some peers it switches its SoftAP off, but among a group of peers at least one leaves the SoftAP on so that group of peers can be found. I might replace this mechanism with something that involves ESP-Now broadcasts, but at the moment it works quite nicely. It's worth noting that the BSSID of an ESP8266 is not its primary MAC address. For the BSSID they use a locally administered variant of the primary MAC address, so you have to clear the locally administered bit (the second least significant bit of the first octet) to derive the primary MAC address.
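The BSSID to MAC conversion can be sketched in a few lines of Python. This assumes the SoftAP address differs from the primary MAC only in the locally administered bit, which sits at 0x02 in the first octet of a MAC address. The function name and the example address are made up.

```python
LOCALLY_ADMINISTERED = 0x02  # second least significant bit of the first octet


def primary_mac_from_bssid(bssid):
    """Clear the locally administered bit of a colon-separated BSSID string."""
    octets = [int(part, 16) for part in bssid.split(":")]
    if len(octets) != 6:
        raise ValueError("expected six octets, got %r" % bssid)
    octets[0] &= ~LOCALLY_ADMINISTERED & 0xFF
    return ":".join("%02X" % octet for octet in octets)


# Example address is invented for illustration.
print(primary_mac_from_bssid("5E:CF:7F:12:34:56"))
```

An address that already has the bit clear passes through unchanged, so it is safe to apply to every scan result.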

As ELP includes a list of neighbours in the packet, I've a mesh where each node is an originator that discovers new neighbours, then is notified of their neighbours. Often some of these are one-hop reachable too so the mesh forms very nicely. The measure of TQ I have works very well to ensure only reachable neighbours are advertised through ELP, so each node only has believably reachable peers added and any that aren't can be aged out.
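The advertising and ageing behaviour can be pictured as a small table keyed by MAC address. This Python sketch is only an illustration of the idea. The class, the TQ threshold and the timeout are hypothetical values, not the ones used on the nodes.

```python
class NeighbourTable:
    """Hold neighbours with their latest TQ and the time they were last heard."""

    def __init__(self, min_tq=64, max_age=30.0):
        self.min_tq = min_tq    # assumed threshold for "believably reachable"
        self.max_age = max_age  # assumed seconds of silence before removal
        self.entries = {}       # mac -> {"tq": int, "last_seen": float}

    def heard_from(self, mac, tq, now):
        self.entries[mac] = {"tq": tq, "last_seen": now}

    def age_out(self, now):
        """Remove neighbours not heard from within max_age; return their MACs."""
        stale = [mac for mac, entry in self.entries.items()
                 if now - entry["last_seen"] > self.max_age]
        for mac in stale:
            del self.entries[mac]
        return stale

    def advertised(self):
        """Return the neighbours good enough to list in the next ELP packet."""
        return sorted(mac for mac, entry in self.entries.items()
                      if entry["tq"] >= self.min_tq)


# Made-up neighbours: AA is good but goes quiet, BB is weak, CC is good.
table = NeighbourTable()
table.heard_from("AA", tq=200, now=0.0)
table.heard_from("BB", tq=30, now=10.0)
table.heard_from("CC", tq=180, now=20.0)
print(table.advertised())
print(table.age_out(now=35.0))
print(table.advertised())
```

Keeping the weak neighbour in the table but out of the advertised list means it can still recover and be shared later without other nodes wasting a peer slot on it in the meantime.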

I also spent a lot of time building a basic user interface into this on the serial console, which you can see in the screenshot. When you're making your own network protocol you also need to make your own tools for monitoring and troubleshooting it. Being able to look at the peer table and logs of every node has made this much less of a head-scratching task than it might have been. This UI console code is all wrapped up in conditional compiler directives, so I plan on leaving it in place in the code long term. I do have some aggro though, as the Windows driver for the USB serial chipset used on the WeMos D1 mini appears to be flaky. At some point it stops sending data to the ESP, so you can't control the UI even though you can still see it. This doesn't happen on Linux, so I'm pretty sure it's nothing to do with my code.

It's got to the point where I've had up to 12 nodes in a network and they've been up for 100+ hours managing neighbours coming and going without complaint or failure that I can see.

Emboldened by this I did a field test during a game over the weekend. I had four nodes with PIR motion detectors sending packets when triggered to a bare node connected to a laptop I was using as a prop.

This was a total failure. Not massively surprising, as I hacked the code together for existing hardware over lunch. It did work when connected to the laptop, but when powered by USB charge banks the nodes couldn't even discover each other, and discovery is something I know normally works well. As I was mid-game I had no time to troubleshoot.

Assuming that failure was something trivial, I now need to move on to implementing OGM, which will allow the mesh to do proper multi-hop routing. Right now it can only deal with routing two hops to 'peers of peers' and it needs to be able to route end-to-end across the whole mesh. I think I've laid a solid foundation for this with my interpretation of ELP so I'm hoping this won't be too hard work. OGM produces a full routing table to all nodes with a composite TQ to each one made from the TQ of each hop. Once this is done it's only a small step to being able to forward arbitrary packets across the whole mesh.