ESP-Now BATMAN/ELP progress

Over the last few weeks I've been tinkering with ESP-Now and have implemented an interpretation of BATMAN's 'echo location protocol' ELP, as set out here.

As their documents say at the top, this is an old version, but for my purposes it will suffice and I'm not looking to implement something compatible with BATMAN as deployed on other platforms.

I'd definitely class what I've done as an interpretation rather than an implementation. While the documentation suggests ESP-Now can generate broadcasts, which BATMAN specifies for ELP, I missed this on first reading (it's mentioned once in passing) and worked around this. The model for ESP-Now communication is between pre-defined peers so baking in management of these peers as part of my take on ELP is not lost effort. Yet.

In BATMAN each thing on the network is a node. If it's actively routing traffic with BATMAN it is an originator and if it is reachable in one hop it is a neighbour.

ESP-Now's concept of peers maps fairly directly to BATMAN's concept of neighbours. However you can't send data to an arbitrary node without adding it as a peer first. Peers are referred to by their primary MAC address, much like in BATMAN. You can have a maximum of 20 peers unless you use encryption when it's reduced to 10. Which is why making management of ESP-Now peers part of my interpretation of ELP, rather than just broadcasting, isn't a bad thing. You can trivially send packets to every peer and have the ESP-Now library do the management of that process for you, but it's not a broadcast and won't be sent to non-peers.

Something ESP-Now adds that isn't in BATMANs broadcast model for ELP is you get delivery confirmations from peers. Normally BATMAN measures the quality of transmission to first hops on the network with its routing protocol OGM as these packets are echoed back to the sender. With ESP-Now unicast I have been able to get a measure of transmission quality (TQ) from ELP alone.

What ESP-Now doesn't handle, and BATMAN achieves with broadcasts, is initial discovery of peers, there is an expectation this is hardcoded or done via other means of pairing. So I'm using the standard Wi-Fi SoftAP and AP scanning methods from the ESP8266 Wi-Fi libraries. The scan looks for any SSIDs that match what it's looking for and attempt to add the device as a peer if it isn't already. Once a node has some peers it switches its SoftAP off, but among a group of peers at least one leaves the SoftAP on so that group of peers can be found. I might replace this mechanism with something that involves ESP-Now broadcasts, but at the moment it works quite nicely. It's worth noting that the BSSID of an ESP8266 is not it's primary MAC address. For the BSSID they use a locally administered variant of the primary MAC address, so you have to unset the two least significant bits of the second octet to derive the primary MAC address.

As ELP includes a list of neighbours in the packet, I've a mesh where each node is an originator that discovers new neighbours, then is notified of their neighbours. Often some of these are one-hop reachable too so the mesh forms very nicely. The measure of TQ I have works very well to ensure only reachable neighbours are advertised through ELP, so each node only has believably reachable peers added and any that aren't can be aged out.

I also spent a lot of time building a basic user interface into this on the serial console, which you can see in the screenshot. When you're making your own network protocol you need to also make your own tools for monitoring and troubleshooting it. Being able to look at the peer table and logs of every node has made this much less of a head scatching task than it might have been. This UI console code is all wrapped up in conditional compiler directives so I plan on leaving it in place in the code long term. I do have some aggro though as the Windows driver for the USB serial chipset used on the WeMos D1 mini appears to be flaky. At some point it stops sending data to the ESP so you can't control the UI even though you can still see it. This doesn't happen on Linux so I'm pretty sure it's nothing to do with my code.

It's got to the point where I've had up to 12 nodes in a network and they've been up for 100+ hours managing neighbours coming and going without complaint or failure that I can see.

Emboldened by this I did a field test during a game over the weekend. I had four nodes with PIR motion detectors sending packets when triggered to a bare node connected to a laptop I was using as a prop.

This was a total failure. Not massively surprising as I hacked the code together from existing hardware over lunch. It did work when connected to the laptop but when powered by USB charge banks the nodes couldn't even discover each other, which I know works well. As I was mid-game I had no time to troubleshoot.

Assuming that failure was something trivial, I now need to move on to implementing OGM, which will allow the mesh to do proper multi-hop routing. Right now it can only deal with routing two hops to 'peers of peers' and it needs to be able to route end-to-end across the whole mesh. I think I've laid a solid foundation for this with my interpretation of ELP so I'm hoping this won't be too hard work. OGM produces a full routing table to all nodes with a composite TQ to each one made from the TQ of each hop. Once this is done it's only a small step to being able to forward arbitrary packets across the whole mesh.

No comments: