ESP-Now BATMAN scaling work

A while back I built a little test rig to check how my mesh network scales.

A test setup of two or three nodes simply doesn't show the issues you get when there are twenty or more all talking.

Now that I've managed to rework my code into an easily used library, I'm inching towards releasing it, and that means lots of testing so I can be sure it's not a complete dud.

The test rig allows me to have twenty-two nodes on my desk: eight WeMos D1 minis on USB and fourteen ESP-01s in the rig. Programming all of these is quite time consuming, so I tweaked the code on the USB-connected nodes and, when I felt it had reached an interesting point, rolled it out to the ESP-01s.

When I first built the rig I was getting a lot of 'collisions', where my code detected another packet arriving while it was still working on the last one. The detection works by setting and un-setting a flag that indicates 'node is currently doing something else'. I quickly found I'd made a simple coding error, not always un-setting the flag, and fixing this made all these 'collisions' go away.
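As a rough illustration of that kind of busy flag, here is a minimal sketch of one way to make sure the flag is always un-set, using a small guard object that clears it on every exit path. All the names here (`ReceiveState`, `BusyGuard`, `handlePacket`) are made up for the example and are not taken from the library, and it is written as plain C++ rather than against the ESP-Now callback signature.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state for illustration only, not the library's real structure.
struct ReceiveState {
    bool busy = false;        // 'node is currently doing something else'
    uint32_t collisions = 0;  // packets that arrived while the flag was set
};

// Sets the flag when created and clears it when destroyed, so an early
// return from the handler cannot leave the flag stuck on.
class BusyGuard {
public:
    explicit BusyGuard(ReceiveState& s) : state_(s) { state_.busy = true; }
    ~BusyGuard() { state_.busy = false; }
    BusyGuard(const BusyGuard&) = delete;
    BusyGuard& operator=(const BusyGuard&) = delete;
private:
    ReceiveState& state_;
};

// Returns true if the packet was processed, false if it was dropped.
bool handlePacket(ReceiveState& state, const uint8_t* data, std::size_t len) {
    if (state.busy) {          // still working on the previous packet
        ++state.collisions;
        return false;
    }
    BusyGuard guard(state);    // flag set here
    if (data == nullptr || len == 0) {
        return false;          // early exit: the guard still clears the flag
    }
    // ... real packet processing would go here ...
    return true;               // normal exit: the guard clears the flag
}
```

A forgotten un-set on one return path is exactly the sort of bug this pattern avoids. On real hardware the flag may also need to be safe against the receive callback interrupting other code, which this sketch does not attempt to handle.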

What I did see, however, was a LOT of failures to send or forward packets once the mesh got to about 20 nodes. ESP-Now includes an acknowledgement so you can tell whether a packet has been received, and a missing ACK is one thing, but a complete failure to send just doesn't happen at a low node count.

ESP-Now does not use broadcasts when you send packets to all of a node's neighbours/peers; it iterates through them. There's no documentation on how this is done, so I'll need to get a WiFi sniffer out to work out the exact behaviour.

Regardless, this means the number of packets in the air goes up with the square of the number of neighbours when neighbours forward packets to all their neighbours, even though each one only forwards it once. This makes it believable that there would be the odd in-air collision, and this is reflected in some missing ACKs, but I was seeing far too many failures to send.
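To put rough numbers on that, here is a small sketch of the arithmetic. It assumes a fully connected mesh where every node has every other node as a peer, the originator sends one unicast frame per peer, and each peer then forwards once to all of its own peers (including the node it heard the packet from). Those are simplifying assumptions for the example, and ACKs and any retries are not counted. The function name is made up.

```cpp
#include <cstdint>

// Unicast frames needed to flood one packet through a fully connected mesh,
// under the assumptions stated above. Hypothetical helper for illustration.
uint32_t unicastFramesPerFlood(uint32_t nodes) {
    if (nodes < 2) {
        return 0;  // nobody to send to
    }
    uint32_t neighbours = nodes - 1;
    uint32_t fromOriginator = neighbours;               // one frame per peer
    uint32_t fromForwarders = neighbours * neighbours;  // each peer forwards once to all its peers
    return fromOriginator + fromForwarders;
}
```

Under these assumptions three nodes need 6 frames per flooded packet, twenty nodes need 380 and the twenty-two on my desk need 462, whereas a true broadcast would need only one transmission per node.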

I started coding in my own CSMA/CD re-transmission algorithm but had no luck reducing the failures and there's no documentation about what a complete failure to send means.

Many hours of fiddling around got me nowhere until it dawned on me that this was only happening with the smallest packets; larger ones were sent reliably, even if they were not always received. As the code for the small packets is identical to that for the other packet types, in a frustrated random guess I increased the packet size and it cured the problem.

I can only assume the ESP-Now library does its own CSMA/CD re-transmission and this breaks down with small packet sizes. I'll have to change my code so it pads packets out to some minimum size, which appears to be an ESP-Now payload of about 30 bytes. A payload of 18 bytes fails consistently once you've got 20 nodes.
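The padding itself could look something like the sketch below. The 30 byte minimum is only the figure observed above, not a documented limit, and `kMinPayload` and `padPayload` are names invented for the example. It pads with zero bytes and leaves larger payloads untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed minimum, taken from the observation above rather than any documentation.
constexpr std::size_t kMinPayload = 30;

// Copies the payload and zero-pads it up to kMinPayload if it is shorter.
// Hypothetical helper; payloads already at or above the minimum are unchanged.
std::vector<uint8_t> padPayload(const uint8_t* data, std::size_t len) {
    std::vector<uint8_t> out(data, data + len);
    if (out.size() < kMinPayload) {
        out.resize(kMinPayload, 0);  // new bytes are filled with zeros
    }
    return out;
}
```

The receiving side then needs some way to know the real length, for example a length field in the packet header, so it can ignore the padding.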

With the boxed outdoor nodes I can add another ten to the test. Once I've done the code changes I'll add them in and see how things behave. This gets me very close to my target mesh size.

After this I need to check the routing algorithm again, as it seems to break down when there are lots of valid choices, flapping badly between them. Likewise the time sync protocol has got messy and is doing a poor job of syncing above about eight nodes. I have tried to make it discriminate and pick the sync packet that's traversed the fewest hops, but clearly this is not working.
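For reference, the 'fewest hops wins' idea can be sketched as below. This is not the library's code, which evidently still has a problem somewhere. It assumes each sync packet carries a round counter and a hop count, both of which are assumptions for the example, as are all the names. A packet from a newer round always replaces the current choice, and within a round only a packet with strictly fewer hops does. Counter wrap-around is ignored.

```cpp
#include <cstdint>

// Hypothetical sync packet fields, for illustration only.
struct SyncPacket {
    uint32_t round;        // assumed counter identifying the sync round
    uint8_t hops;          // hops traversed since the time source
    uint32_t timestampMs;  // time carried by the packet
};

struct SyncState {
    bool haveSync = false;
    SyncPacket best{};
};

// Returns true if the candidate replaced the currently chosen sync packet.
bool offerSync(SyncState& s, const SyncPacket& candidate) {
    bool newerRound = !s.haveSync || candidate.round > s.best.round;
    bool fewerHops = s.haveSync &&
                     candidate.round == s.best.round &&
                     candidate.hops < s.best.hops;
    if (newerRound || fewerHops) {
        s.best = candidate;
        s.haveSync = true;
        return true;
    }
    return false;  // older round, or same round with no fewer hops
}
```

Requiring strictly fewer hops means two packets with equal hop counts never displace each other, which avoids one obvious source of oscillation.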

This is all quite time consuming to test, but I can feel progress being made and I want to ensure that when I release the library it stands up to scrutiny.
