Is this a stand up fight...

I may have found my first bug in a major open source project for years. This has made me unexpectedly excited.

My current project uses Ciseco's RFu328 microcontroller/radio combo and a GPS. The idea is for the devices to talk to each other over mesh radio so that they all know where they are relative to one another.

I've built one piece of this system into a small project box and had nothing but trouble with it. Over Easter I was in Brussels and had a few spare days so I took what I needed to work on it with me.

The failure condition was that the microcontroller, which is Arduino compatible, would hang after running for a while. Sometimes there were problems even programming it. As my sketch got bigger and more complicated, the issue became more pronounced.

Thinking I might have a hardware problem, I put the code on a genuine Uno R3 and got much the same behaviour, although it was slightly less flaky.

I put in a watchdog timer to reboot the device if it hung, and even this didn't work. The device would restart, then hang again almost immediately.

After wading through my code and chopping it out piece by piece, I narrowed the problem down to the LLAP serial library I'd been using. So I swapped to a different one, which behaved the same way despite being a massive refactor and almost complete rewrite of the original.

Hours of head scratching later, I had the culprit. It is almost certainly the Arduino flush() function. Comment out the call in the library and the problem evaporates. The library uses flush() in a justifiable, sensible fashion: to make sure that radio output is sent as soon as possible, in a single packet.

Searching online suggests that a race condition seen in older versions of the Arduino IDE caused hangs like this. Now I've got to boil this down to its bare bones and submit a decent bug report with reproducible examples.
