A couple of years ago some concerns were raised that if a station crashed, froze, or was arbitrarily halted, it could leave NDIO outputs in states that weren't meant to be held for long periods of time. Some examples of this were a valve left open or a compressor running.
To address this, we felt it was necessary to establish some sort of keep alive/reset mechanism that would return the NDIO outputs to known default states if a problem was detected. This would allow application developers to design their apps with this in mind.
Understanding how this was implemented requires knowing that NDIO has three core components: the NDIO processor, the NDIO daemon, and the NDIO driver.
The NDIO processor is a TI MSP430 IO coprocessor. It contains a small IO engine for reading and writing the physical inputs and outputs.
The NDIO daemon is a small process that acts as the bridge and buffer between a station and the 430. It communicates with the 430 over the I2C trunk.
The NDIO driver is much like any other driver, with one major exception. Since the IO is local, there aren't any queuing mechanisms for reads or writes. Instead, data is read from and written to a shared memory block created by the NDIO daemon, which watches the block for changes.
When a jace is booted, the NDIO processor starts and enters a default state. At the same time, the NDIO daemon starts and establishes communications with the NDIO processor. From this point on, if the processor doesn't receive any messages from the daemon for 30 seconds, the 430 resets and the jace must be rebooted. Fortunately, the daemon talks to the IO processor about every 200ms.
Once a station that contains a BNDIONetwork has been started, a keep alive message is sent from BNDIONetwork to the daemon about every 5 seconds. This is done by the standard Clock.schedulePeriodically method which is processed on the station's engine thread.
If for some reason (like the station crashes or the engine thread gets bogged down) the keep alive isn't sent to the daemon within 30 seconds, the daemon halts communications with the 430. This, of course, results in the 430 resetting back to the default state and requires a reboot.
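The timeout behavior described above can be modeled as a simple watchdog. The following is only a minimal sketch, not the actual daemon code: the class name KeepAliveWatchdog and its methods are hypothetical, and whether the real check fires at exactly 30 seconds or just after is an assumption.

```java
// Hypothetical model of the daemon-side keep-alive check; not the real NDIO daemon.
final class KeepAliveWatchdog {
    private final long timeoutMillis;
    private long lastKeepAliveMillis;

    KeepAliveWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastKeepAliveMillis = startMillis;
    }

    /** Called each time a keep-alive arrives from the station. */
    void keepAlive(long nowMillis) {
        lastKeepAliveMillis = nowMillis;
    }

    /**
     * True once more than timeoutMillis has elapsed since the last keep-alive.
     * Assumption: the boundary is exclusive (exactly 30 s is still in time).
     */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastKeepAliveMillis > timeoutMillis;
    }
}
```

With a 30,000 ms timeout and keep-alives arriving every 5 seconds, isExpired never returns true; it only does so once the station stops calling keepAlive for longer than the timeout, which is the point at which the daemon would stop talking to the 430.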
A handful of properties pertain to the reboot logic; they are described below.
In BNDIONetwork:
- averageKeepAliveLifetime - the average time between keep alives sent since the start of the station
- averageKeepAliveRecent - the average time between the last five keep alives sent
- keepAlivePeak - the largest time between keep alives sent since the start of the station
- totalKeepAlives - the total number of keep alives sent
- totalKeepAliveLateStarts - the total number of keep alives that weren't sent within 10 seconds of the previous keep alive (5 sec + 5 sec tolerance)
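To make the meaning of these statistics concrete, here is a rough sketch of how they could be derived from keep-alive timestamps. It is not the BNDIONetwork source: the class KeepAliveStats is hypothetical, and it assumes that "the last five keep alives" means the last five intervals and that an interval counts as late only when it exceeds 10 seconds.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bookkeeping for the keep-alive statistics; not the BNDIONetwork code.
final class KeepAliveStats {
    private static final int RECENT_WINDOW = 5;            // assumed: last five intervals
    private static final long LATE_THRESHOLD_MS = 10_000;  // 5 sec period + 5 sec tolerance

    private final Deque<Long> recentIntervals = new ArrayDeque<>();
    private long lastSentMillis = -1;   // -1 means no keep-alive sent yet
    private long intervalSumMillis;
    private long intervalCount;

    long totalKeepAlives;
    long totalKeepAliveLateStarts;
    long keepAlivePeak;                 // largest interval seen, in ms

    /** Record that a keep-alive was sent at nowMillis. */
    void recordKeepAlive(long nowMillis) {
        totalKeepAlives++;
        if (lastSentMillis >= 0) {
            long interval = nowMillis - lastSentMillis;
            intervalSumMillis += interval;
            intervalCount++;
            keepAlivePeak = Math.max(keepAlivePeak, interval);
            if (interval > LATE_THRESHOLD_MS) {
                totalKeepAliveLateStarts++;
            }
            recentIntervals.addLast(interval);
            if (recentIntervals.size() > RECENT_WINDOW) {
                recentIntervals.removeFirst();
            }
        }
        lastSentMillis = nowMillis;
    }

    /** Average interval since the start, in ms (0 if fewer than two keep-alives). */
    double averageKeepAliveLifetime() {
        return intervalCount == 0 ? 0.0 : (double) intervalSumMillis / intervalCount;
    }

    /** Average of the most recent intervals, in ms (0 if none recorded). */
    double averageKeepAliveRecent() {
        if (recentIntervals.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (long interval : recentIntervals) {
            sum += interval;
        }
        return (double) sum / recentIntervals.size();
    }
}
```

On a healthy station both averages should sit close to 5000 ms. A keepAlivePeak well above that, or a climbing totalKeepAliveLateStarts, suggests the engine thread is being held up from time to time even if no reboot has occurred yet.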
In BNDIOPingMonitor:
- rebootEnabled - when enabled, will cause the station to be saved and the jace rebooted when the 430 has been reset
- rebootPeriod - the period of time that the max number of reboots can occur before halting the reboot cycle
- maxRebootPeriodCount - the number of reboots that can occur in the rebootPeriod before halting the reboot cycle
- totalRebootRequests - the total number of reboots requested in the reboot period
- lastRebootRequest - the time of the last reboot request
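The interaction between rebootPeriod and maxRebootPeriodCount can be pictured as a rate limiter on reboot requests. The sketch below is only an illustration under assumptions: the class RebootLimiter is hypothetical, and it treats the reboot period as a sliding window, which may not match how BNDIOPingMonitor actually tracks it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical reboot rate limiter; not the BNDIOPingMonitor implementation.
final class RebootLimiter {
    private final long rebootPeriodMillis;
    private final int maxRebootPeriodCount;
    private final Deque<Long> grantedRequests = new ArrayDeque<>();
    private long lastRebootRequest = -1;   // time of the last granted request, -1 if none

    RebootLimiter(long rebootPeriodMillis, int maxRebootPeriodCount) {
        this.rebootPeriodMillis = rebootPeriodMillis;
        this.maxRebootPeriodCount = maxRebootPeriodCount;
    }

    /**
     * Returns true if a reboot may proceed, or false if the maximum number of
     * reboots has already occurred within the period and the cycle should halt.
     */
    boolean requestReboot(long nowMillis) {
        // Assumed sliding window: forget requests older than the reboot period.
        while (!grantedRequests.isEmpty()
                && nowMillis - grantedRequests.peekFirst() >= rebootPeriodMillis) {
            grantedRequests.removeFirst();
        }
        if (grantedRequests.size() >= maxRebootPeriodCount) {
            return false;
        }
        grantedRequests.addLast(nowMillis);
        lastRebootRequest = nowMillis;
        return true;
    }

    /** Number of granted requests in the window, as of the last call to requestReboot. */
    int totalRebootRequests() {
        return grantedRequests.size();
    }

    long lastRebootRequest() {
        return lastRebootRequest;
    }
}
```

The purpose of such a limit is to keep a persistent fault from putting the jace into an endless reboot loop: once the count is reached within the period, further requests are refused until older ones age out.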
The Knowledge Base article "NiagaraAX NDIO Guide" provides more detail on NDIO in general.
One of the side effects of implementing the NDIO reboot logic is that a jace that uses NDIO becomes highly susceptible to routines that consume a lot of time on the engine thread. In general (even without NDIO) this is considered bad news. If a routine consumes too much time on the engine thread, then BNDIONetwork is unable to send the keep alive in time. For this reason, long or potentially long running routines should be offloaded to a worker thread.
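As a rough illustration of offloading, the sketch below hands a slow computation to a background thread so that the calling thread returns immediately. It uses only the standard java.util.concurrent classes rather than any Niagara-specific worker API; the class OffloadExample and the slowSum routine are hypothetical stand-ins for whatever long-running work a program object or driver performs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example of moving slow work off a time-critical thread.
final class OffloadExample {
    // One background worker; the owner should call shutdown() when it stops.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /**
     * Intended to be called from the time-critical (engine) thread.
     * It only queues the job and returns at once; the work runs on the worker.
     */
    Future<Long> startLongJob(final int n) {
        Callable<Long> job = () -> slowSum(n);
        return worker.submit(job);
    }

    /** Stand-in for a long-running routine. */
    static long slowSum(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    void shutdown() {
        worker.shutdown();
    }
}
```

The important detail is that the engine thread should not block on Future.get(); it should check isDone() on a later pass, or have the job post its result back when finished. Calling get() right away would simply move the wait back onto the engine thread.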
One might ask "Why is the keep alive being sent on the engine thread instead of its own worker thread?" Good question. The answer is that even if NDIO were able to send the keep alive, a bogged down engine thread may still result in outputs not being set in a timely fashion. By sending the keep alive on the engine thread, we ensure that everyone is getting the time they need.
Because of this side effect, we periodically get support requests where the jace has been rebooted and the logs indicate that NDIO requested the reboot. The incorrect assumption is that there is a problem with NDIO. In actuality NDIO is simply responding to a condition that has been caused by some other routine.
Unfortunately, diagnosis can be challenging. If you find yourself in this position, the best approach I've found is the following:
- In AX 3.1 and above, check whether there is a stack dump in the station output when it happens. When NDIO requests a reboot, it also requests a stack dump before saving and rebooting. Sometimes there are clues as to which thread (and therefore which routine) is the culprit.
- Use trial and error. Start disabling various pieces and see if the problem clears.
- Look for patterns in the reboots, such as one occurring every hour. Regular intervals may provide a hint as to what is behind the problem.
Many times it is custom program objects or drivers that cause this to happen.