A friend called me the other day with a strange bug. He was running a Python simulation for his thesis, and it gave different results on different machines. “The strange thing is that the simulation is deterministic”, he said. “The damn thing should produce the exact same result in each and every execution”.
This kind of non-deterministic execution path is one of the most frequent – and frustrating – phenomena in software development. The reasons vary greatly – from compiler and bus architecture differences, all the way to rsync problems causing the wrong code to execute remotely. However, there are some common pitfalls that should be considered when approaching such a problem.
Common non-deterministic factors in your code
The poster child: I/O and Network
I/O and network are the first places to look for non-deterministic behavior. External resources can vary in permissions, existence, contents, availability, encoding and many other factors, causing code to behave differently in slightly different – or even the same – environments.
The following checklist can help diagnose common file access pitfalls:
- Permissions and ownership: Can your code access all of its resources? Who’s executing the process? Is something altering permissions on the way? rsync, for example, can alter ownership during transfer.
- Absolute paths: If the resource is outside the directory tree of your code, make sure that the path is an argument – either command line argument or a value in a configuration file – and never hard-coded.
- Are all buffers properly flushed? If not, you might not be reading the same length from the file in each access.
- Encoding: Is the code silently relying on default system encoding for reading a text file? Only naïve developers believe in “Plain ASCII text”.
- Do you have enough disk space and quota for writing your output?
- Is your code the only process that accesses the file? In POSIX environments, lsof (“list open files”) is a useful analysis tool. Another good place to check is crontab, which might periodically update your resource.
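A few of the checklist items above (paths as arguments, explicit encoding, flushed buffers) can be made explicit in code. The following is a minimal sketch rather than a definitive recipe: the helper names `parse_args`, `read_text_explicit` and `write_text_flushed`, the `--input` flag and the UTF-8 default are all assumptions made for illustration.

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    """Take the resource path as an argument instead of hard-coding it."""
    parser = argparse.ArgumentParser()
    # "--input" is a hypothetical flag name chosen for this sketch.
    parser.add_argument("--input", required=True, type=Path)
    return parser.parse_args(argv)


def read_text_explicit(path, encoding="utf-8"):
    """Read text with a stated encoding rather than the system default."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text_flushed(path, text, encoding="utf-8"):
    """Write text and push it to disk before returning."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)
        handle.flush()             # empty Python's userspace buffer
        os.fsync(handle.fileno())  # ask the OS to commit the data to disk
```

Stating the encoding at every `open` call means a machine with a different locale decodes the same bytes the same way, or fails visibly if it cannot.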
When interacting with a network socket, make sure that your communication with the desired resource is allowed locally. Amazon EC2, for instance (no pun intended), used to block outside communication to new machines by default. On the other hand, make sure that the receiving side does not block connections from your machines. Never assume that responses are returned in the same order the requests were dispatched. The proper data structure for responses is a map from URL to response, not an array or a list of responses – either of which might suggest an order.
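The URL-to-response map can be sketched with the standard library's `concurrent.futures`. This is an illustration under assumptions: `fetch_all` is a hypothetical helper, and `fake_fetch` is a stand-in that sleeps for a random time instead of touching the network, so the example runs anywhere.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Dispatch all requests and return a {url: response} mapping."""
    responses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which URL each future belongs to.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields futures in completion order, not dispatch order.
        for future in as_completed(future_to_url):
            responses[future_to_url[future]] = future.result()
    return responses


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP call, with a random delay."""
    time.sleep(random.uniform(0.0, 0.05))
    return "body of " + url


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
responses = fetch_all(urls, fake_fetch)
# Look responses up by URL; never rely on the order they arrived in.
print(responses["http://example.com/b"])
```

Because every response is keyed by its URL, the result is the same mapping no matter which request happens to finish first.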
My approach for debugging I/O is extreme explicitness. Make the software scream loudly whenever a resource has the slightest problem – either by exiting immediately, writing to stdout, or logging an error.
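One possible shape for this kind of explicitness is a small guard that is called before a resource is used. The names `ResourceError` and `require_readable_file` and the logger name are hypothetical, and the two checks shown are only a starting point.

```python
import logging
import os

logger = logging.getLogger("resources")  # logger name is arbitrary


class ResourceError(RuntimeError):
    """Raised when a required resource is not usable (hypothetical type)."""


def require_readable_file(path):
    """Return path unchanged if it is a readable regular file, else raise."""
    if not os.path.isfile(path):
        logger.error("resource missing or not a regular file: %s", path)
        raise ResourceError("missing or not a regular file: %s" % path)
    if not os.access(path, os.R_OK):
        logger.error("resource exists but is not readable: %s", path)
        raise ResourceError("not readable: %s" % path)
    return path
```

If the exception is left uncaught, the interpreter prints a traceback and exits with a non-zero status, so a missing input stops the run instead of silently changing its result.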
Concurrency and parallelism
Implicit concurrency and parallelism (“I didn’t start that thread!“)
Spawning processes and threads almost guarantees a non-deterministic execution path. In most operating systems, context switching takes the current load factor into consideration, causing an almost arbitrary switching order. The common pitfall is that invoking a new thread or process is not always explicit. For example
Iterating non-ordered data structures
>>> s = set(['a', 'b', 'c'])
>>> ''.join(s)
'acb'
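One way to make this iteration repeatable is to impose an order explicitly before consuming the set. A minimal sketch follows; `join_sorted` is a hypothetical helper name.

```python
def join_sorted(items):
    """Join strings from an unordered collection in a fixed, sorted order."""
    return ''.join(sorted(items))


s = set(['a', 'b', 'c'])
print(join_sorted(s))  # prints abc regardless of the set's internal order
```

In Python 3, string hashing is randomized per process by default, so the unsorted `''.join(s)` can differ between two runs of the same script on the same machine. Sorting removes that dependency.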
Using object reference as value