prll [ -b | -B ] [ -c num ] [ -q | -Q ] { -s str | funct } { -p | -0 | args }
prll (pronounced "parallel") is a utility for use with sh-compatible shells, such as bash(1), zsh(1) and dash(1). It provides a convenient interface for parallelizing the execution of a single task over multiple data files, or indeed any kind of data that you can pass as a shell function argument. It is meant to make it simple to fully utilize a multicore/multiprocessor machine, or simply to run long-running tasks in parallel. Its distinguishing feature is the ability to run shell functions in the context of the current shell.
All names beginning with 'prll_' are reserved and should not be used. The following are intended for use in user-supplied functions: prll_interrupt and prll_splitarg, both described below.
prll is designed to be used not just in shell scripts, but especially in interactive shells. To make the latter convenient, it is implemented as a shell function. This means that it inherits the whole environment of your current shell. It uses helper programs written in C. To prevent race conditions, System V message queues and semaphores are used to signal job completion. It also features full output buffering to prevent data from being mangled by concurrent output.
To execute a task, create a shell function that does something to its first argument. Pass that function to prll along with the arguments you wish to execute it on.
As an alternative, you may pass the -s flag, followed by a string. The string will be executed as if it were the body of a shell function. Therefore, you may use '$1' to reference its first (and only) argument. Be sure to quote the string properly to prevent shell expansion.
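For example, the following two invocations are equivalent; gzip(1) and the file names merely stand in for a real task:

myfn() { gzip "$1" ; }
prll myfn *.txt

prll -s 'gzip "$1"' *.txt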
Instead of arguments, you can use options -p or -0. prll will then take its arguments from stdin. The -p flag will make it read lines and the -0 flag will make it read null-delimited input. This mode emulates the xargs(1) utility a bit, but is easier for interactive use because xargs(1) makes it hard to pass complex commands. Reading large arguments (such as lines several megabytes long) in this fashion is slow, however. If your data comes in such large chunks, it is much faster to split it into several files and pass a list of those to prll instead.
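For example, assuming a function myfn as above, file names produced by find(1) can be passed safely even if they contain spaces or newlines:

find . -name '*.txt' -print0 | prll myfn -0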
The -b option disables output buffering. See below for explanation. Alternatively, buffering may be disabled by setting the PRLL_BUFFER environment variable to 'no'. Use the -B option to override this.
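For example, with some function myfn, the first invocation below runs unbuffered, while the last one buffers regardless of the environment:

prll -b myfn *.txt

PRLL_BUFFER=no
prll -B myfn *.txt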
The -q and -Q options provide two levels of quietness. Both suppress progress reports. The -Q option also disables the startup and end messages. Both let error messages emitted by your jobs through.
The number of tasks to be run in parallel is provided with the -c option or via the PRLL_NR_CPUS environment variable. If it is not provided, prll will look into the /proc/cpuinfo file and extract the number of CPUs in your computer.
Execution can be suspended normally using Ctrl+Z. prll should be subject to normal job control, depending on the shell.
If you need to abort execution, you can do it with the usual Ctrl+C key combination. prll will wait for remaining jobs to complete before exiting. If the jobs are hung and you wish to abort immediately, use Ctrl+Z to suspend prll and then kill it using your shell's job control.
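For example, after suspending prll with Ctrl+Z, the shell reports a job number that can be used to kill it (job number 1 is illustrative):

kill %1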
The command prll_interrupt is available from within your functions. It causes prll to abort execution in the same way as Ctrl+C.
prll cleans after itself, except when you force termination. If you kill prll, jobs and stale message queues and semaphores will be left lying around. The jobs' PIDs are printed during execution so you can track them down and terminate them. You can list the queues and semaphores using the ipcs(1) command and remove them with the ipcrm(1) command. Refer to your system's documentation for details. Be aware that other programs might (and often do) make use of IPC facilities, so make sure you remove the correct queue or semaphore. Their keys are printed when prll starts.
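For example, if prll reported a message queue key of 0x4d000001 and a semaphore key of 0x4d000002 at startup (both keys here are illustrative), the leftovers could be listed and removed like this:

ipcs -q -s
ipcrm -Q 0x4d000001
ipcrm -S 0x4d000002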
Transport of data between programs is normally buffered by the operating system. These buffers are small (e.g. 4kB on Linux), but are enough to enhance performance. When multiple programs write to the same destination, as is the case with prll, the arrangement looks like this:
+-----+    +-----------+
| job |--->| OS buffer |\
+-----+    +-----------+ \
                          \
+-----+    +-----------+   \+-------------+
| job |--->| OS buffer |--->| Output/File |
+-----+    +-----------+   /+-------------+
                          /
+-----+    +-----------+ /
| job |--->| OS buffer |/
+-----+    +-----------+

The output can be passed to another program, over a network or into a file. But the jobs run in parallel, so the question is: what will the data they produce look like at the destination when they write it at the same time?
If a job writes less data than the size of the OS buffer, then everything is fine: the buffer is never filled and the OS flushes it when the job exits. All output from that job is in one piece because the OS will flush only one buffer at a time.
If, however, a job writes more data than that, then the OS flushes the buffer each time it is filled. Because several jobs run in parallel, their outputs become interleaved at the destination, which is not good.
prll does additional job output buffering by default. The actual arrangement when running prll looks like this:
+-----+    +-----------+    +-------------+
| job |--->| OS buffer |--->| prll buffer |\
+-----+    +-----------+    +-------------+ \
                                   |          \
+-----+    +-----------+    +-------------+   \+-------------+
| job |--->| OS buffer |--->| prll buffer |--->| Output/File |
+-----+    +-----------+    +-------------+   /+-------------+
                                   |          /
+-----+    +-----------+    +-------------+  /
| job |--->| OS buffer |--->| prll buffer |/
+-----+    +-----------+    +-------------+

Note the vertical connections between prll buffers: they synchronise so that they only write data to the destination one at a time. They make sure that all of the output of a single job is in one piece. To keep performance high, the jobs must keep running, therefore each buffer must be able to keep taking in data, even if it cannot immediately write it. To make this possible, prll buffers aren't limited in size: they grow to accommodate all data a job produces.
This raises another concern: you need to have enough memory to contain the data until it can be written. If your jobs produce more data than you have memory, you need to redirect it to files. Have each job create a file and redirect all its output to that file. You can do that however you want, but there should be a helpful utility available on your system: mktemp(1). It is dedicated to creating files with unique names.
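A minimal sketch of this pattern, with compute standing in for the actual workload:

myfn() {
    out=$(mktemp output.XXXXXXXX) || return 1
    compute "$1" > "$out"
}
prll myfn *.dat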
As noted in the usage instructions, prll's additional buffering can be disabled. It is not necessary to do this when each job writes to its own file. It is meant to be used as a safety measure. prll was written with interactive use in mind, and when writing functions on the fly, it can easily happen that an error creeps in. If an error causes spurious output (e.g. if the function gets stuck in an infinite loop), it can easily waste a lot of memory. The option to disable buffering is meant to be used when you believe that your jobs should only produce a small amount of data, but aren't sure that they actually will.
It should be noted that buffering only applies to standard output. The OS buffers standard error differently (i.e. line by line) and prll does nothing to change that.
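If you want a job's error messages kept in one piece as well, you can redirect them into standard output inside your function, at the cost of mixing the two streams (compute again stands in for the actual task):

myfn() { compute "$1" 2>&1 ; }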
Suppose you have a set of photos that you wish to process using the mogrify(1) utility. Simply do
myfn() { mogrify -flip "$1" ; }
prll myfn *.jpg

This will run mogrify on each jpg file in the current directory. If your computer has 4 processors, but you wish to run only 3 tasks at once, you should use
prll -c 3 myfn *.jpg

Or, to make it permanent in the current shell, do
PRLL_NR_CPUS=3

on a line of its own. You don't need to export the variable because prll automatically has access to everything your shell can see.
All examples here are very short. Unless you need it later, it is quicker to pass such a short function on the command line directly:
prll -s 'mogrify -flip $1' *.jpg

prll now automatically wraps the code in an internal function so you don't have to. Don't forget about the single quotes, or the shell will expand $1 before prll is run.
If you have a more complicated function that has to take more than one argument, you can use a trick: combine multiple arguments into one when passing them to prll, then split them again inside your function. You can use shell quoting to achieve that. Inside your function, prll_splitarg is available to take the single argument apart again, i.e.
myfn() {
    prll_splitarg
    process $prll_arg_1
    compute $prll_arg_2
    mangle $prll_arg_3
}
prll myfn 'a1 b1 c1' 'a2 b2 c2' 'a3 b3 c3' ...

If you have even more complex requirements, you can use the '-0' option and pipe null-delimited data into prll, then split it any way you want. Modern shells have powerful read(1) builtins.
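As a sketch of that approach in bash or zsh (process stands in for the actual task), each null-delimited record below carries two colon-separated fields, which the function splits apart with the read builtin:

myfn() {
    IFS=: read -r first second <<< "$1"
    process "$first" "$second"
}
printf '%s:%s\0' a1 b1 a2 b2 | prll myfn -0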
You may wish to abort execution if one of the results is wrong. In that case, use something like this:
myfn() { result=$(compute "$1"); [[ $result == "wrong" ]] && prll_interrupt; }

This is also useful when doing anything similar to a parallel search: abort execution when the result is found.
If you have many arguments to process, it might be easier to pipe them to standard input. Suppose each line of a file is an argument of its own. Simply pipe the file into prll:
myfn() { some; processing | goes && on; here; }
cat file_with_arguments | prll myfn -p > results

Remember that it's not just CPU-intensive tasks that benefit from parallel execution. You may have many files to download from several slow servers, in which case, the following might be useful:
prll -c 10 -s 'wget -nv "$1"' -p < links.txt
This section describes issues and bugs that were known at the time of release. Check the homepage for more current information.
Known issues:
Homepage: http://prll.sourceforge.net/