Tuesday 6 January 2015

A script for running processes in parallel in Bash

A script for running processes in parallel in Bash

--------------------------------------------------------------------------------------------------------------
In Bash you can start new processes (theads) on the background simply by running a command with ampersand &. The waitcommand can be used to wait until all background processes have finished (to wait for a certain process do wait PID where PID is a process ID). So here’s a simple pseudocode for parallel processing:
1
2
3
4
5
6
7
8
for ARG in  $*; do
    command $ARG &
    NPROC=$(($NPROC+1))
    if [ "$NPROC" -ge 4 ]; then
        wait
        NPROC=0
    fi
done
I.e. you run 4 processes at a time and wait until all of them have finished before executing the next four. This is a sufficient solution if all of the processes take equally long to finish. However this is suboptimal if running time of the processes vary a lot.
A better solution is to track the process IDs and poll if all of them are still running. In Bash $! returns the ID of last initiated background process. If a process is running, the corresponding PID is found in directory /proc/.
Based on the ideas given in a Ubuntu forum thread and a template on command line parsing, I wrote a simple script “parallel” that allows you to run virtually any simple command concurrently.
Assume that you have a program proc and you want to run something like proc *.jpg using three concurrent processes. Then simply do

parallel -j 3 proc *.jpg

The script takes care of dividing the task. Obviously -j 3 stands for three simultaneous jobs.
If you need command line options, use quotes to separate the command from the variable arguments, e.g.

parallel -j 3 "proc -r -A=40" *.jpg

Furthermore, -r allows even more sophisticated commands by replacing asterisks in the command string by the argument:

parallel -j 6 -r "convert -scale 50% * small/small_*" *.jpg

I.e. this executes convert -scale 50% file1.jpg small/small_file1.jpg for all the jpg files. This is a real-life example for scaling down images by 50% (requires imagemagick).
Finally, here’s the script. It can be easily manipulated to handle different jobs, too. Just write your command between #DEFINE COMMAND and #DEFINE COMMAND END.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
#!/bin/bash
NUM=0
QUEUE=""
MAX_NPROC=2 # default
REPLACE_CMD=0 # no replacement by default
USAGE="A simple wrapper for running processes in parallel.
Usage: `basename $0` [-h] [-r] [-j nb_jobs] command arg_list
    -h      Shows this help
    -r      Replace asterix * in the command string with argument
    -j nb_jobs  Set number of simultanious jobs [2]
 Examples:
    `basename $0` somecommand arg1 arg2 arg3
    `basename $0` -j 3 \"somecommand -r -p\" arg1 arg2 arg3
    `basename $0` -j 6 -r \"convert -scale 50% * small/small_*\" *.jpg"
function queue {
    QUEUE="$QUEUE $1"
    NUM=$(($NUM+1))
}
function regeneratequeue {
    OLDREQUEUE=$QUEUE
    QUEUE=""
    NUM=0
    for PID in $OLDREQUEUE
    do
        if [ -d /proc/$PID  ] ; then
            QUEUE="$QUEUE $PID"
            NUM=$(($NUM+1))
        fi
    done
}
function checkqueue {
    OLDCHQUEUE=$QUEUE
    for PID in $OLDCHQUEUE
    do
        if [ ! -d /proc/$PID ] ; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}
# parse command line
if [ $# -eq 0 ]; then #  must be at least one arg
    echo "$USAGE" >&2
    exit 1
fi
while getopts j:rh OPT; do # "j:" waits for an argument "h" doesnt
    case $OPT in
    h)  echo "$USAGE"
        exit 0 ;;
    j)  MAX_NPROC=$OPTARG ;;
    r)  REPLACE_CMD=1 ;;
    \?) # getopts issues an error message
        echo "$USAGE" >&2
        exit 1 ;;
    esac
done
# Main program
echo Using $MAX_NPROC parallel threads
shift `expr $OPTIND - 1` # shift input args, ignore processed args
COMMAND=$1
shift
for INS in $* # for the rest of the arguments
do
    # DEFINE COMMAND
    if [ $REPLACE_CMD -eq 1 ]; then
        CMD=${COMMAND//"*"/$INS}
    else
        CMD="$COMMAND $INS" #append args
    fi
    echo "Running $CMD"
    $CMD &
    # DEFINE COMMAND END
    PID=$!
    queue $PID
    while [ $NUM -ge $MAX_NPROC ]; do
        checkqueue
        sleep 0.4
    done
done
wait # wait for all processes to finish before exit

No comments: