Ppss


|P|P|S|S| Distributed Parallel Processing Shell Script 2.97 ([1]) speeds up repeated tasks that share an input format. The command or script is applied to every file in the input folder, using the available resources. It is intended for computers with sufficient CPU and I/O capacity and is not meant for a single-CPU computer.


|P|P|S|S| Distributed Parallel Processing Shell Script 2.97

usage: ppss [ -d <sourcedir> | -f <sourcefile> ]  [ -c '<command> "$ITEM"' ]
                 [ -C <configfile> ] [ -j ] [ -l <logfile> ] [ -p <# jobs> ]
                 [ -q ] [ -D <delay> ] [ -h ] [ --help ] [ -r ] [ --daemon ]
 
Examples:
                 ppss -d /dir/with/some/files -c 'gzip '
                 ppss -d /dir/with/some/files -c 'cp "$ITEM" /tmp' -p 2
                 ppss -f <file> -c 'wget -q -P /destination/directory "$ITEM"' -p 10
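
The examples above use two command forms: one ending in a space, where the item is appended, and one that places "$ITEM" explicitly. The following sketch mimics that behaviour with a plain shell function so the difference can be tried without PPSS. The helper name run_on_item is hypothetical, and the use of eval is an assumption for illustration, not a description of PPSS internals.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mimics how an item might be handed to the -c command.
# This is an illustration only, not PPSS's actual implementation.
run_on_item() {
    local cmd="$1"
    local ITEM="$2"
    case "$cmd" in
        # Command names the item explicitly: run it as written.
        *'$ITEM'*) eval "$cmd" ;;
        # Trailing-space form: append the (quoted) item to the command.
        *)         eval "$cmd"'"$ITEM"' ;;
    esac
}

run_on_item 'echo appended: ' 'a file.txt'       # prints: appended: a file.txt
run_on_item 'echo "$ITEM" copied' 'a file.txt'   # prints: a file.txt copied
```

Quoting "$ITEM" inside the command matters as soon as file names contain spaces, as in the item used here.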

ppss manpage

ppss --help
 
|P|P|S|S| Distributed Parallel Processing Shell Script 2.97
 
PPSS is a Bash shell script that executes commands in parallel on a set  
of items, such as files in a directory, or lines in a file.
 
Usage: /opt/biotools/bin/ppss [ MODE ] [ options ]
 
Modes are optional and mainly used for running in distributed mode. Modes are:
 
 config       Generate a config file based on the supplied option parameters.
 deploy       Deploy PPSS and related files on the specified nodes.
 erase        Erase PPSS and related files from the specified nodes.
 
 start        Starting PPSS on nodes.
 pause        Pausing PPSS on all nodes.
 stop         Stopping PPSS on all nodes.
 continue     Continuing PPSS on all nodes.
 node         Running PPSS as a node, requires additional options.
Usage: /opt/biotools/bin/ppss [ options ]
 
--command | -c     Command to execute. Syntax: '<command> ' including the single quotes.
                   Example: -c 'ls -alh '. It is also possible to specify where an item 
                   must be inserted: 'cp "$ITEM" /somedir'.
 
--sourcedir | -d   Directory that contains files that must be processed. Individual files
                   are fed as an argument to the command that has been specified with -c.
 
--sourcefile | -f  Each single line of the supplied file will be fed as an item to the
                   command that has been specified with -c. Read input from stdin with
                   -f -
 
--config | -C      If the mode is config, a config file with the specified name will be
                   generated based on all the options specified. In the other modes,
                   this option will result in PPSS reading the config file and start
                   processing items based on the settings of this file.
 
--disable-ht | -j  Disable hyper-threading, which is enabled by default.
 
--log | -l         Sets the name of the log file. The default is ppss-log.txt.
 
--processes | -p   Start the specified number of processes. Ignore the number of available
                   CPUs.
 
--quiet | -q       Shows no output except for a progress indication using percents.
 
--delay | -D       Adds an initial random delay to the start of all parallel jobs to spread
                   the load. The delay (seconds) is only used at the start of all 'threads'.
 
--daemon           Daemon mode. Do not exit after items are processed, but keep looking
                   for new items and process them. Read the manual how to use this!
                   See --help for important additional options regarding daemon mode.
 
--disable-inotify  Linux users can use real-time inotify filesystem events when using
                   daemon mode. Requires inotify-tools. Enabled by default if available.
                   Automatically disabled if NFS is used as the daemon source dir.
 
--no-traversal|-r  By default, PPSS uses the regular 'find' command to list all files
                   within the directory specified by the -d option. If you do not wish
                   for PPSS to process files in sub directories, use this option.
                   Only files within the specified directory will be processed. Any
                   subdirectories will then be ignored.
 
--email | -e       PPSS sends an e-mail if PPSS has finished. It is also used if processing
                   of an item has failed (configurable, see -h). 
 
--debug            Enable debugging output to the |P|P|S|S| log file.
 
--help             Extended help, including options for distributed mode.
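
Before starting a long run it can help to preview which files a source directory would contribute, with and without sub directories. The sketch below uses 'find' directly. It assumes that -r (--no-traversal) behaves like limiting 'find' to depth 1; the exact 'find' invocation used by PPSS is not shown in the help text, and the function name list_items is hypothetical.

```shell
#!/usr/bin/env bash
# Preview which files a source directory would contribute.
# Assumption: -r (--no-traversal) behaves like limiting find to depth 1.
list_items() {
    local dir="$1" recurse="${2:-yes}"
    if [ "$recurse" = "yes" ]; then
        find "$dir" -type f               # default: include sub directories
    else
        find "$dir" -maxdepth 1 -type f   # only files directly in $dir
    fi
}

# Scratch directory with two top-level files and one nested file.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/a.txt" "$demo/b.txt" "$demo/sub/c.txt"

list_items "$demo" yes | wc -l   # counts all three files
list_items "$demo" no  | wc -l   # counts only the two top-level files
rm -rf "$demo"
```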
 
The following options are used for distributed execution of PPSS.
 
--master | -m      Specifies the SSH server that is used for communication between nodes.
                   Using SSH, file locks are created, informing other nodes that an item 
                   is locked. If items are files that must be processed, they must reside
                   on this host. SCP is used to transfer files from this host to nodes
                   for local processing.
 
--node | -n        File containing a list of nodes that act as PPSS clients. One IP / DNS
                   name per line.
 
--key | -k         The SSH key that a node uses to connect to the master.
 
--known-hosts | -K The file that contains the server public key. Can often be found on  
                   hosts that already once connected to the server. See the file 
                   ~/.ssh/known_hosts or else, manually connect once and check this file.
 
--user | -u        The SSH user name that is used by the node when logging in to the
                   master SSH server.
 
--script | -S      Specifies the script/program that must be copied to the nodes for
                   execution through PPSS. Only used in the deploy mode.
                   This option should be specified if necessary when generating a config.
 
--download         This option specifies that an item will be downloaded by the node
                   from the server or share to the local node for processing.
 
--upload           This option specifies that the output file will be copied back to
                   the server, the --outputdir option is mandatory.
 
--no-scp | -b      Do not use scp for downloading items. Use cp instead. Assumes that a
                   network file system (NFS/SMB) is mounted under a local mount point.
 
--outputdir | -o   Directory on server where processed files are put. If the result of 
                   encoding a wav file is an mp3 file, the mp3 file is put in the 
                   directory specified with this option.
 
--homedir | -H     Directory in which PPSS is installed on the node.
                   Default is 'ppss-home'.
 
--script | -S      Script to run on the node. PPSS must copy this script to the node.
 
--randomize | -R   Randomise the order in which clients process items in distributed
                   mode. With many nodes, this prevents some clients from spending
                   all their time trying to get a lock on an item.
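
The --node option expects a file with one IP address or DNS name per line. A small pre-flight check such as the one below can catch a malformed list before deploying. This is a sketch only: the function name check_nodes_file is hypothetical, and rejecting blank lines and embedded whitespace is a deliberately strict assumption, since the help text does not say how PPSS treats such lines.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a node list: one host per line,
# no blank lines and no embedded whitespace (stricter than PPSS may require).
check_nodes_file() {
    local file="$1" line n=0
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # The '|| [ -n "$line" ]' part also handles a last line without newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        case "$line" in
            ''|*[[:space:]]*)
                echo "line $n: expected exactly one host name or IP" >&2
                return 1 ;;
        esac
    done < "$file"
    [ "$n" -gt 0 ] || { echo "$file is empty" >&2; return 1; }
    echo "$n node(s) listed"
}

# Example node list with placeholder addresses.
nodes=$(mktemp)
printf '%s\n' 10.0.0.101 10.0.0.102 node3.example.org > "$nodes"
check_nodes_file "$nodes"    # prints: 3 node(s) listed
rm -f "$nodes"
```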
 
Example: encoding some wav files to mp3 using lame:
 
/opt/biotools/bin/ppss -c 'lame ' -d /path/to/wavfiles -j 
 
Running PPSS based on a configuration file.
 
/opt/biotools/bin/ppss -C config.cfg
 
Generating a configuration file. Wavs are converted to mp3. SCP is used for data transfer.
 
/opt/biotools/bin/ppss config -C ppss-config.cfg -d /some/dir -o output --download --upload -K known_hosts \
-k ppss-key.dsa -n nodes.txt -m 10.0.0.100 \
-c 'lame --quiet "$ITEM" -o "$OUTPUT_DIR/$OUTPUT_FILE".mp3' 
 
Running PPSS on a client as part of a cluster.
 
/opt/biotools/bin/ppss node -d /somedir -c 'cp "$ITEM" /some/destination' -m 10.0.0.50 -u ppss -k ppss-key.key
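
The modes listed in the help text suggest a life cycle for a distributed run, each mode reading the generated config file through -C. The dry-run sketch below only prints the commands in one plausible order; nothing is executed. The order shown and the function name plan_cluster_run are assumptions, and the config file name is taken from the example above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
PPSS=${PPSS:-/opt/biotools/bin/ppss}   # install path used on this page
CONFIG=${CONFIG:-ppss-config.cfg}      # config file generated with 'ppss config'

# Assumed order of the distributed modes listed in the help text.
plan_cluster_run() {
    local mode
    for mode in deploy start pause continue stop erase; do
        echo "$PPSS $mode -C $CONFIG"
    done
}

plan_cluster_run
```

Running the printed lines one at a time, rather than as a batch, leaves room to check the nodes between steps.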

References:
  1. http://code.google.com/p/ppss/


