dibig_tools is a collection of useful tools and scripts for HiPerGator. To use this module, execute:
module load dibig_tools
This module currently includes the following tools:
- submit.py – User-friendly replacement for qsub
- actor.py – Self-documenting scripts for HiPerGator
- split-sam.sh – Split large SAM/BAM files
- csvtoxls.py – Convert tab-delimited files to Excel format
The submit.py command
submit.py is a Python script that replaces the standard qsub command for submitting jobs to HiPerGator. Its main advantage is that it lets you pass command-line arguments to the submitted script easily. Instructions on using submit.py can be obtained by calling it with no arguments:
usage: submit.py [-after jobid] [-done name] scriptName [arguments…]
Submits script “scriptName” using qsub, passing the values of “arguments” to the script as $arg1, $arg2, $arg3, etc.
If “-after” is specified, the script will run after the job indicated by “jobid” has terminated successfully. Multiple “-after” arguments may be specified.
If “-done” is specified, the script will create a file called “name” when it terminates. This can be used to detect that execution of the script has finished.
Additional qsub options can be read from a file (by default, “.qsubrc” in your home directory). You can change the name of this file with the “-conf” argument. The path is relative to your home directory, so for example “-conf confs/largejob.txt” will read configuration options from the largejob.txt file in the confs/ subdirectory of your home. Set this argument to a non-existent file to disable option loading.
This command returns the id of the submitted job, which is suitable as the -after argument for a subsequent job. For example:
STEP2=`submit.py -after $STEP1 step2.sh`
Copyright (c) 2013, University of Florida
A. Riva (email@example.com)
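The job-chaining pattern described in the usage text can also be driven from Python. The following is a minimal sketch, not part of dibig_tools: the helper names build_submit_command, submit, wait_for_marker and run_two_steps are hypothetical, as are the file names step1.sh and step2.done. It assumes that submit.py is on the PATH (after module load dibig_tools), that it prints the job id on standard output, and that -conf is given before the script name, which the usage line does not show.

```python
import os
import subprocess
import time


def build_submit_command(script, args=(), after=(), done=None, conf=None):
    """Build the argument list for a submit.py call (hypothetical helper)."""
    cmd = ["submit.py"]
    for jobid in after:            # one -after option per prerequisite job
        cmd += ["-after", str(jobid)]
    if done is not None:           # marker file created when the job terminates
        cmd += ["-done", done]
    if conf is not None:           # assumption: -conf goes before the script name
        cmd += ["-conf", conf]
    cmd.append(script)
    cmd += [str(a) for a in args]  # seen by the script as $arg1, $arg2, ...
    return cmd


def submit(script, args=(), **options):
    """Run submit.py and return the job id it prints."""
    result = subprocess.run(
        build_submit_command(script, args, **options),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_marker(path, poll_seconds=30.0, timeout=None):
    """Poll until the -done marker file exists; return False on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def run_two_steps():
    """Submit step1.sh, then step2.sh after it, and wait for step2 to end."""
    step1 = submit("step1.sh")
    submit("step2.sh", after=[step1], done="step2.done")
    return wait_for_marker("step2.done")
```

Keeping the command construction separate from the call to subprocess makes it possible to inspect the exact submit.py command line before anything is sent to the scheduler.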
For example, a script to perform the grep command as a HiPerGator job could be written as follows:
grep "$arg1" "$arg2"
Assuming the script above is saved as grep.qsub, it could be called as follows:
$ submit.py grep.qsub word filename.txt
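If the work of a job is done by a Python program called from the job script, the same values can be picked up there. The sketch below is a rough Python counterpart to the grep example. It rests on an assumption: that submit.py exposes the arguments to the job as environment variables named arg1, arg2, and so on (as the -v option of qsub would), whereas the usage text only states that they are available as $arg1, $arg2. The function names are hypothetical, and the match is a plain substring test rather than a grep regular expression.

```python
import os
import sys


def grep_lines(pattern, path):
    """Return the lines of `path` that contain the literal string `pattern`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line for line in handle if pattern in line]


def main():
    # Assumption: the submitted arguments are visible as the environment
    # variables arg1 and arg2 (the source only documents $arg1, $arg2).
    pattern = os.environ["arg1"]
    path = os.environ["arg2"]
    sys.stdout.writelines(grep_lines(pattern, path))
```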
The split-sam.sh command
The following is the documentation for split-sam.sh:
split-sam.sh – Split SAM or BAM files into multiple, smaller files for easier processing.
Usage: /apps/dibig_tools/1.0/bin/split-sam.sh inputfile [nlines] [prefix] [npost]
inputfile – required, should be either a .sam or .bam file.
nlines – number of lines in each output file (default: 1000000).
prefix – prefix of the output files (default: ‘part-’).
npost – number of letters to use in the file name postfix (see below).
Output file names are generated by concatenating ‘prefix’ with the strings aaa, aab, aac, etc. Three letters allow for at most 17576 (26^3) output files. If this is not sufficient, you can increase the number of letters used with the npost argument. For example, if the value for that argument is 4, the strings will be aaaa, aaab, aaac, etc. (at most 456976, or 26^4, output files). Output files are written in SAM format.
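The naming scheme can be illustrated with a short Python sketch. It assumes that the postfixes run through lowercase letter strings in alphabetical order, in the style of the Unix split command, and that the default postfix length is 3. The function name output_names is hypothetical and the code is not taken from split-sam.sh.

```python
from itertools import islice, product
from string import ascii_lowercase


def output_names(prefix="part-", npost=3):
    """Yield output file names: prefix + 'aaa', prefix + 'aab', ..."""
    # product() varies the last letter fastest, which gives aaa, aab, aac, ...
    for letters in product(ascii_lowercase, repeat=npost):
        yield prefix + "".join(letters)


# First few names with the default postfix length and with npost=4.
first_three = list(islice(output_names(), 3))
four_letter = list(islice(output_names(npost=4), 3))
```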
After alignment, the resulting BAM files can be concatenated with the samtools ‘cat’ or ‘merge’ commands. Please see the samtools documentation for details.
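As a hedged illustration of the concatenation step, the sketch below builds a samtools cat command for a set of aligned part files. The helper names, the part-*.bam naming pattern and the merged.bam output name are assumptions made for illustration; -o is the output-file option of samtools cat. The part names are sorted so that the alphabetical postfixes keep the pieces in their original order.

```python
import glob
import subprocess


def build_cat_command(part_files, output="merged.bam"):
    """Return the argument list for `samtools cat` over the given BAM parts."""
    if not part_files:
        raise ValueError("no BAM part files given")
    return ["samtools", "cat", "-o", output] + sorted(part_files)


def concatenate_parts(pattern="part-*.bam", output="merged.bam"):
    """Concatenate all BAM files matching `pattern` (requires samtools)."""
    subprocess.run(build_cat_command(glob.glob(pattern), output), check=True)
```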
The csvtoxls.py command
csvtoxls.py is a script that converts one or more tab-delimited files into an Excel file (in xlsx format). The documentation for this command can be obtained by calling it with no arguments, and is reproduced below.
csvtoxls.py – Convert a tab-delimited file to .xlsx format
csvtoxls.py outfile.xlsx file1.csv [file1 options…] [file2.csv [file2 options…]] …
Each .csv file appearing on the command line is written to outfile.xlsx as a separate sheet. The csv file name may be followed by one or more options that apply only to that file. Valid options are:
-name S – set the sheet name to S.
-delim D – file uses delimiter D. Possible values are: ‘tab’, ‘space’, or a single character (default: tab).
-width N – set the width of all columns to N.
-firstrow N – place the first row of the csv file in row N of the sheet (default: 1).
-firstcol N – place the first column of the csv file in column N of the sheet (default: 1).
-rowhdr N – format row N of the csv file as a header (bold). This option may appear multiple times with different values of N.
-colhdr N – format column N of the csv file as a header (bold). This option may appear multiple times with different values of N.
-firstrowhdr – equivalent to -rowhdr 1 (first row will be bold).
-firstcolhdr – equivalent to -colhdr 1 (first column will be bold).
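The rule that options apply only to the file they follow can be made concrete with a small Python sketch. This is an illustration of the command-line grammar described above, not the parser used by csvtoxls.py; the function name parse_sheets and the dictionary layout are hypothetical.

```python
VALUE_OPTIONS = {"-name", "-delim", "-width", "-firstrow", "-firstcol"}
REPEATABLE_OPTIONS = {"-rowhdr", "-colhdr"}
FLAG_ALIASES = {"-firstrowhdr": ("-rowhdr", "1"),
                "-firstcolhdr": ("-colhdr", "1")}


def parse_sheets(argv):
    """Split [outfile, file1, opts..., file2, opts...] into per-file settings."""
    if len(argv) < 2:
        raise ValueError("need an output file and at least one input file")
    outfile, rest = argv[0], argv[1:]
    sheets = []
    i = 0
    while i < len(rest):
        token = rest[i]
        if token in FLAG_ALIASES:
            option, value = FLAG_ALIASES[token]
        elif token in VALUE_OPTIONS or token in REPEATABLE_OPTIONS:
            if i + 1 >= len(rest):
                raise ValueError("option %s needs a value" % token)
            option, value = token, rest[i + 1]
            i += 1  # skip over the option's value
        else:
            # Anything that is not an option starts a new input file (sheet).
            sheets.append({"file": token, "-rowhdr": [], "-colhdr": []})
            i += 1
            continue
        if not sheets:
            raise ValueError("option %s appears before any input file" % token)
        if option in REPEATABLE_OPTIONS:
            sheets[-1][option].append(int(value))  # may be given several times
        else:
            sheets[-1][option] = value
        i += 1
    return outfile, sheets
```

For a command line such as csvtoxls.py out.xlsx a.csv -name A -firstrowhdr b.csv -delim , -rowhdr 1 -rowhdr 2 (file names invented for the example), the sketch attaches -name and the first-row header to a.csv only, and the comma delimiter and both header rows to b.csv only.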
Full documentation and source code are available on GitHub: