KBAS is distributed as a GNU/Linux command-line tool designed to be included in automated pipelines for genome-wide association analysis.
How does KBAS work?
The KBAS workflow involves three steps: hypothesis generation, refinement and testing. The process begins with a set of hypotheses, each containing a user-defined set of markers based on preexisting expert knowledge. The hypotheses are then evaluated, ranked, refined and selected according to how well they fit the available data, using a variation of the CHC class of genetic algorithms (GAs). The GA receives a user-defined number of randomly selected potential solutions to the problem under consideration (the population size). Each solution is a unique combination of markers with randomly assigned weights. Solutions are then evaluated according to a user-specified fitness function and evolved over a user-defined simulated time period (the number of generations). Over the generations, the GA engine refines the initial marker set by removing markers that contribute little to the trait, and produces the model that best fits the fitness function. The GA stops either at the last generation or as soon as the fitness of the best-fitting model in a generation exceeds a threshold specified by the user.
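The loop described above can be illustrated with a minimal Python sketch. This is not the KBAS implementation and not the CHC variant it uses: it is a simplified elitist GA, and the function name evolve, its parameters and the toy fitness function are all hypothetical. It shows only the general pattern: a random initial population of weighted marker sets, evaluation by a fitness function, and two stopping conditions (last generation or fitness threshold).

```python
import random

def evolve(markers, fitness, pop_size=20, generations=50,
           threshold=None, max_weight=1, seed=0):
    """Toy elitist GA over marker-weight vectors (NOT KBAS's CHC variant)."""
    rng = random.Random(seed)

    def random_solution():
        # One random weight per marker; weight 0 means "marker not used".
        return [rng.randint(0, max_weight) for _ in markers]

    def crossover(a, b):
        # Uniform crossover: each weight is taken from one of the two parents.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(solution, rate=0.05):
        # Redraw each weight with a small probability.
        return [rng.randint(0, max_weight) if rng.random() < rate else w
                for w in solution]

    population = [random_solution() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Early stop once the best model reaches the requested fitness.
        if threshold is not None and fitness(best) >= threshold:
            break
        children = [mutate(crossover(*rng.sample(population, 2)))
                    for _ in range(pop_size)]
        # Elitist survival: keep the fittest pop_size of parents and children.
        population = sorted(population + children, key=fitness,
                            reverse=True)[:pop_size]
        best = population[0]
    # The reported model keeps only markers with a non-zero weight.
    model = {m: w for m, w in zip(markers, best) if w > 0}
    return model, fitness(best)

# Hypothetical toy problem: rs1 and rs3 are informative, the others are noise.
markers = ["rs1", "rs2", "rs3", "rs4"]
informative = {"rs1", "rs3"}

def toy_fitness(weights):
    return sum(w if m in informative else -w
               for m, w in zip(markers, weights))

model, best_fitness = evolve(markers, toy_fitness, threshold=2)
print(model, best_fitness)
```

In this toy setting, markers that lower the fitness tend to end up with zero weight and drop out of the final model, which mirrors the refinement step described above.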
After downloading the compressed KBAS package using the link below, copy it to an appropriate directory and execute the following command:
tar -xvf kbas-1.0.tar.gz
This will create a directory called kbas-1.0/, which contains the ‘kbas’ and ‘demo’ executable scripts.
KBAS includes an interactive demonstration that explains the basic steps in running the program. To run the demonstration, start the ‘demo’ script in the kbas-1.0 directory and follow the instructions. We highly recommend running the demonstration before starting to use KBAS, since it will provide you with an overview of the KBAS process and with a description of the most important command-line options.
First Step: Converting genotype files to KBAS binary format
KBAS inputs consist of two sets of genotypes, one for the control population and one for the case population, and a file containing a list of SNP identifiers. The case-control datasets can be in either tab-delimited or VCF format, and must be converted into the binary format usable by KBAS. This is done using the -convert command followed by the appropriate conv-… arguments as follows:
- Input genotypes file to be converted.
- Pathname of the directory where the converted files should be stored. If not specified, the current directory is used.
- Prefix used to name the converted files, i.e. the P.bin, P-names.txt and P-snps.txt files, where P is the specified prefix.
- Format of input file. Possible values are: vcf, csv (both alleles contained in a single column), csv2 (alleles contained in separate columns).
- Delimiter used in input file to separate the fields in each row when the input file is in csv or csv2 formats. Possible values are: tab, comma, space, colon, semicolon.
- Column containing marker identifiers in genotypes file.
- Column containing subject names in genotypes file.
- Column containing subject classification if the genotype file contains subjects of different classes (e.g. 0=unaffected, 1=affected).
- Column containing allele(s). When using the csv format (both alleles contained in a single column), specify a single number. When using the csv2 format (alleles in separate columns), specify two column numbers separated by a comma (e.g. 3,4).
- Character between alleles if both alleles are in the same column, e.g. / for alleles represented as A/C. It need not be specified if the alleles are not separated (e.g. AC).
- File containing subject names.
- Column containing subject names in conv-names file.
./kbas -convert \
-conv-in databases/case/case-genotypes.csv \
-conv-dir databases/case/ \
-conv-prefix case \
-conv-format csv \
-conv-delim tab \
-conv-marker-col 1 \
-conv-subj-col 2
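The difference between the csv and csv2 layouts, and the role of the allele separator, can be illustrated with a short Python sketch. The helper split_alleles is hypothetical and is not part of KBAS. It assumes that column numbers are 1-based, as the example command above suggests, and that an unseparated genotype is exactly two characters long.

```python
def split_alleles(fields, allele_cols, separator=None):
    """Return (allele1, allele2) from one row of fields.

    allele_cols holds 1-based column numbers (an assumption):
    one number for the csv layout, two for the csv2 layout.
    """
    if len(allele_cols) == 2:
        # csv2 layout: each allele sits in its own column.
        first, second = (fields[c - 1] for c in allele_cols)
        return first, second
    cell = fields[allele_cols[0] - 1]
    if separator:
        # csv layout with a separator, e.g. "A/C" with separator "/".
        first, second = cell.split(separator)
        return first, second
    # csv layout without a separator, e.g. "AC".
    if len(cell) != 2:
        raise ValueError(f"cannot split genotype {cell!r} without a separator")
    return cell[0], cell[1]

# Same genotype expressed in the three layouts.
print(split_alleles(["rs1", "subj1", "A/C"], [3], "/"))
print(split_alleles(["rs1", "subj1", "AC"], [3]))
print(split_alleles(["rs1", "subj1", "A", "C"], [3, 4]))
```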
Second Step: Running GA on the prepared case-control files
Once the P.bin, P-names.txt and P-snps.txt files have been generated for both the case and control groups, the GA can be run using the -run command and its relevant arguments as follows:
- The local path of the stored P.bin file containing genotypes of the case group.
- The local path of the stored P.bin file containing genotypes of the control group.
- The local path of the file containing the names of the markers (e.g. SNPs) included in the study.
- Method used for numerically encoding genotypes (e.g. AA, AB, BB). It takes a three-digit number between 000 and 999 (e.g. 012), where the first digit (here 0) specifies the numeric value of the AA genotype, the second digit (here 1) the numeric value of the AB genotype, and the third digit (here 2) the numeric value of the BB genotype (default = 012).
- The GA class used in the analysis. Each of the predefined GA classes provides a different way to determine the score-variable and to measure the fitness. Use the -list-ga command to get a list of all available GA classes.
- The maximum number of generations the GA should run for if the desired level of fitness is not reached in any generation (default = 1000).
- Population size or number of chromosomes in each generation of GA (default = 500).
- Number of bits used for giving weight to each marker in the chromosomes (default = 1).
- Number of markers having non-zero weight when GA is initialized (default = 20).
- Overall mutation rate at which the weight of individual markers may be mutated in each generation (default = 0.05). Its valid range is between 0 and 1.
- Zero-to-one mutation rate, i.e. the rate at which zero weights of individual markers may be mutated to non-zero values in each generation (default = 1). Its valid range is between 0 and 1.
- Non-zero-to-zero mutation rate, i.e. the rate at which non-zero weights of individual markers may be mutated to zero in each generation (default = 1). Its valid range is between 0 and 1.
- Desired fitness value. If it is reached before the last generation, the GA stops (default = 10).
- Number of rounds of the randomization test performed by the GA on the case and control datasets using the final successful model (default = 0).
- Filename to write final model to. If specified, the program will write the results out to this file in tab-delimited format.
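How the weight bits and the three mutation rates listed above might interact can be sketched in Python. This is one possible reading, not the documented KBAS behaviour. It assumes that b bits allow weights from 0 to 2^b - 1, that the overall rate decides whether a marker is considered for mutation at all, and that the two directional rates then decide whether a zero weight is switched on or a non-zero weight is switched off. The function mutate_weights and its parameter names are hypothetical.

```python
import random

def mutate_weights(weights, bits=1, rate=0.05,
                   zero_to_nonzero=1.0, nonzero_to_zero=1.0, rng=random):
    """Mutate a list of marker weights (assumed semantics, see lead-in)."""
    max_weight = 2 ** bits - 1  # assumption: b bits encode weights 0..2^b - 1
    mutated = []
    for w in weights:
        if rng.random() >= rate:
            # Marker not selected for mutation in this generation.
            mutated.append(w)
            continue
        if w == 0:
            # Switch an unused marker on with the zero-to-non-zero rate.
            if rng.random() < zero_to_nonzero:
                w = rng.randint(1, max_weight)
        elif rng.random() < nonzero_to_zero:
            # Switch a used marker off with the non-zero-to-zero rate.
            w = 0
        else:
            # Otherwise redraw a non-zero weight (an assumption).
            w = rng.randint(1, max_weight)
        mutated.append(w)
    return mutated

# With every rate set to 1, zero weights switch on and non-zero weights switch off.
print(mutate_weights([0, 3, 0, 5], bits=3, rate=1.0))
```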
./kbas -run \
-case databases/case/case.bin \
-ctrl databases/ctrl/ctrl.bin \
-markers markers/SNP-set-1.csv \
-ga-enc "012" \
-ga-class 1 \
-ga-gen 500 \
-ga-pop 200 \
-ga-bits 3 \
-ga-num 50 \
-ga-thld 20 \
-ga-rnd 100000
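The three-digit genotype encoding passed to -ga-enc can be illustrated with a small Python sketch. The helper names genotype_encoder and weighted_score are hypothetical, and the weighted sum is only an assumed example of a score variable: in KBAS the score variable is determined by the chosen GA class.

```python
def genotype_encoder(code="012"):
    """Map AA, AB and BB to the three digits of an encoding such as '012'."""
    if len(code) != 3 or any(c not in "0123456789" for c in code):
        raise ValueError("encoding must be three digits, e.g. '012'")
    return dict(zip(("AA", "AB", "BB"), (int(c) for c in code)))

def weighted_score(genotypes, weights, code="012"):
    """Assumed score: sum of marker weight times encoded genotype."""
    encoding = genotype_encoder(code)
    return sum(weights[marker] * encoding[genotype]
               for marker, genotype in genotypes.items()
               if marker in weights)

# Additive coding (012) versus a coding that treats AB and BB alike (011).
subject = {"rs1": "AB", "rs2": "BB", "rs3": "AA"}
model = {"rs1": 2, "rs2": 1}
print(genotype_encoder("012"))
print(weighted_score(subject, model, "012"))
print(weighted_score(subject, model, "011"))
```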
Please use the -help convert or -help run commands to get a detailed description of all the options available for the -convert and -run commands respectively.
The output consists of a summary of the GA inputs and parameters, followed by the contents of the successful model, represented as two columns containing the markers (e.g. SNP names) and their corresponding weights, respectively. If the -ga-out argument is provided, the output is also written to the specified file in machine-readable format. If a randomization test was requested, its final p-value is included in the output as well.
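This page does not describe how the randomization test derives its p-value. As a rough illustration only, a generic label-permutation test can be sketched in Python: the case and control labels are shuffled repeatedly, and the p-value is the fraction of shuffles whose group difference is at least as large as the observed one. The function permutation_p_value, the use of per-subject scores and the mean-difference statistic are all assumptions, not the KBAS procedure.

```python
import random

def permutation_p_value(case_scores, ctrl_scores, rounds=1000, seed=0):
    """Generic permutation test on the absolute difference of group means."""
    rng = random.Random(seed)

    def mean_difference(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_difference(case_scores, ctrl_scores)
    pooled = list(case_scores) + list(ctrl_scores)
    n_case = len(case_scores)
    hits = 0
    for _ in range(rounds):
        # Shuffle subjects across the two groups, keeping group sizes fixed.
        rng.shuffle(pooled)
        if mean_difference(pooled[:n_case], pooled[n_case:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)

# Hypothetical per-subject model scores for cases and controls.
p_value = permutation_p_value([5, 6, 7, 8], [1, 2, 3, 4], rounds=200)
print(p_value)
```

More rounds give a finer-grained estimate, since the smallest p-value this scheme can report is 1 / (rounds + 1).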
Nazarian A, Sichtig H, Riva A. A knowledge-based method for association studies on complex diseases. PLoS One. 2012;7(9):e44162.