FAQ - common questions
Q: Which systems does MOCAT run on?
A: UNIX and OSX, using either no queuing system or SGE or PBS. Hardware requirements vary depending on the size of your metagenome. To assemble the HMP mock community provided with MOCAT requires 5GB of RAM, and assembly of the simulated metagenome requires 13GB of RAM.
Q: In the article it says MOCAT only runs on UNIX, but you said it also runs on OSX?
A: It runs on OSX with some limitations. 1. We cannot offer support because we don't have OSX systems available to test things on. 2. SOAPaligner 2.20 is used instead of 2.21. 3. Filtering and calculating coverage for mapping mode 'allbest' is not supported.
Q: How do I install the package?
A: Download it, extract it from a Terminal using 'tar -xvf MOCAT.tar' and run the 'install_MOCAT.pl' script - DONE!
Q: After install, how do I run it?
A: During install, MOCAT adds its executable path in the .bashrc and .bash_profile files. This means, after restarting your system, you can run MOCAT from any folder by executing 'MOCAT.pl'
Q: How do I setup a new project?
A: See the article_datasets folder in the directory where you installed MOCAT. All you need is a MOCAT.cfg file (copy the one from the installation folder), a sample file with sample names and one folder for each sample containing the fastq single or paired-end files.
Q: Do I need to download additional software or databases?
A: To run the standard options, no. When installing MOCAT you can choose to download included databases as well as example datasets. If you want to screen custom FASTA-files or use
MetaGeneMark to predict genes, these two software require additional download because of their licences. Download Usearch or MetaGeneMark. NOTE: MOCAT v1.1 ONLY SUPPORT USEARCH VERSION 5, AND NOT USERACH VERSION 6. MOCAT v1.2 and later supports Usearch v6.
Q: MOCAT_mapping_mode: allbest[allbest, random, unique]. It’s said corresponds to SOAPaligner2’s ‘r’ option, what does it mean?
A: Allbest means that if a read maps to multiple locations, all are kept. Random, that one of these are kept at random, and unique means that only reads mapping to unique locations are kept. Had you run SOAPaligner manually, this would be set with the –r option in the software.
Q: assembly_db_calc_insertsize:1506MG, it’s said “the database “1506MG” is provided, which consists of 40 marker genes from 1506 reference genomes.” I don’t know what the database is. What kind of genomes are in the reference genomes? Has the database already downloaded during MOCAT installation?
A: The database is downloaded automatically. The database contains a set of single copy marker genes found in all bacteria, from 1506 bacterial species, all complete genomes from NCBI at the time of creation of the DB. In general, if your sample, is large enough using “mapping”, rather than the other option ”assembly” will work fine.
Q: readtrimfilter_length_cutoff:45, readtrimfilter_qual_cutoff:20. Does it mean the raw reads are first removed if they are <45, then the remaining reads are trimmed at the quality cutoff of 20? So the output of reads may be shorter than 45?
A: Reads are first quality trimmed form the 3’ and 5’ end, then any read shorter than the length cut off are removed. Thus NO read will be shorted than the length cut off.
Q: For the sample data that only has single-end reads (eg, the raw data is SRR172902.fq), why does output for read trim filter step have the files like pair-end reads like SRR172902.pair.1.fq.gz SRR172902.pair.2.fq.gz?
A: MOCAT is designed for paired-end reads in mind, but support for single end reads is fully developed as well. The paired files are empty files and created because MOCAT processing steps expects these files to exist. But since they’re empty it does not affect the analysis.
Q: For the screen database step, the output has three folders: screened(kept), extracted, and mapped(match the database). I’m not very clear what the reads in each folder stand for. In the paper, it’s said that “reads from a human fecal metagenomic sample can be mapped to hg19 to remove reads of human origin using SOAPAligner2”. Does it mean also in the simulated_metagenome dataset, if the read match to the 100_species_references, it will be removed and only the reads that in the screened folder are used for the following assembly and gene prediction steps? What’s extracted reads? Are they the reads that removed in screen step? What’s the difference between the extracted and mapped reads?
A: You can also map reads from the simulated metagenome against the hg19 database. In the examples it’s not done, because it doesn’t make sense, because there are no human reads in the simulated metagenome, because it’s made in silico from bacterial genomes. The mapped folder has the actual bam or soap files. The output from the aligner. The extracted folder contains those reads as in the mapped folder, but in fastq format. The screened folder contains the original reads you used to map against a database minus the reads in the extracted folder. Which means, if we first map against hg19, we want to use the screened reads in the next step, ie the reads that didn’t match the hg19 database.
Q: I don’t fully understand what FILTER step is used for. In the MOCAT.cfg file, it has a section of FILTER. In the manual, it’s said “Additional filtering of reads that have been mapped using the s|screen command. Percentage ID and length cut off are defined in the config file. With this option you can map reads at a lower cutoff and then filter them at different (higher) cutoffs. It is required to run this step before running calculate coverage below, even though you do not want to filter at a higher cut off. If you do not want to filter at a higher cut off, set the values in the config file to the same ones as for the screen step.”
A: We have chosen to do mapping against database in two steps for a number of reasons, as said in the manual, the main reason is that you can map at eg 90% and then filter on both 95% and eg 97%, this is something we do often. Yes, you have to run filtering before calculate coverage, even though you don’t want to filter at higher percentage.
Q: Why are the number of bases of the reads that mapped (match) the bacteria genome in the 100_species_references database not integers?
A: Because some reads map to multiple locations, and then these are split so that the sum of each read is 1, meaning a sequence may have a non integer as it’s coverage.
Q: How do you calculate the normalized base and insert counts? What does n/a mean in the .norm file?
A: The normalized base and read counts are the base and read counts divided by the sequence length. n/a is because we cannot calculate the -1 fraction for normalized counts, because we don’t have the length of sequences that are not in the dataset.
Q: Is each row in this file corresponding to the base.coverage.count/norm file? What does “-1” mean in this file?
A: -1 is the read and base counts that were not mapped to any sequence. For the simulated metagenome the -1 fraction are the 80 reads that didn’t match the DB. In reality, the -1 fraction is usually much higher depending on what type of data you map to which type of DB.
Q: What does the end in each row, like “|_1_153766” mean, in the unique headers file?
A: gi|114797051|ref|NC_008358.1| is the name of the sequences. _1_XXX is the first base and the last base that the coverages are counted on. Usually this is always from 1 to the total length of the sequence. But if we only want to calculate the coverage of a portion of a sequence, this could (for advanced users) be edited in the DB.coord file located in the MOCAT data folder.