Mobster documentation
Mobster is built using maven, and assumes that this is in your path. To build mobster run "install.sh" this will package the required classes into a fat executable jar in the "target" directory. Typically called MobileInsertions-.jar. For example MobileInsertions-0.2.2.jar.
Installation
cd <dir containing mobster
./install.sh
Install MOSAIK manually (we have tested Mobster with MOSAIK v 2.1.33)
Try the binaries directly or try building from source
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/mosaik-aligner/MOSAIK-2.1.33-Linux-x64.tar
or
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/mosaik-aligner/MOSAIK-2.1.33-source.tar
Or try installing via bioconda or visiting
MOSAIC at github
Don't forget to update Mobster.properties to include the location for MOSAIK.
Finally add the Mosaik binaries like MosaikBuild to your $PATH, you can do this by:
cd
vim .bash_profile #or .profile
#Add the directory where the mosaik binaries are located to the PATH= line.
#or if you do not want this, change the MOBIOME_MAPPING_CMD in the ./Mobster/jars/Mobster.properties file to point to the location of both MosaikBuild and MosaikAlign
You are ready to go!
Test mobster by:
cd <dir containing mobster>/jars/
java -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties
After a few minutes there will be almost 3000 MEI predictions in
./Mobster/test-out/
Usage
Before use, please check the properties file whether everything is set up correctly for your input BAM file (see below).
Single sample calling using 8Gb of RAM and current working directory where MobileInsertions-<version>.jar is located:
java -Xmx8G -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties -in input.bam -sn test_sample -out mobster_test
Multiple sample calling using 8Gb of RAM and current working directory where MobileInsertions-<version>.jar is located:
java -Xmx8G -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties -in A1_child.bam,A1_father.bam,A1_mother.bam -sn A1_child,A1_father,A1_mother -out A1_trio_mobster
Make sure that the bam files provided to the -in argument match the sample names provided to the -sn argument. BAM files and sample names should be comma separated.
In addition Mobster can also be run on single BAM files which contain multiple samples as long as they have proper readgroup tags with a provided sample name per tag.
Properties file
The
most important property to set up correctly is
MAPPING_TOOL. Make sure this property is set to
unspecified
when your input BAM file was mapped using an other mapper than
bwa aln
or
mosaik
.
Even when
bwa mem
was used, please use
unspecified
.
unspecified
should work for most mapping tools and can even be used for
bwa aln
or
mosaik
aligned BAM files. Since Mobster 0.2.0 the default value for MAPPING_TOOL is
unspecified
.
unspecified
is unsupported in 0.1.6.
Other properties
worth checking or setting are
READ_LENGTH
,
MULTIPLE_SAMPLE_CALLING
and
MULTIPLE_SAMPLE_CALLING_STRINGENT
(this value should normally be set to true to prevent high numbers of false positives)
,
READS_PER_CLUSTER
(recommended value = 1),
MINIMUM_MAPQ_ANCHOR
(in case
MAPPING_TOOL
is set to unspecified) and
REPEATMASK_FILE
(make sure the file is for the correct hg build).
For more detailed explanation of
all properties, see the table below. Properties are categorized as follows:
| Properties relating to sequencing library such as mean fragment length |
| Properties relating to input/output and processing |
| Properties relating to evidence needed for MEI calling |
| Properties relating to clustering of MEI supporting reads |
Category | Property | Value type | Description |
| USE_PICARD | {true, false} | Whether or not to use Picard to estimate the Insert Size metrics. This aids the determination of prediction borders, especially for events with only support from one side. |
| PICARD_COLLECT_INSERT_SIZE_METRICS_JAR | String (Location) | Location of Picard's CollectInsertSizeMetrics.jar. Will only be used when USE_PICARD is set to true. |
| MEAN_FRAGMENT_LENGTH | Integer | Mean fragment length of sequencing library. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value. |
| SD_FRAGMENT_LENGTH | Integer | Standard deviation of fragment length of sequencing library. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value. |
| LENGTH_99PROCENT_OF_FRAGMENTS | Integer | 99 Percent of fragment lengths of sequencing library should be <= than this value. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value. |
| IN_FILE | String (Location) | Input BAM file. Note that Input BAM file can also be specified on the command line. When specified on command line this value will be ignored. |
| OUT_FILE | String (Location) | Output prefix. Note that if output prefix is specified on the command line, that this value will be ignored. |
| USE_READ_LENGTH | {true, false} | Whether or not to use read length to slightly improve the prediction windows of single cluster predictions. |
| READ_LENGTH | Integer | The read length. Will only be used if USE_READ_LENGTH is set to true. |
| MULTIPLE_SAMPLE_CALLING | {true, false} | Whether to turn on multiple sample calling. Can be used when multiple bam files are specified as input or when one bam file contains multiple samples as specified through the sample name field in the @RG tag. If multiple bam files are specified as input, multiple sample calling will be turned on, even if the value of this property was set to false |
| MULTIPLE_SAMPLE_CALLING_STRINGENT | {true, false} | Stringent mode: at least one sample needs to have >= MINIMUM_SUPPORTING_READS. When false the total of supporting reads for all samples should be >= MINIMUM_SUPPORTING_READS |
| MAPPING_TOOL | {unspecified, bwa, mosaik} | The mapper which was used for the INPUT bam file. Chose unspecified when you are unsure which mapper was used. NOTE: use bwa only if the reads were aligned with bwa aln. When bwa mem was used, mapper should be unspecified. |
| MINIMUM_MAPQ_ANCHOR | Integer [def: 20] | MINIMUM_MAPQ_ANCHOR will only be used when MAPPING_TOOL is set to unspecified. It specifies at what MAPQ a read can be considered as an anchor for a potential MEI. |
| USE_SPLIT | {true, false} | Whether mobster should try and search for clipped reads |
| MINIMUM_CLIP_LENGTH | Integer | When searching for clipped reads, how long should the unmapped sequence at least be (bp) |
| MAXIMUM_CLIP_LENGTH | Integer | If a clipped read is found with MINIMUM_CLIP_LENGTH, how long may the other end of the same read be MAXIMUM clipped? |
| MINIMUM_AVG_QUALITY | Integer | If a clipped read is found with MINIMUM_CLIP_LENGTH, what should the minimum average base quality be of that read? |
| MINIMUM_POLYA_LENGTH | Integer | How long should the poly A/T stretch minimally be in the clipped part of a read for it to be classified as a polyA/T tail. |
| MAXIMUM_MISMATCHES_POLYA | Integer | How many times may there be a different nucleotide than A for a polyA stretch to be still classified as polyA tail or different from T for a polyT tail? |
| MAX_RECORDS_IN_RAM | Integer | Number of records which are held in memory. Increase this number if you have more RAM available. According to Picard's sourceforge a rule of thumb for reads of ~100bp is to set MAX_RECORDS_IN_RAM to be 250,000 reads per each GB given to the -Xmx parameter. This should also roughly apply to Mobster. |
| MOBIOME_MAPPING_CMD | String | The command which will be used for mobiome mapping. When using mosaik as mobiome mapper the -p parameter is interesting to speed up the mapping step. -p indicates the number of processors used for mapping. |
| SAMPLENAME | String | Name of sample, can also be provided over the command line. Command line value overrides the property value. |
| MOBIOME_MAPPING_TOOL | {mosaik, bwa} | Mapping tool used for mobiome mapping. When changing this property you should modify the MOBIOME_MAPPING_CMD accordingly. |
| PAIRED_END | {true, false} | Deprecated property. Should always be true, single-end reads are not supported |
| READS_PER_CLUSTER | Integer | For a discordant cluster supporting a MEI event (either forward or reverse), how many reads should there at least be in the cluster to be kept. |
| DISCORDANT_CLUSTERS_MAX_OVERLAP | Integer | For discordant clusters coming from both the 5' and 3' side of the same insertion, how many bp may the clusters overlap to join the clusters? (To allow for target site duplications). |
| DISCORDANT_CLUSTER_MAX_DISTANCE | Integer | How many bp may the clusters be away from each other to still join the clusters? |
| MAX_SPACING_OF_CLIPPED_READS | Integer | For clipped reads coming from one side of the insertion: how much spacing in bp may there be in between the clipped alignment ends in order to join these reads into one clipped cluster? |
| MAX_OVERLAP_OF_CLIPPED_CLUSTERS | Integer | For clusters of clipped reads coming from both the 5' and 3' side of the insertion: how many bp may they overlap in order to be joined? |
| MAX_DISTANCE_OF_CLIPPED_CLUSTERS | Integer | For clusters of clipped reads coming from both the 5' and 3' side of the insertion: how many bp may they be distanced in order to be joined? |
| NEIGHBORHOOD_WINDOW_BP | Integer | How far away may discordant reads be in order to be joined into one cluster? |
| MINIMUM_SUPPORTING_READS | Integer | For a prediction to be kept: how many supporting reads should there be in total for the event (total of discordant supporting reads + clipped supporting reads). |
| MINIMUM_INITIAL_SPLIT_CLUSTER_READS | Integer | How many reads should there minimally be in a clipped cluster. |
| REPEATMASK_FILE | String (Location) | Location of the repeat mask file, in order to filter predictions which are indicative of a reference MEI. |
| TEMP_DIR | String (Location) | Directory to write temporary files to. Only needed if default tmp directory is too small. |
Native output file format
MEI predictions will be outputted to
*_predictions.txt
which is a tab-delimited format using the following columns (the .txt file can be converted to VCF using
MobsterToVCF):
Column | Description |
Chr | Reference chromosome in which predicted MEI has taken place |
Mobile Element | Predicted mobile superfamily (ALU, L1, SVA, HERVK) |
Insert Point | Predicted insertion coordinate |
border5 | Leftmost coordinate for 'prediction window' |
border3 | Rightmost coordinate for 'prediction window'. The predicted MEI has most likely been inserted somewhere within the prediction window. |
Merged | Whether in a second iteration round clusters have been merged for this prediction. |
Sample | Comma-seperated list of samples in which the event was called. |
sample_counts | Number of supporting reads per sample (should be the sum of cluster5 and cluster3 hits) |
cluster5 length | (End position of rightmost supporting discordant pair anchor from anchors on the 5' of the insertion point/mapped on forward strand) - (Start position of leftmost supporting discordant pair anchor from anchors on the 5' of the insertion point/mapped on forward strand)) . A discordant pair anchor consists of an anchor which maps uniquely to the reference genome and a mate which maps to the mobiome. (Set to NA if there is no cluster5) |
cluster3 length | (End position of rightmost supporting discordant pair anchor from anchors on the 3' of the insertion point/mapped on reverse strand)) - (Start position of leftmost supporting discordant pair anchor from anchors on the 3' of the insertion point/mapped on reverse strand) . (Set to NA if there is no cluster3) |
cluster5 hits | Number of supporting anchors on the 5' of insertion point (Set to NA if there are no cluster5 hits). These hits include both the discordant and split supporting reads (split5). |
cluster3 hits | Number of supporting anchors on 3' of insertion point (Set to NA if there are no cluster3 hits). These hits include both the discordant and split supporting reads (split3). |
split5 hits | Number of split reads on 5' of insertion point, clipped on the right side. Clipped part should be mappable to mobiome or polyA/polyT. |
split3 hits | Number of split reads on the 3' of insertion point, clipped on the left side. Clipped part should be mappable to mobiome or polyA/polyT. |
polyA5 hits | Number of split5 reads with a polyA |
polyT5 hits | Number of split5 reads with a polyT |
polyA3 hits | Number of split3 reads with a polyA |
polyT3 hits | Number of split3 reads with a polyT |
original discordant unique | Of the supporting discordant pair anchors: how many anchors were in pairs mapped both to a unique site in the reference genome. Being discordant because of mapping orientation or insert size. |
original multiple | Of the supporting discordant pair anchors: how many anchors were in pairs in which the mate of the anchor mapped to multiple sites in the reference genome. |
original unmapped | Of the supporting discordant pair anchors: how many anchors were in pairs in which the mate of the anchor did not map to the reference genome. |
leftclipped max dist | The maximum distance between clipping positions of split3 reads (based on mapping) |
rightclipped max dist | The maximum distance between clipping positions split5 reads (based on mapping) |
leftclipped same pos | Fraction split3 reads with same clipping position (based on mapping) |
rightclipped same pos | Fraction of split5 reads with same clipping position (based on mapping) |
clipped avg qual | The mean base quality of the clipped subsequences. (-1 if there are no split reads) |
clipped avg length | The mean length of the clipped subsequences. (-1 if there are no split reads) |
target site duplication | Is there evidence for a target site duplication or deletion? |
Converting to VCF
You can convert the
*_predictions.txt
file to VCF using
MobsterToVCF.
java -jar MobsterVCF-<version>.jar -file <input_file> -out <output_vcf>
Changing the mobilome: MEI detection in other organisms
Mobster has been designed to detect MEIs in human samples and has also only been tested for this purpose. However it is possible to
use Mobster for detection of MEIs in other organisms or even use Mobster to detect viral reintegrations. This can be done
by changing the mobilome from active mobile element consensus sequences specific to human to the consensus sequences of your interest.
Note that Mobster has not been tested very well on other organisms or viral reintegrations. Therefore the accuracy of Mobster may be
greatly reduced.
If you are interested in using Mobster for other purposes you can follow the below instructions:
- Download the sequences you are interested in finding, in fasta format. If you need mobile element sequences, this is a good resource: Repbase
- Reformat the fasta header (example below is for a plant):
Original fasta header for two helitron repeats may be:
>HELITRONY3 Helitron Arabidopsis thaliana
>HELITRONY2 Helitron Arabidopsis thaliana
Rename the fasta headers to:
>HELITRONY3.ME_CAT.Helitron
>HELITRONY2.ME_CAT.Helitron
The headers will always need to be formatted as ><sub family>.ME_CAT.<superfamily>
.
Clustering of reads will then be done on the basis of <superfamily>
, superfamily can be an arbitrary name. If you want HELITRONY3 and HELITRONY2 reads not to cluster together then
rename to HELITRONY3.ME_CAT.HELITRONY3
and HELITRONY2.ME_CAT.HELITRONY2
- Build the mobilome database using Mosaik:
MosaikBuild -fr <input fasta consensus sequences> -oa <output_database>.dat
MosaikJump -ia <output_database>.dat -out <output_jump_prefix> -hs 9
NOTE: increase the -hs (hash parameter) when using a lot of sequences to reduce mapping time (e.g. to 13 or so).
- Edit the Mobster properties file:
Now in the properties file which you will supply to MobileInsertions-<version>.jar
edit the MOBIOME_MAPPING_CMD
to match your newly formed .dat and and jump file
In addition modify the property REPEATMASK_FILE
to point to a dummy repeat mask file, i.e. a repeatmask file containing only 1 repeat mask entry on a coordinate not relevant for your organism.
Unfortunately the way Mobster is designed, it can not filter known reference MEIs for other organisms than human and will crash without a repeatmask file. This is why you should supply a dummy repeatmask file.
After calling, it is recommend to filter for known reference MEIs, see bullet point below.
- Post-filtering of known MEIs in reference genome of organims of interest
Download a list of known reference MEIs for your organism from example the UCSC Genome Browser.
Now filter out known reference MEIs using your own scripts or for instance converting the MEI calls to VCF using our tool MobsterToVCF and filter with bedtools.
Changing the genome build
By default Mobster is configured to run with HG19, but it can be configured to run for different builds of the Human Genome.
You can change the genome build by modifying the property
REPEATMASK_FILE
in
Mobster.properties
to the location of the hg38 repeast mask file by
REPEATMASK_FILE=../resources/repmask/alu_l1_herv_sva_other_grch38_ucsc_repeatmasker.txt