Mobster documentation

Mobster is built using maven, and assumes that this is in your path. To build mobster run "install.sh" this will package the required classes into a fat executable jar in the "target" directory. Typically called MobileInsertions-.jar. For example MobileInsertions-0.2.2.jar.

Installation

cd <dir containing mobster
./install.sh

Install MOSAIK manually (we have tested Mobster with MOSAIK v 2.1.33) Try the binaries directly or try building from source

wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/mosaik-aligner/MOSAIK-2.1.33-Linux-x64.tar

wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/mosaik-aligner/MOSAIK-2.1.33-source.tar

Or try installing via bioconda or visiting MOSAIC at github
Don't forget to update Mobster.properties to include the location for MOSAIK.
Finally add the Mosaik binaries like MosaikBuild to your $PATH, you can do this by:


cd

vim .bash_profile #or .profile

#Add the directory where the mosaik binaries are located to the PATH= line.

#or if you do not want this, change the MOBIOME_MAPPING_CMD in the ./Mobster/jars/Mobster.properties file to point to the location of both MosaikBuild and MosaikAlign

You are ready to go!
Test mobster by:

cd <dir containing mobster>/jars/

java -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties

After a few minutes there will be almost 3000 MEI predictions in ./Mobster/test-out/

Usage

Before use, please check the properties file whether everything is set up correctly for your input BAM file (see below).

Single sample calling using 8Gb of RAM and current working directory where MobileInsertions-<version>.jar is located:

java -Xmx8G -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties -in input.bam -sn test_sample -out mobster_test

Multiple sample calling using 8Gb of RAM and current working directory where MobileInsertions-<version>.jar is located:

java -Xmx8G -cp "./*" org.umcn.me.pairedend.Mobster -properties Mobster.properties -in A1_child.bam,A1_father.bam,A1_mother.bam -sn A1_child,A1_father,A1_mother -out A1_trio_mobster

Make sure that the bam files provided to the -in argument match the sample names provided to the -sn argument. BAM files and sample names should be comma separated.
In addition Mobster can also be run on single BAM files which contain multiple samples as long as they have proper readgroup tags with a provided sample name per tag.

Properties file

The most important property to set up correctly is MAPPING_TOOL. Make sure this property is set to unspecified when your input BAM file was mapped using an other mapper than bwa aln or mosaik. Even when bwa mem was used, please use unspecified. unspecified should work for most mapping tools and can even be used for bwa aln or mosaik aligned BAM files. Since Mobster 0.2.0 the default value for MAPPING_TOOL is unspecified. unspecified is unsupported in 0.1.6.

Other properties worth checking or setting are READ_LENGTH, MULTIPLE_SAMPLE_CALLING and MULTIPLE_SAMPLE_CALLING_STRINGENT (this value should normally be set to true to prevent high numbers of false positives) ,READS_PER_CLUSTER (recommended value = 1), MINIMUM_MAPQ_ANCHOR (in case MAPPING_TOOL is set to unspecified) and REPEATMASK_FILE (make sure the file is for the correct hg build).

For more detailed explanation of all properties, see the table below. Properties are categorized as follows:

	Properties relating to sequencing library such as mean fragment length
	Properties relating to input/output and processing
	Properties relating to evidence needed for MEI calling
	Properties relating to clustering of MEI supporting reads

Property	Value type	Description
USE_PICARD	{true, false}	Whether or not to use Picard to estimate the Insert Size metrics. This aids the determination of prediction borders, especially for events with only support from one side.
PICARD_COLLECT_INSERT_SIZE_METRICS_JAR	String (Location)	Location of Picard's CollectInsertSizeMetrics.jar. Will only be used when USE_PICARD is set to true.
MEAN_FRAGMENT_LENGTH	Integer	Mean fragment length of sequencing library. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value.
SD_FRAGMENT_LENGTH	Integer	Standard deviation of fragment length of sequencing library. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value.
LENGTH_99PROCENT_OF_FRAGMENTS	Integer	99 Percent of fragment lengths of sequencing library should be <= than this value. Not necessary to provide when USE_PICARD is set to true as Picard will estimate this value.
IN_FILE	String (Location)	Input BAM file. Note that Input BAM file can also be specified on the command line. When specified on command line this value will be ignored.
OUT_FILE	String (Location)	Output prefix. Note that if output prefix is specified on the command line, that this value will be ignored.
USE_READ_LENGTH	{true, false}	Whether or not to use read length to slightly improve the prediction windows of single cluster predictions.
READ_LENGTH	Integer	The read length. Will only be used if USE_READ_LENGTH is set to true.
MULTIPLE_SAMPLE_CALLING	{true, false}	Whether to turn on multiple sample calling. Can be used when multiple bam files are specified as input or when one bam file contains multiple samples as specified through the sample name field in the @RG tag. If multiple bam files are specified as input, multiple sample calling will be turned on, even if the value of this property was set to false
MULTIPLE_SAMPLE_CALLING_STRINGENT	{true, false}	Stringent mode: at least one sample needs to have >= MINIMUM_SUPPORTING_READS. When false the total of supporting reads for all samples should be >= MINIMUM_SUPPORTING_READS
MAPPING_TOOL	{unspecified, bwa, mosaik}	The mapper which was used for the INPUT bam file. Chose unspecified when you are unsure which mapper was used. NOTE: use bwa only if the reads were aligned with bwa aln. When bwa mem was used, mapper should be unspecified.
MINIMUM_MAPQ_ANCHOR	Integer [def: 20]	MINIMUM_MAPQ_ANCHOR will only be used when MAPPING_TOOL is set to unspecified. It specifies at what MAPQ a read can be considered as an anchor for a potential MEI.
USE_SPLIT	{true, false}	Whether mobster should try and search for clipped reads
MINIMUM_CLIP_LENGTH	Integer	When searching for clipped reads, how long should the unmapped sequence at least be (bp)
MAXIMUM_CLIP_LENGTH	Integer	If a clipped read is found with MINIMUM_CLIP_LENGTH, how long may the other end of the same read be MAXIMUM clipped?
MINIMUM_AVG_QUALITY	Integer	If a clipped read is found with MINIMUM_CLIP_LENGTH, what should the minimum average base quality be of that read?
MINIMUM_POLYA_LENGTH	Integer	How long should the poly A/T stretch minimally be in the clipped part of a read for it to be classified as a polyA/T tail.
MAXIMUM_MISMATCHES_POLYA	Integer	How many times may there be a different nucleotide than A for a polyA stretch to be still classified as polyA tail or different from T for a polyT tail?
MAX_RECORDS_IN_RAM	Integer	Number of records which are held in memory. Increase this number if you have more RAM available. According to Picard's sourceforge a rule of thumb for reads of ~100bp is to set MAX_RECORDS_IN_RAM to be 250,000 reads per each GB given to the -Xmx parameter. This should also roughly apply to Mobster.
MOBIOME_MAPPING_CMD	String	The command which will be used for mobiome mapping. When using mosaik as mobiome mapper the -p parameter is interesting to speed up the mapping step. -p indicates the number of processors used for mapping.
SAMPLENAME	String	Name of sample, can also be provided over the command line. Command line value overrides the property value.
MOBIOME_MAPPING_TOOL	{mosaik, bwa}	Mapping tool used for mobiome mapping. When changing this property you should modify the MOBIOME_MAPPING_CMD accordingly.
PAIRED_END	{true, false}	Deprecated property. Should always be true, single-end reads are not supported
READS_PER_CLUSTER	Integer	For a discordant cluster supporting a MEI event (either forward or reverse), how many reads should there at least be in the cluster to be kept.
DISCORDANT_CLUSTERS_MAX_OVERLAP	Integer	For discordant clusters coming from both the 5' and 3' side of the same insertion, how many bp may the clusters overlap to join the clusters? (To allow for target site duplications).
DISCORDANT_CLUSTER_MAX_DISTANCE	Integer	How many bp may the clusters be away from each other to still join the clusters?
MAX_SPACING_OF_CLIPPED_READS	Integer	For clipped reads coming from one side of the insertion: how much spacing in bp may there be in between the clipped alignment ends in order to join these reads into one clipped cluster?
MAX_OVERLAP_OF_CLIPPED_CLUSTERS	Integer	For clusters of clipped reads coming from both the 5' and 3' side of the insertion: how many bp may they overlap in order to be joined?
MAX_DISTANCE_OF_CLIPPED_CLUSTERS	Integer	For clusters of clipped reads coming from both the 5' and 3' side of the insertion: how many bp may they be distanced in order to be joined?
NEIGHBORHOOD_WINDOW_BP	Integer	How far away may discordant reads be in order to be joined into one cluster?
MINIMUM_SUPPORTING_READS	Integer	For a prediction to be kept: how many supporting reads should there be in total for the event (total of discordant supporting reads + clipped supporting reads).
MINIMUM_INITIAL_SPLIT_CLUSTER_READS	Integer	How many reads should there minimally be in a clipped cluster.
REPEATMASK_FILE	String (Location)	Location of the repeat mask file, in order to filter predictions which are indicative of a reference MEI.
TEMP_DIR	String (Location)	Directory to write temporary files to. Only needed if default tmp directory is too small.

Native output file format

MEI predictions will be outputted to *_predictions.txt which is a tab-delimited format using the following columns (the .txt file can be converted to VCF using MobsterToVCF):

Column	Description
Chr	Reference chromosome in which predicted MEI has taken place
Mobile Element	Predicted mobile superfamily (ALU, L1, SVA, HERVK)
Insert Point	Predicted insertion coordinate
border5	Leftmost coordinate for 'prediction window'
border3	Rightmost coordinate for 'prediction window'. The predicted MEI has most likely been inserted somewhere within the prediction window.
Merged	Whether in a second iteration round clusters have been merged for this prediction.
Sample	Comma-seperated list of samples in which the event was called.
sample_counts	Number of supporting reads per sample (should be the sum of cluster5 and cluster3 hits)
cluster5 length	`(End position of rightmost supporting discordant pair anchor from anchors on the 5' of the insertion point/mapped on forward strand) - (Start position of leftmost supporting discordant pair anchor from anchors on the 5' of the insertion point/mapped on forward strand))`. A discordant pair anchor consists of an anchor which maps uniquely to the reference genome and a mate which maps to the mobiome. (Set to NA if there is no cluster5)
cluster3 length	`(End position of rightmost supporting discordant pair anchor from anchors on the 3' of the insertion point/mapped on reverse strand)) - (Start position of leftmost supporting discordant pair anchor from anchors on the 3' of the insertion point/mapped on reverse strand)`. (Set to NA if there is no cluster3)
cluster5 hits	Number of supporting anchors on the 5' of insertion point (Set to NA if there are no cluster5 hits). These hits include both the discordant and split supporting reads (split5).
cluster3 hits	Number of supporting anchors on 3' of insertion point (Set to NA if there are no cluster3 hits). These hits include both the discordant and split supporting reads (split3).
split5 hits	Number of split reads on 5' of insertion point, clipped on the right side. Clipped part should be mappable to mobiome or polyA/polyT.
split3 hits	Number of split reads on the 3' of insertion point, clipped on the left side. Clipped part should be mappable to mobiome or polyA/polyT.
polyA5 hits	Number of split5 reads with a polyA
polyT5 hits	Number of split5 reads with a polyT
polyA3 hits	Number of split3 reads with a polyA
polyT3 hits	Number of split3 reads with a polyT
original discordant unique	Of the supporting discordant pair anchors: how many anchors were in pairs mapped both to a unique site in the reference genome. Being discordant because of mapping orientation or insert size.
original multiple	Of the supporting discordant pair anchors: how many anchors were in pairs in which the mate of the anchor mapped to multiple sites in the reference genome.
original unmapped	Of the supporting discordant pair anchors: how many anchors were in pairs in which the mate of the anchor did not map to the reference genome.
leftclipped max dist	The maximum distance between clipping positions of split3 reads (based on mapping)
rightclipped max dist	The maximum distance between clipping positions split5 reads (based on mapping)
leftclipped same pos	Fraction split3 reads with same clipping position (based on mapping)
rightclipped same pos	Fraction of split5 reads with same clipping position (based on mapping)
clipped avg qual	The mean base quality of the clipped subsequences. (-1 if there are no split reads)
clipped avg length	The mean length of the clipped subsequences. (-1 if there are no split reads)
target site duplication	Is there evidence for a target site duplication or deletion?

Converting to VCF

You can convert the *_predictions.txt file to VCF using MobsterToVCF.
java -jar MobsterVCF-<version>.jar -file <input_file> -out <output_vcf>

Changing the mobilome: MEI detection in other organisms

Mobster has been designed to detect MEIs in human samples and has also only been tested for this purpose. However it is possible to use Mobster for detection of MEIs in other organisms or even use Mobster to detect viral reintegrations. This can be done by changing the mobilome from active mobile element consensus sequences specific to human to the consensus sequences of your interest. Note that Mobster has not been tested very well on other organisms or viral reintegrations. Therefore the accuracy of Mobster may be greatly reduced.

If you are interested in using Mobster for other purposes you can follow the below instructions:

Download the sequences you are interested in finding, in fasta format. If you need mobile element sequences, this is a good resource: Repbase
Reformat the fasta header (example below is for a plant):
Original fasta header for two helitron repeats may be:
>HELITRONY3 Helitron Arabidopsis thaliana >HELITRONY2 Helitron Arabidopsis thaliana
Rename the fasta headers to:
>HELITRONY3.ME_CAT.Helitron >HELITRONY2.ME_CAT.Helitron
The headers will always need to be formatted as ><sub family>.ME_CAT.<superfamily>. Clustering of reads will then be done on the basis of <superfamily>, superfamily can be an arbitrary name. If you want HELITRONY3 and HELITRONY2 reads not to cluster together then rename to HELITRONY3.ME_CAT.HELITRONY3 and HELITRONY2.ME_CAT.HELITRONY2
Build the mobilome database using Mosaik:
MosaikBuild -fr <input fasta consensus sequences> -oa <output_database>.dat

MosaikJump -ia <output_database>.dat -out <output_jump_prefix> -hs 9 NOTE: increase the -hs (hash parameter) when using a lot of sequences to reduce mapping time (e.g. to 13 or so).
Edit the Mobster properties file:
Now in the properties file which you will supply to MobileInsertions-<version>.jar edit the MOBIOME_MAPPING_CMD to match your newly formed .dat and and jump file

In addition modify the property REPEATMASK_FILE to point to a dummy repeat mask file, i.e. a repeatmask file containing only 1 repeat mask entry on a coordinate not relevant for your organism. Unfortunately the way Mobster is designed, it can not filter known reference MEIs for other organisms than human and will crash without a repeatmask file. This is why you should supply a dummy repeatmask file. After calling, it is recommend to filter for known reference MEIs, see bullet point below.
Post-filtering of known MEIs in reference genome of organims of interest
Download a list of known reference MEIs for your organism from example the UCSC Genome Browser. Now filter out known reference MEIs using your own scripts or for instance converting the MEI calls to VCF using our tool MobsterToVCF and filter with bedtools.

Changing the genome build

By default Mobster is configured to run with HG19, but it can be configured to run for different builds of the Human Genome. You can change the genome build by modifying the propertyREPEATMASK_FILE in Mobster.properties to the location of the hg38 repeast mask file by REPEATMASK_FILE=../resources/repmask/alu_l1_herv_sva_other_grch38_ucsc_repeatmasker.txt