Today I have tried for the first time to use the SVD quartet method for inferring species trees from Single Nucleotide Polymorphism (SNP) data. This also marks my first foray into the new PAUP test versions.
In case it is unclear what this is about, the situation is that you have several samples per species and genome-wide SNPs for each of them, and you want to infer the species tree, that is a phylogeny that has species instead of your samples as the OTUs, under the assumption of the coalescent model. This post thus ties in with my previous posts on species tree methods for multi-locus data and the Bayesian approach to species trees from SNPs.
And just like those it will not at all be about the theory but about how to do it in practice. For those who fancy dozens of pages of mathematical formulas, the developers of the method, Chifman & Kubatko, have provided the background on arxiv. For the moment, however, I just wanted to try it out.
The newest version of the SVD quartet method is implemented in PAUP, so head over there and download it. I am using Linux, so I have to use the command line version, but that is okay. The main work of setting up a nexus file cannot be avoided anyway.
The file needs to look like this:
#nexus
begin data;
dimensions ntax = 24 nchar = 3489 ;
[These values obviously depend on your dataset. The important
point here is that ntax actually needs to be 2x the number of samples
you have, so here 12 samples but 24 lines; we will see below why that is.]
format datatype = dna gap = - ;
matrix
yoursample1_1 AACA?ACCACA...
yoursample1_2 AACC?ACCACA...
yoursample2_1 CACCAACCACA...
yoursample2_2 CACCAACCACA...
...
[Every sample has got two lines indicating the two alleles in each case.
Often you will only have 0/2 for homozygous and 1 for heterozygous, in
which case it will be annoying to reformat the data. I have adopted a
Python script I originally wrote for transformation into STRUCTURE
format. It turns 0 into two As, 2 into two Cs, and 1 into A and C.
Not sure if the method would accept the standard data setting of PAUP
and 0s and 1s...]
;
end;
begin sets;
taxpartition yourspeciesset =
species1 : 1-8,
species2 : 9-12,
species3 : 13-17,
species4 : 18-20,
species5: 21-24 ;
[With this you assign the samples to species, in this case e.g. the samples
#5 and #6 to species #2 (again, each sample has two lines).]
end;
Now that all the preparations have been made, open a terminal, navigate to your PAUP folder, and start the program (in my case I had to set the executable flag first). In PAUP, read in the data file with execute yourfilename.
The quartets analysis command has the name svdquartets, but PAUP accepts abbreviations as long as they are unique. svdq ? will bring up its list of options. In my experiments so far, the most relevant observations have been that the defaults are not to do a species tree analysis and not to do a bootstrap. Given the point of this post, the former will have to be changed. In addition, the set of species needs to be a parameter:
svdq partition = yourspeciesset speciestree = yes
If the tree makes sense to me, I prefer saving it with taxon names in the tree block, just in case I need to copy it elsewhere later, so I use
savetrees format = altnex brlens = yes taxablk = no file = yourtreefilename.tre
My dataset is quite large, nearly 9,000 SNPs for 240 samples representing 20 species. Tree inference took only about five minutes on my desktop computer, so as expected it is very fast compared with SNAPP. Bootstrapping, however, accordingly takes hours.
My main problem now is that the tree I get does not make a lot of sense. The reason might be that I have a lot of missing data. When I am done bootstrapping I will try some simpler datasets and see if it works better with them.
Update 28 February 2016: Something is seriously off about the results, they just don't make any sense. Maybe a bug? Will try a Windows version of PAUP and see if that works better.
Update 26 March 2016: Yes, it seems as if the latest test versions PAUP have an issue with binary data, but David Swofford is working on it, and turning the 0/1 SNP data back into the bases should solve the issue.
In case it is unclear what this is about, the situation is that you have several samples per species and genome-wide SNPs for each of them, and you want to infer the species tree, that is a phylogeny that has species instead of your samples as the OTUs, under the assumption of the coalescent model. This post thus ties in with my previous posts on species tree methods for multi-locus data and the Bayesian approach to species trees from SNPs.
And just like those it will not at all be about the theory but about how to do it in practice. For those who fancy dozens of pages of mathematical formulas, the developers of the method, Chifman & Kubatko, have provided the background on arxiv. For the moment, however, I just wanted to try it out.
The newest version of the SVD quartet method is implemented in PAUP, so head over there and download it. I am using Linux, so I have to use the command line version, but that is okay. The main work of setting up a nexus file cannot be avoided anyway.
The file needs to look like this:
#nexus
begin data;
dimensions ntax = 24 nchar = 3489 ;
[These values obviously depend on your dataset. The important
point here is that ntax actually needs to be 2x the number of samples
you have, so here 12 samples but 24 lines; we will see below why that is.]
format datatype = dna gap = - ;
matrix
yoursample1_1 AACA?ACCACA...
yoursample1_2 AACC?ACCACA...
yoursample2_1 CACCAACCACA...
yoursample2_2 CACCAACCACA...
...
[Every sample has got two lines indicating the two alleles in each case.
Often you will only have 0/2 for homozygous and 1 for heterozygous, in
which case it will be annoying to reformat the data. I have adopted a
Python script I originally wrote for transformation into STRUCTURE
format. It turns 0 into two As, 2 into two Cs, and 1 into A and C.
Not sure if the method would accept the standard data setting of PAUP
and 0s and 1s...]
;
end;
begin sets;
taxpartition yourspeciesset =
species1 : 1-8,
species2 : 9-12,
species3 : 13-17,
species4 : 18-20,
species5: 21-24 ;
[With this you assign the samples to species, in this case e.g. the samples
#5 and #6 to species #2 (again, each sample has two lines).]
end;
Now that all the preparations have been made, open a terminal, navigate to your PAUP folder, and start the program (in my case I had to set the executable flag first). In PAUP, read in the data file with execute yourfilename.
The quartets analysis command has the name svdquartets, but PAUP accepts abbreviations as long as they are unique. svdq ? will bring up its list of options. In my experiments so far, the most relevant observations have been that the defaults are not to do a species tree analysis and not to do a bootstrap. Given the point of this post, the former will have to be changed. In addition, the set of species needs to be a parameter:
svdq partition = yourspeciesset speciestree = yes
If the tree makes sense to me, I prefer saving it with taxon names in the tree block, just in case I need to copy it elsewhere later, so I use
savetrees format = altnex brlens = yes taxablk = no file = yourtreefilename.tre
My dataset is quite large, nearly 9,000 SNPs for 240 samples representing 20 species. Tree inference took only about five minutes on my desktop computer, so as expected it is very fast compared with SNAPP. Bootstrapping, however, accordingly takes hours.
My main problem now is that the tree I get does not make a lot of sense. The reason might be that I have a lot of missing data. When I am done bootstrapping I will try some simpler datasets and see if it works better with them.
Update 28 February 2016: Something is seriously off about the results, they just don't make any sense. Maybe a bug? Will try a Windows version of PAUP and see if that works better.
Update 26 March 2016: Yes, it seems as if the latest test versions PAUP have an issue with binary data, but David Swofford is working on it, and turning the 0/1 SNP data back into the bases should solve the issue.
No comments:
Post a Comment