Do we always need to remap published aDNA data?

The second step when another ancient human DNA paper is published (the first one is obviously to read it) is usually to download the genomic data in order to use it for our own projects. Most papers (especially those concerned with Western Eurasian prehistory) try to include all previously published individuals, which is good for various reasons. The data comes in various shapes: unprocessed raw reads, trimmed/merged reads, BAM files of already mapped reads, or even genotype calls. The question is what kind of data to use. A recent discussion on whether to remap everything using our own pipeline inspired this post.

Various forms of f statistics are widely used in the population genetic analysis of aDNA data. Essentially, they measure correlations between allele frequencies in different populations, and the usual interpretation of correlated allele frequencies is some form of shared ancestry. There is a certain risk, however, that such correlations arise from technical artefacts in the handling of different data sets and individuals. The results of at least two papers from the last year may have been substantially affected by artefacts of this kind (Gallego Llorente et al. and Mondal et al.; see also here and here).
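As a quick reminder of the machinery (my own notation, loosely following the usual definitions; sign and argument-order conventions differ between implementations such as popstats and ADMIXTOOLS, so take this as a sketch of the idea rather than the exact formula any particular program reports): f4 is the average product of allele-frequency differences across SNPs, and with a single (pseudohaploid) sequence per population the D statistic reduces to a ratio of counts of the two discordant site patterns, commonly called ABBA and BABA.

```latex
% f4: average product of allele-frequency differences over n SNPs
f_4(A,B;C,D) = \frac{1}{n}\sum_{i=1}^{n}\,(p_{A,i}-p_{B,i})(p_{C,i}-p_{D,i})

% D is a normalised relative; with one (pseudohaploid) sequence per
% population it reduces, up to sign and ordering conventions, to
D = \frac{n_{\mathrm{ABBA}} - n_{\mathrm{BABA}}}{n_{\mathrm{ABBA}} + n_{\mathrm{BABA}}}
```

Under a tree-like relationship the expectation is zero, and the Z scores discussed below are simply the statistic divided by its (typically block-jackknife) standard error.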

I downloaded the raw reads for 163 ancient individuals from Mathieson et al. and mapped the data to the human reference using bwa aln with two different settings for the -n parameter, which controls how many mismatches to the reference are tolerated. I used both the default -n 0.04 and -n 0.01, which allows for more mismatches and therefore helps to use more of the (potentially degraded) sequences in the library. I then merged both versions of each individual into one tped file by randomly choosing one read per SNP and BAM file (~1.2 million captured SNP sites in total). Finally, I used popstats to calculate the statistic D(Chimp, hg19_reference; IndividualX_n0.01, IndividualX_n0.04), i.e. I tested whether the hg19 reference genome shares more alleles with either of the two versions of the same individual. The answer should obviously be “no”.
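For illustration, here is a minimal sketch of the “one random read per SNP” (pseudohaploidisation) step, assuming a coordinate-sorted, indexed BAM and a list of SNP positions. The file name, the snp_positions structure and the quality cut-offs are illustrative assumptions, not the exact settings of the pipeline used here.

```python
# Minimal sketch: draw one random allele per captured SNP from a BAM file.
# Assumes a coordinate-sorted, indexed BAM; quality cut-offs are arbitrary.
import random
import pysam

def sample_pseudohaploid(bam_path, snp_positions, min_baseq=30, min_mapq=30):
    """Return {(chrom, pos0): base} with one randomly drawn allele per SNP."""
    calls = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, pos0, ref, alt in snp_positions:          # pos0 is 0-based
            bases = []
            for col in bam.pileup(chrom, pos0, pos0 + 1, truncate=True):
                if col.reference_pos != pos0:
                    continue
                for pr in col.pileups:
                    if pr.is_del or pr.is_refskip:
                        continue
                    aln = pr.alignment
                    if aln.mapping_quality < min_mapq:
                        continue
                    if aln.query_qualities[pr.query_position] < min_baseq:
                        continue
                    base = aln.query_sequence[pr.query_position]
                    if base in (ref, alt):                    # keep the two SNP alleles only
                        bases.append(base)
            if bases:
                calls[(chrom, pos0)] = random.choice(bases)   # one random read per site
    return calls

# Hypothetical usage:
# calls = sample_pseudohaploid("IndividualX_n0.01.bam",
#                              [("1", 752565, "A", "G"), ...])
```

Converting the resulting calls to tped and merging the two versions of each individual is then a matter of bookkeeping.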

 

[Figure dtest-hg19-out-hist: (left) histogram of the Z scores (zero as blue vertical line, the empirical mean shown in red); (right) QQ-plot of expected vs. observed Z scores]

Across all 163 individuals, the results show an enrichment of positive Z scores (the expectation would be 0). In fact, the entire distribution appears shifted towards positive values: the median is 0.59 and the mean is 0.62, which is significantly different from zero (one-sample t-test, p = 3e-11). Seventeen Z scores are larger than 2, and three are even larger than 3, so there is quite some room to misinterpret the results of D tests when samples have been processed differently. This particular setting is artificial, as nobody will ask which version of an individual is more or less “reference”, but the human reference sequence itself is closer to some populations than to others, so spurious effects like this could also make a test individual seem more similar to certain populations. The effect is relatively small, but it makes quite a difference whether we claim that population X traces 2% of its ancestry to population Y or whether it is actually 0%.
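For completeness, the test of the mean Z score against zero is just a one-sample t-test; a minimal version (reading the 163 Z scores from a hypothetical file, which I’m not reproducing here) would look like this:

```python
# One-sample t-test of the per-individual D-test Z scores against zero.
# "dstat_z_scores.txt" is a hypothetical file with one Z score per line.
import numpy as np
from scipy import stats

z = np.loadtxt("dstat_z_scores.txt")          # 163 values, one per individual
t_stat, p_value = stats.ttest_1samp(z, 0.0)   # H0: mean Z score is zero
print(np.median(z), z.mean(), p_value)
```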

The results make sense: allowing for fewer mismatches introduces some inherent reference bias, whereas the more permissive setting will accept reads carrying the non-reference allele more often. The random sampling of one read per SNP position rests on the assumption that reads carrying the reference and the alternative allele occur at a 50/50 ratio at heterozygous positions, and this assumption is violated under reference bias.
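To make that concrete with a toy example (assumed numbers, not real data): if reads carrying the alternative allele survive mapping with some relative probability r < 1 compared to reference reads, the expected alternative-allele fraction among sampled reads at a truly heterozygous site drops from 0.5 to r/(1+r).

```python
# Toy simulation (assumed retention rates, not real data): reference bias
# shifts the sampled allele ratio at heterozygous sites away from 50/50.
import random

def sampled_alt_fraction(r_alt=0.9, n_sites=100_000, depth=5):
    """Fraction of sites where the randomly drawn read carries the alt allele."""
    alt_draws = 0
    for _ in range(n_sites):
        reads = []
        for _ in range(depth):
            allele = random.choice(["ref", "alt"])       # true 50/50 heterozygote
            keep = 1.0 if allele == "ref" else r_alt     # alt reads lost more often
            if random.random() < keep:
                reads.append(allele)
        if reads and random.choice(reads) == "alt":      # one random read per site
            alt_draws += 1
    return alt_draws / n_sites

print(sampled_alt_fraction(r_alt=0.9))   # roughly 0.47 rather than 0.5
print(0.9 / (1 + 0.9))                   # analytical expectation ~0.474
```

Small systematic shifts like this, accumulated over many sites, are exactly the kind of signal that f and D statistics pick up.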

This reference bias is likely to affect all sorts of f statistics and related methods (f3, f4, D, ABBA-BABA, qpAdm, TreeMix etc.), and it might affect other approaches as well. So I will definitely make sure that all data is processed as uniformly as possible, which means running all samples through the same in silico pipeline, even if that can be quite time-consuming for some big aDNA papers. I also assume that other steps in the pipeline (e.g. different diploid genotype callers or base quality rescaling) could cause similar spurious correlations.

SMBE2016 ahead

On Sunday, this year’s edition of SMBE starts in Gold Coast, Australia. I’m very excited, and not just because of flying to Australia (thanks to Wenner-Gren for the money): Emma, Helena, Cris and I organized a session entitled “Ancient DNA and the effects of human innovation on genomes”. Emma and I will host that session on site, and we have some pretty exciting talks lined up. The session is on July 7th from 10:30am – 12:30pm in Room 6.

In addition to that, Cris will be presenting our results on Neolithic Iberia, where we see some interesting local patterns, both geographical and temporal (July 7th, 12:00 pm). Emma is presenting a poster on a Neolithic Scandinavian dog and how it relates to modern dog breeds. Finally, I will be talking on Wednesday afternoon (4:30pm – 6:30pm, also Room 6) about Cris’ and my results from two medieval Moroccans. They represent the first ancient genomic DNA from Northern Africa, and they look like a unique mixture of populations. We don’t really know who they were, but I will present the genetic results and discuss possibilities (slaves, soldiers, something else?).

I’m looking forward to some exciting talks and posters, to interesting discussions, and to meeting old friends and making new ones. See you Down Under!

Two new publications

The last two weeks have been quite successful on the publication side, with two new papers: one in Scientific Reports and one in Molecular Ecology.

The latter paper is somewhat special to me. Christian, Karl, Ivan from Hohenheim, and I characterized five Arabidopsis thaliana populations from the Italian Alps both genomically and phenotypically. We found some nice signals suggesting local adaptation to different altitudes (up to 2,350 meters) and their environments, and we also studied how the populations’ demographic history relates to the last glacial maximum. So why is this one special? It is not just my last Arabidopsis paper for a while; it is also the first study where I was involved in planning the experiments as well as the sequencing. It was the reason for visiting Graham’s lab in Davis and the reason for developing Bayenv2 with him. That was almost five years ago, so working on the actual publication took a lot longer than intended, but it has been quite exciting, and the process itself has involved some very important steps towards “growing up”.