DNA Terminology | Sproul Project

A recent discussion in the U106 forum prompted some questions about Big Y, haplogroups, clades, etc. A response was given by one of the forum's moderators, Dr. Iain McDonald. The posting is so thorough that I asked his permission to post it here. Although the SNPs used in this example do not represent those in the Sproul Project, the process is the same. This is solely for our understanding of how we use DNA to support our genealogical research:

James has given us an example of three descendants from the three Miller brothers. Their father, the most-recent common ancestor (MRCA) for all three men, was born in 1785. (Although the third Miller has no documented link to this father, we'll presume that that's an accurate relationship for now.) They share a common "terminal" haplogroup, R-BY62556, and each has two SNPs which are "private" to his own test. We'll use this as an example.

TERMINOLOGY

-----------

The first thing we have to clarify is the terminology. Several jargon terms are employed in genetic genealogy, and some are specific to Family Tree DNA in particular.

The first is the haplogroup. In Y-DNA, a haplogroup describes a group of related tests which share a common set of SNPs. This might be a broad, ancient haplogroup like R-U106 or R-M269, or even haplogroup R itself. Or it might be a very recent haplogroup containing only two or three tests, like R-BY62556.

A haplogroup is distinct from the mutations that define it. Sometimes a haplogroup only represents one SNP, e.g., the only SNP we know of that defines R-U106 is U106 itself. Note that the "R-" prefix is given to the haplogroup - we frequently forget to add this on the forum. However, often a haplogroup is defined by many SNPs. In the case of R-BY62556, it is currently defined by BY62556, BY145647, BY100214, BY71181, etc. In all, there are 23 different SNPs that go into making R-BY62556. There is nothing special about any one of these SNPs, and we've arbitrarily selected BY62556 to denote the haplogroup. A haplogroup may also be defined in terms of other mutations, including STR length changes, insertions, deletions or MNPs, though Family Tree DNA only allows a definition by SNPs, as these are the most robustly recorded.

Another word you will see for haplogroup is "clade". These are distinct terms. A haplogroup technically refers to the group of tests; a clade refers to the men that took those tests. However, you'll generally see "haplogroup" and "clade" used interchangeably in genetic genealogy. You may hear talk of "parent clades" and either "sub-clades" or "child clades", and even "brother clades". (Technically, clades are feminine, so you get "daughter/sister/mother clade", but this doesn't make as much sense in Y-DNA.) If you consider the incomplete sequence of haplogroups R-Z156 > R-S5560 > R-S5512 > R-BY62556, R-S5512 is the parent clade of R-BY62556, and R-BY62556 is the child clade of R-S5512. R-BY62556 has one brother clade, R-BY33289.
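As a rough illustration of the parent, child and brother relationships, the sequence quoted above can be held as a small lookup table. This is only a sketch: the table contains just the clades named in the text, any levels the incomplete sequence skips are skipped here too, and the function names are made up for the example.

```python
# Parent-to-children map for the partial sequence quoted in the text.
# The real tree has many more branches; only the named clades appear here.
CHILDREN = {
    "R-Z156": ["R-S5560"],
    "R-S5560": ["R-S5512"],
    "R-S5512": ["R-BY62556", "R-BY33289"],
}

def parent_of(clade):
    """Return the parent clade, or None if the clade is the top of this map."""
    for parent, children in CHILDREN.items():
        if clade in children:
            return parent
    return None

def brothers_of(clade):
    """Return the other child clades that share the same parent."""
    parent = parent_of(clade)
    if parent is None:
        return []
    return [c for c in CHILDREN[parent] if c != clade]

def lineage(clade):
    """Walk up from a clade to the top of this map, oldest clade first."""
    path = [clade]
    while parent_of(path[-1]) is not None:
        path.append(parent_of(path[-1]))
    return list(reversed(path))

print(" > ".join(lineage("R-BY62556")))
print(brothers_of("R-BY62556"))
```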

The other two elements we need to define are the oft-maligned terms "terminal" haplogroup and "private" SNPs. Any given man will belong to many different haplogroups on the Y-DNA tree, perhaps including R-M343, R-M269 and R-U106, and you can follow the Y-DNA tree from its ancient root down through these haplogroups to reach the "terminal" haplogroup. It's called a "terminal" haplogroup because this is where the tree currently reaches a terminus. However, this doesn't mean it's a dead end, since everyone has "private" SNPs that are unique to their test and haven't been found in anyone else... yet. You might also see "private" SNPs being labelled "novel variants". In James's case, each of the three Miller men has two "private" SNPs that aren't found in the other two Miller testers.

WHEN DID A SNP FORM?

--------------------

If you're asking this, you're probably asking the wrong question.

In general, we don't care about individual SNPs. As genealogists, we are interested in relationships. SNPs are just pointers to that relationship. Consider the analogy of a beam in a building. It was common practice to re-use beams from one building in another. Dendrochronology and carbon dating can give you an accurate date for when the tree lived, and there are a number of archaeological or paper records you can use for when the building was built. You can ask the question "when was this beam cut?" (cf. when did this SNP form) and you'll get the answer "after the tree was cut down" and "before the building was built" (cf. after the parent haplogroup formed and before the child haplogroup formed). There are only very specific times when we care about when a beam was cut. In general, we care much more about when the building was built. Similarly, we very rarely care about when a SNP formed; we almost always care much more about when the relationship it traces was defined.

In James's case, we know that all three men who have tested R-BY62556 descend from a common ancestor born in 1785. Hence, the foundation of the haplogroup R-BY62556 is 1785 AD. It may not always stay that way, but that is what it is now. This is our reference point. We also know that its parent haplogroup, R-S5512, formed sometime during the second millennium BC. For simplicity, let's say exactly 1500 BC, though I'll stress this isn't meant to be an exact date in reality.

We normally expect about one SNP every 125 years in a BigY test, but mutations occur randomly, so this is only an average. Remember I said that R-BY62556 was defined by 23 SNPs? Those occurred some time between circa 1500 BC and 1785 AD. That's a little slower than one SNP per 125 years, but about right. Remember also that I said that none of these SNPs is special compared to the others. They are all equal, and any one of them could have happened at any time between the foundation of R-S5512 and R-BY62556, so between circa 1500 BC and 1785 AD. We have no way of knowing exactly when each one occurred, and no way of knowing which order they came in. It may be that BY62556 itself occurred in 1785 AD, or it may have been in 1499 BC. All we know is it isn't in the people James is related to around 1500 BC, and it is in the people with whom James shares a common ancestor in 1785 AD, so it must have occurred between these two dates.
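The "a little slower, but about right" comparison can be checked with a few lines of arithmetic. This is a sketch that takes the illustrative dates of 1500 BC and 1785 AD at face value; the helper name is made up for the example.

```python
def years_between(bc_year, ad_year):
    """Years elapsed from a BC year to an AD year (there is no year 0)."""
    return bc_year + ad_year - 1

SNP_COUNT = 23           # SNPs currently defining R-BY62556
EXPECTED_INTERVAL = 125  # average years per SNP in a BigY test, from the text

span = years_between(1500, 1785)
observed_interval = span / SNP_COUNT     # years per SNP actually seen
expected_snps = span / EXPECTED_INTERVAL # SNPs expected at the average rate

print(f"Span: {span} years")
print(f"Observed: one SNP every {observed_interval:.0f} years")
print(f"Expected at one per {EXPECTED_INTERVAL} years: {expected_snps:.1f} SNPs")
```

The observed interval comes out at roughly 143 years per SNP, against about 26 SNPs expected at the average rate, which is the small shortfall described above.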

The only way we can work out which SNPs occurred in which order is to find a fourth tester who is related to James between 1500 BC and 1785 AD. He will normally be positive for some of the SNPs, and negative for some of the others. That man will establish a new haplogroup, and those 23 SNPs will be split between R-BY62556 and the new haplogroup. If that new man is BY62556-, the name R-BY62556 will remain with James and his close cousins and a new name will be chosen among the SNPs all four men share to define a haplogroup relating those four men. If the new man is BY62556+, the haplogroup R-BY62556 will belong to all four of them, and a new name will be chosen to define the haplogroup James shares with his close cousins. So names float around, but it's the relationships that underpin them that matter to us.
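The splitting rule can be sketched in a few lines. The fourth tester's results below are entirely hypothetical, only the four SNPs named earlier are used rather than all 23, and a SNP that is simply not covered in the new test is treated as negative, which a real analysis would not do.

```python
def split_block(block, positive_in_new_tester):
    """Split a haplogroup's SNP block using a new, more distant tester.

    Returns (older, younger): the SNPs the new tester shares, and the
    SNPs found only in the existing testers.
    """
    older = [s for s in block if s in positive_in_new_tester]
    younger = [s for s in block if s not in positive_in_new_tester]
    return older, younger

def who_keeps_name(name_snp, older):
    """Say which of the two resulting haplogroups keeps the existing name."""
    if name_snp in older:
        return "all four men"
    return "James and his close cousins"

block = ["BY62556", "BY145647", "BY100214", "BY71181"]
fourth_tester = {"BY145647", "BY71181"}  # hypothetical results: BY62556-

older, younger = split_block(block, fourth_tester)
print("Shared by all four:", older)
print("Only in the three Millers:", younger)
print("R-BY62556 now names:", who_keeps_name("BY62556", older))
```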

An example where the age of the SNP is pertinent might come from a SNP pack or single SNP test. These are common through both Family Tree DNA and YSeq.net, and can give people a quick answer of whether someone is related to someone. However, it redefines the common ancestor in peculiar ways. Say someone comes along and tests only the SNP BY62556, and they come back BY62556+. All other things being equal, they are most likely related to James before 1785, so the common ancestor of R-BY62556 will probably move backwards in time. But we don't know how far. There is no way of saying whether someone with a SNP test like this is related to James in 1499 BC or more recently. Only by testing more of the 23 SNPs that define R-BY62556 can we make an educated guess as to how far back he might be related. Given the cost of this, he would normally be better off taking a next-generation sequencing test like the BigY test to find out. This is the power of BigY - it tests across the Y chromosome, so it allows us to make good statistical estimates.

--------------------------------------------

There are two reasons SNPs can be "private". The most common is that they are unique to that person's line of descent. In James's case, that would mean those SNPs have happened since 1785, in his family line. Two SNPs in this timeframe is about the right number, since we expect one every 125 years on average. The other reason is that a SNP is simply not recorded in the other tests in that haplogroup. To understand that, we need to look at what the BigY test actually covers.

The Y chromosome is a complex structure with a wide variety of genetic code in it. The last page of my primer document* gives an overview of where and what these sections are. Some of the code is unique to specific locations on the Y chromosome. This is easily read by sequencing tests. However, most of the chromosome isn't so easy. Some of the code has been copied from other chromosomes over the last few million years, and remains very similar to the code in those chromosomes. Most of the Y chromosome, however, is made up of highly repetitive structures. Some of these are the STRs we're familiar with from our 37 / 67 / 111 marker Y-STR tests. Other regions contain longer repeating structures which take a variety of forms. These are very difficult and even impossible to read with current technology. No-one knows what the second half of the Y chromosome looks like, because it's made up of the same repeating units that we can't tell apart.

* http://www.jb.man.ac.uk/~mcdonald/genetics/report-2017-primer.pdf

What sequencing tests like BigY do is break your DNA up into small segments. The "read length" of the test defines the typical size of these segments. Normally, they're about 100 to 150 base pairs long, compared to the 57 million base pairs on the Y chromosome. A test can only read a section of chromosome if these small segments can be placed correctly onto that section of our map of the Y chromosome. If there is any ambiguity about where it is placed, you'll end up with a bad mapping. When structures repeat, it can be very difficult to know which repeat each segment corresponds to, so they can't be mapped.
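A toy example shows why repeats defeat mapping. This sketch uses a made-up 36-letter "chromosome" and exact matching only; real mapping software tolerates mismatches and scores each placement, none of which is modelled here.

```python
def count_placements(reference, read):
    """Count every position (overlaps included) where the read matches exactly."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count

def is_mappable(reference, read):
    """A read is usable only if it fits exactly one place on the reference."""
    return count_placements(reference, read) == 1

# Toy "chromosome": a unique stretch followed by a repetitive stretch.
reference = "ACGTTGCAAGCT" + "GATA" * 6

for read in ("TTGCAAG", "GATAGATA"):
    places = count_placements(reference, read)
    print(read, "fits", places, "place(s); mappable:", is_mappable(reference, read))
```

The read taken from the unique stretch has a single possible home, while the read taken from inside the repeats fits in several places and so cannot be assigned to any one of them.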

However, this isn't the same for each test. Variations in the quality of the genetic sample, inherited mutations in all chromosomes, and how the lab equipment is behaving that day all go into defining whether a particular section of someone's DNA will be read. Two tests from the same person will return slightly different parts of their Y chromosome. Normally, there's about a 98% overlap in the parts of the Y chromosome that two different tests will cover. But SNPs can and often do fall into the 2% that doesn't overlap. SNPs can also throw off the matching, because they represent differences from the template Y chromosome, so this ends up representing more than 2% of SNPs.

So "private" SNPs may either be unique to your family line (in James's case after 1785), or may simply be missing from a person's test. Arguably, Family Tree DNA doesn't do a very good job of differentiating between the two, and one has to go into the raw data files to check. (This is one reason we like to collect them and, although I've yet to perform this check on these three files myself, this is something we're hoping to get done routinely again.)

In James's case, he's highlighted three blue SNPs: BY205805, BY208246 and BY209136.

For privacy reasons, Family Tree DNA will not identify on their public haplotree SNPs that are private to a single individual. These SNPs are listed as equivalent to BY62556 and form part of R-BY62556. They will be those identified in the two other Miller testers, but not called as positive or negative in James's test because they weren't covered. However, we can presume that James is positive for them because we know that all three Millers descend from the same common ancestor. Without checking the raw data, we can also presume that he is not explicitly called negative for them because Family Tree DNA has not created a new haplogroup (sometimes this takes a few weeks, though).

James's example highlights the value of getting more people within a haplogroup tested. Of the 23 SNPs in R-BY62556, only 19 of them were recorded successfully in the first two tests. The three blue SNPs James highlighted, plus BY64954 (which wasn't called in the second Miller test), are all called in the third Miller test, bringing that 19 up to 23. Roughly 17% of recordable SNPs (4 of 23) were missed in either his test or the second Miller test. This is a slightly higher percentage than normal, but not by much. A few more SNPs might be found from a fourth, fifth or sixth test. However, by that point a more advanced test (e.g. the YElite or WGS tests offered by FGC or YSeq) represents a better investment. Even so, you are never going to bag all the SNPs in your line until someone comes along with a test that can accurately read the entire Y chromosome. That's some years off yet.

HOW CLOSELY RELATED IS A HAPLOGROUP?

------------------------------------

James and his cousins share the same branch of the Y-DNA tree. There are no SNPs that are clearly called positive in one of the cousins and negative in two of the others - this would form a new haplogroup. We've taken as read that they share the same common ancestor, as three sons of the same founding father. However, the third son doesn't have a good paper trail to the putative father.

This isn't too dissimilar from many haplogroups high up in the tree, e.g. R-Z156 has at least a dozen different sub-clades. How closely are these men related? To answer this, we need to look in detail at the random processes associated with SNP generation. Random processes like this are governed by Poisson statistics. So even though a SNP will occur every 125 years on average, some clades will take longer to produce their first SNP, some will take less long. For example, the following formula (for use in almost any spreadsheet software): =POISSON(0,164/125,1) gives the probability of not forming a SNP over 164 years, and equals approximately 27%. Conversely, you can say that there is a 100%-27% = 73% chance that a person will form at least one SNP over 164 years. Similarly, one can compute:

=POISSON(0,86/125,1) to find that the probability of not forming a SNP within 86 years of a most-recent common ancestor is about 50%. (It's 86 years and not 125 years because the probability distribution isn't a Gaussian [bell curve]). So 50% of the sub-clades of R-Z156 should share a common ancestor within about 86 years (2-3 generations) of each other - i.e. the 12 known R-Z156 lines probably represent at least six different great-grandchildren of the R-Z156 founder. There is also a 50% chance that the third Miller in James's group is related more closely than 1699 AD (86 years before 1785 AD).
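For anyone without a spreadsheet to hand, the two formulas above can be reproduced in Python. With a count of zero, the cumulative Poisson probability reduces to exp(-years/125); the function name and default rate below are just illustrative choices.

```python
import math

def prob_no_snp(years, years_per_snp=125.0):
    """Poisson probability of zero SNPs forming in the given number of years.

    Equivalent to the spreadsheet formula =POISSON(0, years/years_per_snp, 1).
    """
    return math.exp(-years / years_per_snp)

for years in (164, 86):
    p = prob_no_snp(years)
    print(f"{years} years: {p:.1%} chance of no SNP, {1 - p:.1%} of at least one")
```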

Of course, we can compute the same numbers for different percentages, e.g.:

=POISSON(0,311/125,1) gives an 8.3% probability of not forming a SNP within 311 years, or one in 12. So if you take the 12 lines descended from R-Z156, chances are you have twelve different lines no more than 311 years (about nine generations) after the common R-Z156 ancestor. Equally, there's a 91.7% chance that the third Miller in James's group is related more closely than 1785 - 311 = 1474 AD.
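The calculation can also be run in the other direction: pick a probability and solve for the number of years. This sketch inverts the same zero-SNP Poisson formula, again assuming the average of one SNP per 125 years; the function name is made up for the example.

```python
import math

def years_for_prob_no_snp(prob, years_per_snp=125.0):
    """Years after which the chance of still having formed no SNP equals prob."""
    return -years_per_snp * math.log(prob)

# Half of all lines still without a SNP, and one line in twelve.
for label, prob in (("1 in 2", 1 / 2), ("1 in 12", 1 / 12)):
    print(f"{label}: about {years_for_prob_no_snp(prob):.1f} years")
```

The two results land within a year of the 86 and 311 years used above.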