Genealogy Ancestor Search Using Math

One from the archives, from an old (and not yet concluded) project to search for the origins of my name. What is this graph and what does it mean? It’s a kind of family tree, but the full explanation is below!

Most genealogy is personal and pretty boring so I’ll skip that stuff. Instead, I’ll write here about statistics, which is comparatively interesting.

My surname, Handmer, is rare enough that everyone who has it (about 20-30 living people) can have a unique URL. Previous family history efforts (before my time) sought some connection with the slightly more common name “Hanmer” and the Welsh hamlet near Whitchurch of the same name. When I got interested in this stuff, in the last few years of my paternal grandmother’s life, I was able to use DNA and digital archiving and family tree tools to replicate, in a few weeks of effort, what had taken my predecessor many years to work out, solving a few mysteries, correcting a few errors, etc, along the way. I now have a fairly complete family tree back to about 1800 but still little insight into where the Handmers come from.

Here’s what I tried and how it worked.

The first Handmers in Australia, my grandfather’s great grandfather and his family, emigrated from Britain to South Australia in 1851 on the Trafalgar. They appear occasionally in contemporary newspapers and birth/death/marriage records, then disappear by 1870; only two of his eleven children lived to adulthood. The kicker – when they arrived their names were John Jones and Lydia Williams, with daughter Leah Emazine Jones – Handmer didn’t appear until later. There are a few records in the UK (a birth, a marriage, and a census) but that’s it. John Jones is a very common name, so traditional genealogy tools are exhausted, at least until the next tranche of parish records is digitized.

Seeking a connection with contemporary Hanmers, I mapped the family tree of the various Hanmer branches – around 1800 there were 20-30 families in various parts of Britain, but none (with maybe one exception) could possibly be connected. John and Lydia enter the story in Oswestry, Shropshire – not insanely far from Hanmer (the hamlet) but not exactly close either. John was supposedly born in Wales but no birth or death record seems to exist. Incidentally, one of the most comprehensive family tree services is called Family Search, which is run by the Mormons for reasons…

The next strategy is Y-chromosome testing – the chromosome passed down from father to son. Through this method I’ve found a couple of 8th cousins, who share a common great^6 grandfather in the early 1700s, but nothing more recent. Given infant mortality, it’s possible, even likely, that only a single line passed through the 2-4 generations between then and the radiation of Handmers in Australia in the late 1800s. Even better, all the close Y-chromosome relatives’ names are “Roberts”, not Jones or Handmer or Hanmer!

Taking DNA a step further, there are no (so far) relatives via DNA with the name “Hanmer” so if there is a connection to one of those branches, there are no living descendants who have put their DNA into the major databases.

Speaking of which, the major database in the US is Gedmatch. There are a few million users, so practically everyone in the US will have at least a bunch of third cousins on there, which has enabled quite a few cold serial killer cases to be solved. I’ll flag here that DNA databases have interesting issues with privacy because if, for example, a few dozen of your third cousins are in some database, well, you are too, whether you like it or not. There’s just not that much genetic variation in human lineages.

At this point I was pretty blocked. Well, there was one remaining possibility. What if I used Gedmatch to find my 3000 closest relatives on the database? Of course, there are probably a million people on Earth this closely related to me, but 99% of them have never been sequenced, so we have to make the best with what we have. Of these 3000, the closest 20 or so are in my family tree, so I know which branch of the family they are in. I can then search through their relatives and split the 3000 according to which branch they’re from. Roughly speaking, 1500 are via my dad and 1500 are via my mum. I’m looking here for my paternal line, so take the 1500 paternal relatives and perform the same operation recursively.
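The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the code actually used: the function name, the ID strings, and the "vote by shared known relatives" rule are all assumptions made for the example, and the input is assumed to be a plain dictionary rather than anything Gedmatch exports directly.

```python
from collections import Counter


def split_by_branch(shared, seeds):
    """Assign each match to the branch of the known relatives it matches most.

    shared: dict mapping a match ID to the set of IDs that person also matches
            (hypothetical input format).
    seeds:  dict mapping a known relative's ID to a branch label.
    """
    branches = {}
    for person, their_matches in shared.items():
        if person in seeds:
            branches[person] = seeds[person]
            continue
        # Count how many known relatives from each branch this person matches.
        votes = Counter(seeds[s] for s in their_matches if s in seeds)
        ranked = votes.most_common(2)
        if not ranked:
            branches[person] = None  # no known relative in common
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            branches[person] = None  # tie: evidence is ambiguous
        else:
            branches[person] = ranked[0][0]
    return branches


# Toy data: every ID and label here is invented for illustration.
seeds = {"aunt_p": "paternal", "cousin_p": "paternal", "uncle_m": "maternal"}
shared = {
    "aunt_p": {"cousin_p", "m1"},
    "cousin_p": {"aunt_p", "m1"},
    "uncle_m": {"m2"},
    "m1": {"aunt_p", "cousin_p"},
    "m2": {"uncle_m"},
    "m3": {"aunt_p", "uncle_m"},
    "m4": set(),
}
result = split_by_branch(shared, seeds)
```

To recurse, take only the people labelled "paternal" and run the same function again with finer seed labels (for example one label per paternal grandparent). People who match known relatives on both sides, like `m3` here, are left unassigned rather than guessed.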

Once I’ve identified a set of about 20 relatives who are related to me exclusively via the Handmer-Jones-Williams line, I map their relatives. Why? Because at this far remove there’s a lot of noise in the system, and different people conserve different small segments of the ancestors’ DNA. Essentially John Jones’ father was a man who lived in an uncertain time and an uncertain place, with an unknown name, but much of his DNA survives, albeit in a pretty scrambled way, among thousands of descendants, and a few hundred of them have had their DNA sequenced.

Then work my way down the list checking family trees and locations. A few relatives in Australia descended from John Jones but not on the male line. A whole bunch in the US, where DNA testing and genealogy obsession are relatively common, and a few in north west Wales, including Anglesey, an island off the coast.

I’ve now mapped where most of my ancestors came from (mostly Britain and Ireland, which is why I can’t go outside in the sun much…) and one day I’ll spend a few weeks traipsing between various miserable tiny towns trying to imagine the mindset of mostly illiterate people who jumped on a leaky boat to go to Australia – more remote and exotic to them than Mars is to us.

This analysis worked surprisingly well, but it didn’t fill in any blanks on the tree. No-one I have contacted so far has a record-backed family tree (though there are a few imaginative reconstructions!) filling in the blanks. Such is life!

But thanks to computers, we can essentially perform this task on every branch of the tree recursively. Here’s how it works.

I searched for my 3000 closest relatives on Gedmatch, then scraped data for the 3000 closest relatives of each of those 3000. This let me determine how each of the 3000 was related to the others, which forms a symmetric matrix. I then applied fairly standard clustering algorithms to permute rows and columns so as to minimize a measure of how much data lies away from the axis (the diagonal of the matrix). I did this recursively one generation at a time, corresponding to how a family tree works. The result is a rather nice looking series of blocks, a digital representation of genetic data for genealogy nuts who are related to me.
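For readers who want to see the shape of the computation, here is a toy version in plain Python. The post does not say which clustering algorithm or which cost measure was used, so both are stand-ins: the cost is the total distance of matching pairs from the diagonal, and the ordering is a simple greedy heuristic. All function names and the six-person data set are invented for the example.

```python
def build_matrix(ids, matches):
    """Symmetric 0/1 matrix: entry [i][j] is 1 if ids[i] and ids[j] match."""
    index = {person: i for i, person in enumerate(ids)}
    n = len(ids)
    m = [[0] * n for _ in range(n)]
    for a, b in matches:
        i, j = index[a], index[b]
        m[i][j] = m[j][i] = 1
    return m


def off_axis_cost(m, order):
    """Sum, over all matching pairs, of their distance from the diagonal."""
    pos = {node: p for p, node in enumerate(order)}
    n = len(m)
    return sum(abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n) if m[i][j])


def greedy_order(m):
    """Place next whichever person matches the most already-placed people."""
    n = len(m)
    if n == 0:
        return []
    order = [0]
    remaining = list(range(1, n))
    while remaining:
        # max() keeps the first of any tied candidates, i.e. original order.
        best = max(remaining, key=lambda c: sum(m[c][p] for p in order))
        order.append(best)
        remaining.remove(best)
    return order


# Two unrelated families of three, listed in interleaved order.
ids = ["a0", "b0", "a1", "b1", "a2", "b2"]
matches = [("a0", "a1"), ("a0", "a2"), ("a1", "a2"),
           ("b0", "b1"), ("b0", "b2"), ("b1", "b2")]
m = build_matrix(ids, matches)

before = off_axis_cost(m, list(range(len(ids))))
order = greedy_order(m)
after = off_axis_cost(m, order)
clustered_ids = [ids[i] for i in order]
```

On this toy input the reordering pulls each family into its own block on the diagonal and halves the cost. A real run over thousands of people would more likely use a library routine such as hierarchical clustering, applied recursively within each block.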

To be clear, I’m not the first person to ever do this. Gedmatch itself, in its paid tier, offers a product that does a similar clustering and can even map it to your family tree automatically, which is nifty. But last time I checked, it didn’t work to quite this level.

Let me describe what we’re seeing on this graph. Each dot on the graph represents a relation between two of my relatives, and the clustering algorithm has brought closely related relatives close to each other in order. There are very few relatives who are a long way away from the axis of the matrix, and the axis contains a few very dense clumps, but is otherwise relatively sparse. By inspection, the axis breaks up quite cleanly into clumps in powers of two (2, 4, 8, 16, etc) each representing a more distant generation.

In more detail, the dots which are far from the axis fall into three groups, and the shape of their clumps can help you tell what’s going on. A dot far from the axis means that one of my relatives is related to more than one of my (roughly) great^4 grandparents. This could mean one of three things:

1) They’re very closely related to me. My immediate living relatives, some of whom are in this graph, are all descended from one of my pairs of great-grandparents, so they share lots of more distant ancestors with lots of the more distant relatives in this graph, and appear as a long streak.

2) They just happen to be descended, via other people (say, siblings circa 1850), on two or more routes that share ancestors with me. It’s not that unusual to meet people in Australia who have some family connection way back to my family – and if you kept asking, it wouldn’t be that unusual to find people who have more than one family connection. Say, related to two of my otherwise unrelated great grandparents. They would then appear as a lonely dot way out in the space far from the axis. The clustering algorithm has to choose which relative ends up on the axis and it picks the closest one. It is pretty common, for example, to find that someone on the Gedmatch database who is a 4th cousin via one branch is also a 7th cousin via another and a 9th via still another.

3) They’re very closely related to each other, as well as me. The dense square blobs on the horizontal axis represent great-great-great… grandparents who happened to come from relatively isolated communities that practiced a lot of “endogamy”, or marrying within the community. Obviously, since the chart isn’t one huge blob this practice ended at some point, and the bigger the blob, the more recently someone “married out” of the group. The largest blob, at the top left of the graph, represents one of my great great grandmothers, who despite being a second generation Australian, still lived within a community comprised almost entirely of people who had emigrated from the Scottish Isle of Skye, itself a rather insular place. Even within this, and other, blocks, one can see sub-blocks showing various degrees of inter-relatedness. For my great great grandmother, three of her four grandparents were quite closely related.

To make this situation concrete, on Gedmatch I have a few hundred ~4th cousins who are all, like me, descended from someone from Skye, and are thus all ~4th cousins to each other, and thus fill the graph with ink points. The other blocks on the graph are half the size and thus one generation earlier, and date back to the generation that migrated to Australia. Judging by the block density on the axis, something like a fifth of my zeroth generation Australian ancestors came from communities that were still practicing endogamy around 1800.

But why?

The cool thing here is that I’ve essentially reconstructed a genetic family tree well beyond the limitations of actual records. Typically these groupings extend back to the early 1700s, nearly a century (or 4 generations) beyond paper records for the areas of interest. I can select a region on the axis of this graph and, by carefully comparing who the closest relatives in that group are related to, determine which branch of my family tree it comes from. That means that everyone in that section shares a common ancestor. I can look up their family trees (with their cooperation), compare their surnames, etc, and try to solve mysteries.

Each block on the diagonal of this graph represents a person, or a family, whose only “identity” in 2024 is this transient echo of their genes. All traces of their name, their birthplace, their life, hopes, dreams, loves, and losses are lost forever. In all probability, they lived and died in a community that didn’t produce any kind of written records. But with technology and math we can find the echo of their existence in the relatedness of a few hundred living descendants today.

Even though I did this work in about 2019, I was reminded of it recently because of my work on the scroll prize and the question of how much of our lost past can be recovered with statistics and skill. Even though this sort of genetic analysis extends some limited genealogical knowledge about a century beyond paper records in NW Europe, it has its limitations. Every generation scrambles the data and cuts the signal by a factor of two. When doing a family tree, the number of ancestors increases exponentially (at least at first) but then data is lost exponentially into the past, and solving problems becomes extremely difficult. With DNA we can get a bit further sometimes, but commercially available testing kits only sequence about a million single nucleotide polymorphisms (SNPs). The genome is about 3.2 billion base pairs long, but most of these are relatively conserved between all humans (we share 99% with chimps!). There are about 100 million SNPs known across all humans, but of these most are not very common. So even if standard testing went from 1 million SNPs to 100 million, we might only get 10x more data, instead of 100x. 10x more data is great, but that would only buy us another 3-4 generations, which is less than a century.
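The "3-4 generations" figure follows from the halving: if each generation cuts the signal by two, then k times more data buys about log2(k) generations. A quick check, under that simplified model only (the function name is made up for the example):

```python
import math


def extra_generations(data_multiplier):
    """Generations gained if every generation halves the usable signal."""
    return math.log2(data_multiplier)


gain_10x = extra_generations(10)    # a little over 3 generations
gain_100x = extra_generations(100)  # exactly twice the 10x gain

# At roughly 25 years per generation (4 generations per century).
years_10x = gain_10x * 25
```

So 10x more data is worth a bit over three generations, or somewhat more than 80 years at 25 years per generation, and even the full 100x would only double that.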

Does that mean that DNA gets so scrambled that after 10-15 generations essentially the pedigrees are completely lost? In general, yes. That is, even if we sequenced every base pair of every living person in Europe and applied infinite computational power to the problem, earlier than 1600 the number of equally probable genealogical reconstructions grows exponentially. Y-chromosome and mitochondrial DNA haplogroups can allow extension along the paternal and maternal lines back tens of thousands of years, but essentially this is solving the inverse problem – the lines of descent of the last common ancestor of some group of men and women in a given place. Go back 10 generations and each person has about 1000 discrete ancestors – and probably some overlap. Can we create a genetic footprint of all 78 million Europeans in the year 1600 and uniquely map them to their living descendants? Mostly yes, for the ones that have living descendants, and probably half of them don’t. But try this for the 61 million in 1500 and the map will be much more fuzzy. In some cases, extensive records and/or high genetic diversity can fill in many of the gaps, but in general I think it cannot be done. I may be wrong (please let me know) but extending further back becomes exponentially hard. Even this technique, which recovers a fragment of evidence for the existence of a whole person’s life, only goes back so far.
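The numbers behind "about 1000 ancestors" and "probably some overlap" are easy to reproduce. This sketch assumes the idealized pedigree, where every ancestor has two distinct parents and each contributes an equal expected share of DNA. Real inheritance is lumpier, and the function names are invented for the example.

```python
def pedigree_slots(generations):
    """Number of ancestor slots in the family tree g generations back."""
    return 2 ** generations


def expected_share(generations):
    """Average fraction of your autosomal DNA from one such ancestor."""
    return 1 / 2 ** generations


def generations_until_overlap(population):
    """Smallest g at which the tree has more slots than people alive."""
    g = 0
    while pedigree_slots(g) <= population:
        g += 1
    return g


slots_10 = pedigree_slots(10)            # 1024
share_10 = expected_share(10)            # just under 0.1%
forced_overlap = generations_until_overlap(78_000_000)  # 27
```

Ten generations back there are 1024 slots, each contributing on average less than a tenth of a percent of the genome. By 27 generations the tree has more slots than the 78 million Europeans of 1600, so the same people must fill many slots; in practice overlap starts far earlier, because nobody's ancestors are drawn from a whole continent.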

I initially found this realization a bit sad. I thought at one point that if we sequenced all 8 billion people on Earth we could reconstruct the human family tree back to the first human half a million years ago, back to mitochondrial Eve and Y-chromosomal Adam, and everyone in between. Going forward, if we keep detailed enough records (including genetic sequencing) we can resolve this ambiguity, so genealogy researchers in the year 3000 might not suffer this frustration! But in general, it is sobering to think that our bodies are constructed by genes that are, in many cases, billions of years old. Little pieces of information that have been conserved even as mountains rose and fell, and whose individual identity, even within a discrete organism or species, only really exists in a statistical sense for 10 or so generations before even the might of our statistical capability cannot sift it from the noise.

Anyway, if you know anything about John Jones or Lydia Williams of Oswestry, Shropshire, let me know!

One thought on “Genealogy Ancestor Search Using Math”

  1. I have just finished re-reading Donald Kingsbury’s Psychohistorical Crisis, a novel loosely connected to Asimov’s Foundation series. The way that information degrades as you go backwards into history is explained. It’s a very, very good book, and your blog brought it to mind. My surname is nearly as rare, but I have not done the work that you have.
