Boris Smus

interaction engineering

Diffusion of Literacy in 19th Century Canada

The ancient Sumerians invented writing sometime around 3500 BCE. But how did writing get refined? How did it spread outside of Sumer? In general, tracking diffusion of ancient technology is hard. For tracking the spread of literacy, however, here's an interesting idea. Innumerate people may fudge their self-reported age to round or auspicious numbers. Individually, this leads to terrible earworms. In aggregate, this error is called age heaping and may be a decent proxy for literacy. In this post, I dig into the Province of Canada's 1852 census, scraped from automatedgenealogy.com. To whet your appetite, just look at these beautiful age heaps:

Age Heaps in Canada's 1852 census

Splitting the census data by demographics and calculating the ABCC Index on each, what can we infer about literacy in 19th Century Canada?

Timelines shminelines

It's easy to create a timeline of historical inventions. Steam Engine? No problem: 100 CE. Musical Notation? 2000 BCE. Bam! Land mines? 1277. There are many such timelines available. But the first known working date of an invention is usually a minuscule part of the whole story. More interesting is what led to the invention being practical. How was it refined? How did it spread? What was the impact? What cascade of other inventions did it lead to? What were the cultural implications, the second order effects?

Technological diffusion is hard to measure

William Gibson famously quipped, "the future is already here — it's just not very evenly distributed." Well, how uneven is the distribution? What about over time? For ancient inventions, the archaeological record is often sparse. Historians of technology fall back on depictions in art and on philology, both often controversial sources.

Aside: psychology of numbers

I recently came across some finishing times for the 2019 Boston Marathon, which included very clear peaks at the qualifying time (~3h for that year) as well as round numbers:

Boston Marathon 2019 finishing times

It's inspiring to see people strive for and achieve goals. When I tweeted this, @MelancholyYuga introduced me to a phenomenon I'd not heard of called Age Heaping.

Literacy from self-reported age?

Here's the theory: people who struggle with numbers often don't know their own age precisely, and instead report a round or lucky number. Perhaps they can't subtract their birth year from the current year to infer their age, or perhaps they don't know their birth year at all. In aggregate, these errors add up to create Age Heaps, clearly visible at decades, with smaller peaks at ages ending in 5.
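One quick way to see heaping in a list of reported ages is to tally the last digit of each age. With accurate reporting, each digit should take roughly a tenth of the total; heaping shows up as an excess of 0s and 5s. This is a minimal sketch under my own assumptions: the function name and the toy sample are hypothetical, not the census data.

```python
from collections import Counter


def terminal_digit_shares(ages):
    """Return the share of reported ages ending in each digit 0-9."""
    if not ages:
        raise ValueError("need at least one reported age")
    counts = Counter(age % 10 for age in ages)
    total = len(ages)
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


# Hypothetical toy sample, invented for illustration only.
reported = [30, 30, 40, 45, 50, 50, 60, 33, 47, 52]
shares = terminal_digit_shares(reported)

for digit in range(10):
    print(f"ages ending in {digit}: {shares[digit]:.0%}")
```

In this toy sample, six of the ten ages end in 0, far above the one in ten expected without heaping.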

Given any survey with self-reported ages, one can calculate a metric that estimates the literacy of a population. Various indices, such as Whipple's Index and the ABCC Index, quantify this.
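For concreteness, here is a sketch of both indices as they are commonly defined. Whipple's Index looks at ages 23 to 62, counts the ages ending in 0 or 5, and compares that count to the one fifth you'd expect with no heaping, scaled so that 100 means no heaping and 500 means everyone reported a multiple of 5. The ABCC Index rescales this to a 0-100% range, where 100% means no detectable heaping. The function names are my own, and this is not necessarily the exact code behind the numbers in this post.

```python
def whipple_index(ages, lo=23, hi=62):
    """Whipple's Index over ages lo..hi: 100 = no heaping, 500 = all heaped."""
    in_range = [a for a in ages if lo <= a <= hi]
    if not in_range:
        raise ValueError("no ages in the 23-62 range")
    heaped = sum(1 for a in in_range if a % 5 == 0)
    # Equivalent to 100 * heaped / (len(in_range) / 5).
    return 500.0 * heaped / len(in_range)


def abcc_index(ages):
    """ABCC Index in percent: 100 = no detectable heaping, 0 = all heaped."""
    w = whipple_index(ages)
    if w < 100:
        return 100.0
    return (1 - (w - 100) / 400) * 100


# Every age from 23 to 62 exactly once: no heaping at all.
uniform = list(range(23, 63))
# The same population plus 40 extra people all claiming to be 40.
heaped_sample = uniform + [40] * 40

print(whipple_index(uniform), abcc_index(uniform))
print(whipple_index(heaped_sample), abcc_index(heaped_sample))
```

The 23 to 62 window covers four full decades, so each terminal digit appears equally often when ages are reported accurately.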

Finding an early census is hard

Rather than read boring papers, I decided to try this idea on a raw dataset. I immediately thought of the Domesday Book, a famous medieval English survey. Happily, it's been digitized and is available online in quite a nice form at opendomesday.org. Unfortunately, the survey was not granular enough to ask for self-reported ages, mostly focusing on the shire level.

Newer historical census data is generally owned by the government, but then digitized and indexed by private companies usually focused on genealogy, such as https://familysearch.org. A fascinating wrinkle on early US censuses: before 1840, the US census only named the head of household and then bucketed dependents into seemingly arbitrary age categories: (0-10, 10-16, 16-26, 26-45, 45+). This throws a wrench into finding any Age Heaps.

Another source of data with self-reported ages is US passenger arrival records, for example this Ellis Island arrival database. Once again, you can search the corpus via their website, but they do not provide the underlying dataset, even if you ask them nicely by email.

1852 Census of the Province of Canada

I was about to throw in the towel but then found https://automatedgenealogy.com, a crowdsourced effort to digitize old (pre-1920) Canadian censuses. Digitizing a census is tricky since the records were entirely handwritten in flowing cursive, and I don't think we have good enough OCR to make a dent in this problem. So the process involves transcribers and verifiers working in tandem. I'm not entirely sure how well vetted the data from the site is, but I scraped it to do this analysis anyway.

This census includes the following fields:

  1. Name
  2. Occupation
  3. Country of Origin
  4. Religion
  5. Self-reported Age (YAY!)
  6. Sex

The self-reported age field is key. But the other demographic information is also interesting. Looking at ABCC scores for sub-populations will be fun.
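Computing a score per sub-population boils down to grouping records by one field and running the index on each group's ages. Here is a rough sketch, assuming each scraped record is a dict with an "age" key plus demographic keys such as "occupation". Those field names, the helper names, and the toy records are all hypothetical, not the actual scraped schema.

```python
from collections import defaultdict


def abcc(ages, lo=23, hi=62):
    """ABCC Index in percent for ages lo..hi, or None if no ages qualify."""
    in_range = [a for a in ages if lo <= a <= hi]
    if not in_range:
        return None
    whipple = 500.0 * sum(1 for a in in_range if a % 5 == 0) / len(in_range)
    return 100.0 if whipple < 100 else (1 - (whipple - 100) / 400) * 100


def abcc_by_group(records, field):
    """Group records by `field` and compute the ABCC Index for each group."""
    ages_by_group = defaultdict(list)
    for record in records:
        ages_by_group[record[field]].append(record["age"])
    return {group: abcc(ages) for group, ages in ages_by_group.items()}


# Hypothetical toy records, invented for illustration only.
records = [
    {"occupation": "Merchant", "age": 31},
    {"occupation": "Merchant", "age": 42},
    {"occupation": "Merchant", "age": 53},
    {"occupation": "Merchant", "age": 27},
    {"occupation": "Merchant", "age": 35},
    {"occupation": "Servant", "age": 30},
    {"occupation": "Servant", "age": 40},
    {"occupation": "Servant", "age": 50},
    {"occupation": "Servant", "age": 44},
    {"occupation": "Servant", "age": 40},
]

scores = abcc_by_group(records, "occupation")
for group, score in scores.items():
    print(f"{group}: {score:.0f}%")
```

Small groups give noisy scores, so it's worth reporting the group size next to each score.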

ABCC Indices by demographics

First off, I was pleasantly surprised to see very clear age heaping in this census.

                              ABCC Score      Souls
Overall                              88%    1134930
Geography
  East (Quebec)                      91%     481108
  West (Ontario)                     85%     653822
Birthplace
  Born in Canada                     92%     273873
  Born outside Canada                87%     861057
Occupation
  Labourer                           79%      83754
  Farmer                             86%      80544
  Servant                            71%      13378
  Wife                               89%      11214
  Spinster                           87%       5992
  Carpenter                          88%       5511
  Yeoman                             88%       4505
  Blacksmith                         90%       3226
  Merchant                           92%       2327
  Housekeeper                        82%       1743
  Teacher                            85%       1737
  Tailor                             91%       1713
  Cooper                             87%       1617
  Weaver                             81%       1277
  Miller                             89%       1137
Religion
  Catholic                           78%     176343
  Methodist                          92%     145544
  Church of England                  84%     111262
  Presbyterian                       85%     104949
  Episcopalian                       86%      37826
  Church of Rome                     73%      11208
  No Religion (explicitly)           91%       2850

Wait, what exactly are ABCC Indices measuring?

But is it really the case that if you don't know your age, you're likely to be innumerate? All signs point to yes.

More seriously, a big caveat applies here:

Though it can stand in as an acceptable proxy for literacy, our findings suggest that age-heaping is most plausibly interpreted as a broad indicator of cultural and institutional modernization rather than a measure of cognitive skills.

Mostly open questions around the edges

I suppose it makes sense that merchants (ABCC 92%) and skilled craftsmen (Blacksmith 90%, Tailor 91%) would be more advanced on a measure of "cultural and institutional modernization" as compared to servants (71%) and laborers (79%).

It also makes sense that those born in Canada would remember their age better than immigrants. The simplest explanation is that those born in Canada are generally younger. Looking at the two age distributions side-by-side is revealing:

But why are residents of Canada East so much better at telling their own age than those of Canada West?

And why do Methodists know their age so much better than Catholics? I've always found Protestant denominations mysterious. The plot thickens!

Speculating wildly, if your culture is really into birthdays, you would closely track your age and the age of close relatives. Knowing your own age would then have little bearing on your ability to write or do arithmetic, throwing shade on the whole premise.

Eyes on the prize

I'd originally hoped to find a way to measure diffusion of literacy over time. But the 1852 Census of Canada only gives a single snapshot in time, and other census data available on automatedgenealogy.com does not overlap geographically. In search of more data, I found and purchased Russians to America Passenger Data, covering 1834 - 1900.

Quick pre-registration of hypotheses on that dataset:

  • Subsequent waves of Russian immigrants exhibit increasing ABCC.
  • Strong correlation between size of city of origin and ABCC.

Stay tuned and find out.