Notes and Technical Details

The online version of this family history was first generated by scanning the typewritten version and using optical character recognition (OCR) – via pyocr and Tesseract – to convert the typed pages into plain text. (The pages were first converted from PDF to JPEG images using Pillow, a fork of the Python Imaging Library.) This produced mostly usable text, but it still required fairly significant manual cleaning to correct errors.

Writing a script to reliably parse all of the details in the text proved challenging, and ultimately the data was entered manually into three CSV files, covering people, children, and marriages.
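As a rough illustration of how three such files can be tied together, here is a minimal sketch in Python. The column names (person_id, spouse1_id, parent_id, and so on) and the sample rows are hypothetical, chosen only for this example; they are not the actual layout of the files used by the app.

```python
import csv
import io

# Hypothetical sample data standing in for the three CSV files.
# The real files' column names are not documented here.
PEOPLE_CSV = """person_id,name,gender
1,Ada Example,F
2,Ben Example,M
3,Cy Example,M
"""
MARRIAGES_CSV = """spouse1_id,spouse2_id
1,2
"""
CHILDREN_CSV = """parent_id,child_id
1,3
2,3
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_tree(people_text, marriages_text, children_text):
    """Combine the three tables into one dict keyed by person id."""
    people = {
        int(row["person_id"]): {
            "name": row["name"],
            "gender": row["gender"],
            "spouses": [],
            "children": [],
        }
        for row in read_rows(people_text)
    }
    # A marriage links both spouses to each other.
    for row in read_rows(marriages_text):
        a, b = int(row["spouse1_id"]), int(row["spouse2_id"])
        people[a]["spouses"].append(b)
        people[b]["spouses"].append(a)
    # Each child row links one parent to one child.
    for row in read_rows(children_text):
        people[int(row["parent_id"])]["children"].append(int(row["child_id"]))
    return people


tree = build_tree(PEOPLE_CSV, MARRIAGES_CSV, CHILDREN_CSV)
```

Keeping relationships in their own files, rather than as columns on each person, means a person with several marriages or many children needs no special handling.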

Note: During this process, gender was added as an additional piece of information for each person. However, adding this information involved making some assumptions. First, while many names are largely masculine or feminine, in some cases a gender-neutral name meant that I had to guess a person’s gender based on the name of their spouse, which may have led to inaccuracies. Second, in all cases, the assumption is that individuals identify with the gender associated with their birth name, which may not always be the case. Individuals may be transgender, non-binary, or otherwise not self-identify with the gender that is stated here. (If you are aware of any inaccuracies in this regard, please let me know and I will be happy to correct them.) Gender can play such a large role in people’s lives that it felt important to include here, but it is important to keep the assumptions behind this data in mind.

Initially, the data was loaded into Neo4j, a graph database that stores data as a network of nodes and edges (i.e., connections between nodes). Given that a family tree is a graph structure, this seemed like a natural fit. However, in performance benchmarks, PostgreSQL turned out to be faster, with latency roughly 50-70% lower than Neo4j's. Note that these benchmarks measured both the database and the Python packages used to interface with it, and were highly specific to the particular needs of this app. In addition, in absolute terms Neo4j still returned results on the order of hundredths of a second; for this relatively small amount of data, the choice of database is largely inconsequential. In the end, PostgreSQL was chosen partly because it also allows the data to be exported to CSV format in a more readily consumable layout.
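To illustrate how a tree can be traversed in a relational database, here is a minimal sketch using a recursive query. It is not the app's actual code: the table and column names are hypothetical, and it uses Python's built-in sqlite3 module as a stand-in so that it runs without a database server. PostgreSQL supports the same WITH RECURSIVE construct.

```python
import sqlite3


def descendants(conn, person_id):
    """Return (name, generation) for every descendant of person_id."""
    query = """
        WITH RECURSIVE descendants(id, depth) AS (
            -- Base case: the direct children of the starting person.
            SELECT child_id, 1 FROM children WHERE parent_id = ?
            UNION
            -- Recursive step: children of anyone already found.
            SELECT c.child_id, d.depth + 1
            FROM children AS c
            JOIN descendants AS d ON c.parent_id = d.id
        )
        SELECT p.name, MIN(d.depth) AS generation
        FROM descendants AS d
        JOIN people AS p ON p.id = d.id
        GROUP BY p.id, p.name
        ORDER BY generation, p.name
    """
    return conn.execute(query, (person_id,)).fetchall()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE children (parent_id INTEGER, child_id INTEGER);
    INSERT INTO people VALUES
        (1, 'Grandparent'), (2, 'Parent'), (3, 'Child'), (4, 'Unrelated');
    INSERT INTO children VALUES (1, 2), (2, 3);
""")

result = descendants(conn, 1)
```

With psycopg2 the query text would be much the same, except that parameters use the %s placeholder rather than ?.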

The app itself is built with Flask, a Python web framework, and uses psycopg2 to access the PostgreSQL database. The graphical family tree representations are drawn using dTree. Both the Flask app and the database run in Docker containers for easier maintenance and dependency management.

Code for the app is available on GitHub.