Exploring family relations from online obituaries using text mining and data visualisation tools

Using a copy of all the obituaries published online by the Ahram newspaper from January 2002 to April 2008, it is possible to use Linux command line tools (gawk, sed, bash) to find family relations between individuals in certain professions. The example given here explores the family links within a sample of 456 Egyptian state security officers.

This is a very brief description of the method.

  1. Convert the HTML files downloaded by curl into one giant text file.
  2. Move each obituary onto a line of its own.
  3. Extract the officer names, which are sandwiched between rank and place of work, into a separate text file.
  4. Search for the name of each officer in every obituary; this is how family links between different officers are discovered.
  5. Write the output in Graphviz .dot format, which draws a graph similar to the one below.

    Graph showing 63 family links between 174 officers, out of a sample of 456 Egyptian state security officers. Each link corresponds to a family tie of varying degree, including in-laws.
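
To make the steps listed above more concrete, here is a minimal sketch in Python. It is not the gawk/sed/bash script used for the graph, only an illustration of the same idea. The rank titles, the workplace phrase, the sample obituaries and all function names are hypothetical (the real obituaries are in Arabic and the original patterns are not given in this post), and it assumes that two officers named in the same obituary count as one family link.

```python
import re
from itertools import combinations

# Hypothetical rank titles and workplace phrase, chosen for illustration only.
RANKS = ("General", "Colonel", "Major", "Captain")
WORKPLACE = "of State Security"

# A name is whatever sits between a rank and the workplace phrase (step 3).
NAME_RE = re.compile(
    r"(?:%s) (.+?) %s" % ("|".join(RANKS), re.escape(WORKPLACE))
)


def extract_officers(obituaries):
    """Collect the names found between a rank and the workplace phrase."""
    names = set()
    for text in obituaries:
        names.update(NAME_RE.findall(text))
    return sorted(names)


def find_links(obituaries, officers):
    """Link every pair of officers named in the same obituary (step 4)."""
    links = set()
    for text in obituaries:
        # Plain substring search; 'officers' is sorted, so each pair
        # always comes out in the same order and is stored only once.
        present = [name for name in officers if name in text]
        links.update(combinations(present, 2))
    return sorted(links)


def to_dot(links):
    """Render the links as an undirected Graphviz graph (step 5)."""
    lines = ["graph officers {"]
    lines += ['    "%s" -- "%s";' % pair for pair in links]
    lines.append("}")
    return "\n".join(lines)


# Made-up sample data, one obituary per line as in step 2.
obituaries = [
    "Mourning Mr X, uncle of Colonel Adel Samir of State Security and "
    "Major Hany Fouad of State Security.",
    "Mourning Mrs Y, mother of Captain Karim Nabil of State Security.",
    "Mourning Mr Z, father-in-law of Hany Fouad and cousin of Karim Nabil.",
]

officers = extract_officers(obituaries)
links = find_links(obituaries, officers)
print(to_dot(links))
```

Saving the printed text to a file and running it through Graphviz (for example `dot -Tpng links.dot -o links.png`) should draw the graph. Note that plain substring matching will confuse officers who share a name or whose name is contained in a longer one, so real data would need more careful matching.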

    This is just a preview of what might be possible using a data set of 43,156 obituaries. Without control group(s), this graph says nothing other than that it's pretty and that there are family links between officers. Statistical methods could be applied to the same data to answer different questions.

    UPDATE: I have attached the list of SS officers and the script used to find links between officers and output a Graphviz .dot file. You can download the obituaries dataset from here.
