Exploring family relations from online obituaries using text mining and data visualisation tools
Using a copy of all the obituaries published online by the Ahram newspaper from January 2002 to April 2008, it is possible to use Linux command-line tools (gawk, sed, bash) to find family relations between individuals in certain professions. The example given here explores the family links within a sample of 456 Egyptian state security officers.
This is a very brief description of the method.
The first step is to convert the HTML files downloaded by curl into one giant text file.
The next is to move each separate obituary onto a line of its own.
Then the officer names, which sit sandwiched between a rank and a place of work, are extracted into a separate text file.
Finally, by searching for the name of each officer in every obituary, family links between different officers can be discovered.
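The last two steps can be sketched roughly as follows. This is not the attached script: it uses Python instead of gawk, sed and bash, and it assumes that two officers named in the same obituary count as linked. The rank titles, the workplace phrase and the sample obituaries are all made up for illustration; the real data is in Arabic and would need the actual wording.

```python
import re
from itertools import combinations

# Hypothetical markers: the real obituaries are in Arabic, so these
# would be replaced by the actual rank titles and workplace phrase.
RANKS = ["Colonel", "Major", "Captain"]
WORKPLACE = "of State Security"


def extract_officers(obituaries):
    """Return the set of names found between a rank and the workplace phrase."""
    rank_alt = "|".join(re.escape(r) for r in RANKS)
    pattern = re.compile(rf"(?:{rank_alt})\s+(.+?)\s+{re.escape(WORKPLACE)}")
    names = set()
    for text in obituaries:
        names.update(m.strip() for m in pattern.findall(text))
    return names


def find_links(obituaries, officers):
    """Return pairs of officers whose names appear in the same obituary."""
    links = set()
    for text in obituaries:
        present = sorted(name for name in officers if name in text)
        links.update(combinations(present, 2))
    return links


# Invented sample data, one obituary per string (one per line in the text file).
obituaries = [
    "The late A is the uncle of Colonel Adel Hassan of State Security "
    "and of Major Samir Fathy of State Security.",
    "The late B is the father of Captain Omar Nabil of State Security.",
    "The late C is the cousin of Major Samir Fathy of State Security "
    "and father-in-law of Captain Omar Nabil of State Security.",
]

officers = extract_officers(obituaries)
links = find_links(obituaries, officers)
```

Plain substring matching like this is also where duplicate or near-identical names would slip in, so the extracted name list probably needs a manual check before the links are trusted.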
The output is in Graphviz .dot format, which can be rendered as a graph similar to the one below.
Graph showing 63 family links between 174 officers from 456 Egyptian state security officers. Each link corresponds to a family tie of a variable degree of relationship including in-laws.
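For readers who have not seen the .dot format, here is a minimal sketch of how a set of links could be written out as an undirected Graphviz graph. Again this is Python rather than the attached script, and the function name, graph name and sample pairs are invented for illustration.

```python
def to_dot(links, graph_name="officers"):
    """Format pairs of names as an undirected Graphviz graph in DOT syntax."""

    def quote(name):
        # Escape embedded double quotes, then wrap the name in quotes.
        return '"' + name.replace('"', '\\"') + '"'

    lines = [f"graph {graph_name} {{"]
    for a, b in sorted(links):
        # "--" is the undirected edge operator in DOT.
        lines.append(f"    {quote(a)} -- {quote(b)};")
    lines.append("}")
    return "\n".join(lines)


# Invented sample links, one tuple per family tie.
links = {("Adel Hassan", "Samir Fathy"), ("Omar Nabil", "Samir Fathy")}
dot_text = to_dot(links)
print(dot_text)
```

Saved to a file such as links.dot, the text can be turned into an SVG with `dot -Tsvg links.dot -o links.svg`.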
This is just a preview of what might be possible using a data set of 43,156 obituaries. Without control group(s), this graph says nothing beyond being pretty and showing that there are family links between officers. Other analyses could apply statistical methods to the data to answer different questions.
UPDATE: I attached the list of SS officers and the script used to find links between officers and output a Graphviz .dot file. You can download the obituaries dataset from here.
UPDATE: Zeinobia wrote a very interesting post about the use of Ahram obituaries by the IDF before 1973 and what the obituaries mean to many Egyptian families:
Before the Yom Kippur war in 1973 the IDF soldiers and officers used to stand on the East bank of the Suez Canal calling Egyptian army officers and soldiers by name along with their families names surprisingly in order to prove that the IDF is the best army in this universe that knew everything everywhere.
The intelligence did not take to much long to figure out how the IDF and also Mossad got their info : Al Ahram Obituary pages , the most famous Who is Who pages in Egypt society and the cash cow of the famous newspaper along with its ad. Egyptian families especially the big and rich ones like to show off with their relations through big obituary , it has become a shallow and silly tradition that the longer the obituary is regardless of how expensive it will be , the most prestigious the family is. As personal experience there are still family members who are angry from my grandma on how she dared and published a small obituary for my grandfather in the famous newspaper !!
UPDATE: If you look carefully you will find a couple of mistakes, such as duplicate names. I will try to fix those, but not now as I am incredibly busy.
Fucking brilliant man!
This is absolutely brilliant
That is so damn interesting and geeky! My geek senses are having an orgasm just for the possibility of doing that on linux :) i actually didnt know its possible to parse arabic text using sed and awk!
Well done, would be very kind if you could share the scripts too; i would really love to have a look at these.
LOL, gawk and bash gave me a headache not orgasm..
Actually, it's Graphviz ability to do Arabic by creating SVG files that amused me the most.
I attached the files.
Excellent work and i wonder how did you get all these obituaries? Impressive!
I'd like to see the same done for the judiciary, the police and public universities. I already have an idea of what the graphs for those would look like, but it would still be cool to visualize.