This article was first published on 14 June 2009.
The great social site Stack Overflow has just announced the publication of its entire database under a Creative Commons free licence. I believe this is the first time such an important social site has published its data, so it is a great day for data miners. In this short article, I explain how I loaded these data into a PostgreSQL database for easier mining. (The work was done on a Debian machine but it should work on any Unix.)
The original file is huge (200 megabytes today, and growing). To avoid killing the Stack Overflow servers, it is distributed in a peer-to-peer fashion with BitTorrent. I just downloaded the torrent file http://blog.stackoverflow.com/wp-content/uploads/so-export-2009-06.7z.torrent to my download directory and let my BitTorrent client do the rest. I then extracted the XML files with p7zip. Each XML file stores one class of Stack Overflow objects (users, posts, and so on).
I then created one SQL table for each class. The complete DDL instructions are in the file so-create.sql. I can then create the database and its schema:
% createdb --encoding=UTF-8 so
% psql -f so-create.sql so
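To give an idea of the shape of these tables, here is the kind of definition one can expect in so-create.sql (a sketch only, not an excerpt of the real file; the reputation and name columns appear later in this article, the other details are guesses):

-- One table per class of objects; a hypothetical example for the users.
CREATE TABLE Users (
    id INTEGER PRIMARY KEY,
    reputation INTEGER,
    name TEXT            -- deliberately without NOT NULL, see below
);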
I am normally a fan of integrity constraints in databases. But, in the case of Stack Overflow, there are many obvious integrity constraints that I did not include because they are violated at least once, especially in old data (presumably dating from the beta testing of the site). For instance, 18,239 users (20 %) have no name (see for instance http://stackoverflow.com/users/57428, the one with the highest reputation) and therefore I cannot write name TEXT NOT NULL.
Same problem with the accepted answer: some posts reference an answer which is not available (for instance post #202271 has an accepted answer in the XML file, #202526, which does not exist).
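To make this concrete, these are the kinds of statements that had to be left out (again a sketch; acceptedanswerid and id are assumed column names):

-- Both statements would be rejected because of the old data:
ALTER TABLE Users ALTER COLUMN name SET NOT NULL;          -- 18,239 users have no name
ALTER TABLE Posts ADD CONSTRAINT accepted_answer_exists
    FOREIGN KEY (acceptedanswerid) REFERENCES Posts (id);  -- #202271 points to the missing #202526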
Once the database is set up, we just have to parse the XML files and load them into the database. I chose the Python programming language and the ElementTree XML library. I produce SQL files which use the COPY instruction.
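The generated file is basically a sequence of COPY blocks, one per table, looking roughly like this (the rows are made up for the example and the column names are assumptions):

-- Tab-separated rows; \N stands for NULL (for instance a user without a name).
COPY Users (id, reputation, name) FROM stdin;
42	150	Alice
43	1	\N
\.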
The Python program is available as so-so2postgresql.py. To execute it, you just indicate the directory where the Stack Overflow XML files have been extracted. Then, you feed the produced SQL COPY file to PostgreSQL, together with the name of the database:
% python so-so2postgresql.py /where/the/files/are > so.sql
% psql -f so.sql so
This so-so2postgresql.py program requires a lot of memory, because it keeps the entire XML tree in memory (that is a weakness of the XML library used). Do not attempt to run it on a machine with less than eight gigabytes of physical RAM; swapping would make it awfully slow. You may also have to increase the available memory with ulimit.
Once the program has finished, you can start studying the data:
so=> SELECT avg(reputation)::INTEGER FROM Users;
 avg
-----
 183
(1 row)

so=> SELECT avg(reputation)::INTEGER FROM Users WHERE reputation > 1;
 avg
-----
 348
(1 row)
The first analysis produced with this database was an exploration of the "fastest gun in the West" syndrome (available at stack-overflow-short-tail.html, in French).
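As an illustration of the kind of question this database makes easy to ask, a query in that spirit could look like this (a sketch only; it assumes the Posts table has creationdate and acceptedanswerid columns):

-- Average delay between a question and its accepted answer.
SELECT avg(a.creationdate - q.creationdate) AS average_delay
    FROM Posts q, Posts a
    WHERE a.id = q.acceptedanswerid;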
Many people have already written about the Stack Overflow database.