This article was first published on 14 June 2009.
The great social site Stack Overflow has just announced the publication of its entire database under a Creative Commons free licence. I believe this is the first time such an important social site has published its data, so it is a great day for data miners. In this short article, I explain how I loaded these data into a PostgreSQL database for easier mining. (The work was done on a Debian machine but it should work on any Unix.)
The original file is huge (200 megabytes today, and growing). To avoid killing the Stack Overflow servers, it is distributed in a peer-to-peer fashion with BitTorrent. I just downloaded the torrent file http://blog.stackoverflow.com/wp-content/uploads/so-export-2009-06.7z.torrent to my download directory and let my BitTorrent client do the rest. I then extracted the XML files with p7zip. Each XML file stores one class of Stack Overflow objects (users, posts, and so on).
I then created one SQL table for each class. The complete DDL instructions are in the file so-create.sql. I can then create the database and its schema:
% createdb --encoding=UTF-8 so
% psql -f so-create.sql so
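To give an idea of the shape of these tables, here is the kind of definition one can expect in so-create.sql (a sketch only, not an excerpt of the real file; the reputation and name columns appear later in this article, the other details are guesses):

-- One table per class of objects; a hypothetical example for the users.
CREATE TABLE Users (
    id INTEGER PRIMARY KEY,
    reputation INTEGER,
    name TEXT            -- deliberately without NOT NULL, see below
);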
I am normally a fan of integrity constraints in databases. But, in the case of Stack Overflow, there are many obvious integrity constraints that I did not include because they are violated at least once, especially in old data (presumably dating from the beta testing of the site). For instance, 18,239 users (20 %) have no name (see for instance http://stackoverflow.com/users/57428, the one with the highest reputation) and therefore I cannot write name TEXT NOT NULL.
Same problem with the accepted answer: some posts reference an answer which is not available (for instance post #202271 has an accepted answer in the XML file, #202526, which does not exist).
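To make this concrete, these are the kinds of statements that had to be left out (again a sketch; acceptedanswerid and id are assumed column names):

-- Both statements would be rejected because of the old data:
ALTER TABLE Users ALTER COLUMN name SET NOT NULL;          -- 18,239 users have no name
ALTER TABLE Posts ADD CONSTRAINT accepted_answer_exists
    FOREIGN KEY (acceptedanswerid) REFERENCES Posts (id);  -- #202271 points to the missing #202526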
Once the database is set up, we just have to parse the XML files and load them into the database. I chose the Python programming language and the ElementTree XML library. I produce SQL files which use the COPY instruction.
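The generated file is basically a sequence of COPY blocks, one per table, looking roughly like this (the rows are made up for the example and the column names are assumptions):

-- Tab-separated rows; \N stands for NULL (for instance a user without a name).
COPY Users (id, reputation, name) FROM stdin;
42	150	Alice
43	1	\N
\.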
The Python program is available as so-so2postgresql.py. To execute it, you just indicate the directory where the Stack Overflow XML files have been extracted. Then, you feed the produced SQL COPY file to PostgreSQL, together with the name of the database:
% python so-so2postgresql.py /where/the/files/are > so.sql
% psql -f so.sql so
This so-so2postgresql.py program requires a lot of memory, because it keeps the entire XML tree in memory (that is a weakness of the XML library used). Do not attempt to run it on a machine with less than eight gigabytes of physical RAM; swapping would make it awfully slow. You may also have to increase the available memory with ulimit.
Once the program has finished, you can start studying the data:
so=> SELECT avg(reputation)::INTEGER FROM Users;
 avg
-----
 183
(1 row)

so=> SELECT avg(reputation)::INTEGER FROM Users WHERE reputation > 1;
 avg
-----
 348
(1 row)
The first analysis produced with this database was an exploration of the "fastest gun in the West" syndrome (available at stack-overflow-short-tail.html, in French).
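As an illustration of the kind of question this database makes easy to ask, a query in that spirit could look like this (a sketch only; it assumes the Posts table has creationdate and acceptedanswerid columns):

-- Average delay between a question and its accepted answer.
SELECT avg(a.creationdate - q.creationdate) AS average_delay
    FROM Posts q, Posts a
    WHERE a.id = q.acceptedanswerid;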
Many people have already written about the Stack Overflow database.