Python, XML and XML-RPC: A perfect combination

The Problem

I used to host this website at Startlogic.com, and since my account was expiring soon, I wanted to take the opportunity to try out a new web hosting service. The move itself wasn't that bad as far as the site content went; the problem was moving the hundreds of posts I had made on my blog. I wasn't ready to just delete all that hard work and move on, and I had to come up with a solution fast, before they restricted my account. This is where a little bit of Python hacking came in handy, along with the ability to export my Wordpress SQL tables into an XML file and the XML-RPC blogging APIs that Wordpress supports. More on that later.

Wordpress Posts

Wordpress stores all the posts and comments in a MySQL database, in tables that are intuitively named. So I went into phpMyAdmin and, using a simple SQL statement,

select * from wp_posts
I got a list of all my blog postings and exported them to an XML file using phpMyAdmin's built-in export feature. I now had a copy of all the posts in one conveniently formatted file. The problem now was to parse this file, extract the relevant data and then somehow post it to my new blog, all automatically.

The XML file

The XML file that was created by phpMyAdmin looked somewhat like this:

<farhanah_wordpress>
  <!-- Table wp_posts -->
    <wp_posts>
        <ID>1</ID>
        <post_author>1</post_author>
        <post_date>2004-12-12 17:59:51</post_date>
        <post_date_gmt>2004-12-13 01:59:51</post_date_gmt>
        <post_content>Welcome to FarhanAhmed.net!

I am sorry about the downtime for the last couple of weeks. I realized that the service I should except from my hosting service provider should be directly proportional to the amount they charge me - unfortunately they turned out to be really unreliable and so I had to switch. Let's see how these people at &lt;a href=&quot;http://www.startlogic.com&quot;&gt;StartLogic&lt;/a&gt; turn out to be.

</post_content>
        <post_title>Welcome to FarhanAhmed.net</post_title>
        <post_category>1</post_category>
        <post_excerpt></post_excerpt>
        <post_status>publish</post_status>
        <comment_status>open</comment_status>
        <ping_status>open</ping_status>
        <post_password></post_password>
        <post_name>hello-world</post_name>
        <to_ping></to_ping>
        <pinged></pinged>
        <post_modified>2004-12-12 19:15:10</post_modified>
        <post_modified_gmt>2004-12-13 03:15:10</post_modified_gmt>
        <post_content_filtered></post_content_filtered>
        <post_parent>0</post_parent>
        <guid>http://farhanahmed.net/?p=1</guid>
        <menu_order>0</menu_order>
    </wp_posts>
As you can see, each post is embedded in a <wp_posts> item, and within each item the relevant data is conveniently stored as <post_content>, <post_title> and <post_date>. These three pieces of data are what I wanted to export to my new blog; I couldn't care less about the other ones.
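One detail of the export worth seeing in action: the HTML inside <post_content> is stored with escaped entities (&lt;, &quot; and so on), and the XML parser turns them back into literal markup when the text is read. The sketch below is written for Python 3 rather than the Python 2.4 used in this article, and the shortened SAMPLE document is made up for illustration; only the element names come from the export shown above.

```python
from xml.dom import minidom

# A cut-down, hypothetical version of the exported file with one post.
SAMPLE = """<farhanah_wordpress>
    <wp_posts>
        <post_date>2004-12-12 17:59:51</post_date>
        <post_content>Hosted at &lt;a href=&quot;http://www.startlogic.com&quot;&gt;StartLogic&lt;/a&gt;.</post_content>
        <post_title>Welcome to FarhanAhmed.net</post_title>
    </wp_posts>
</farhanah_wordpress>"""

doc = minidom.parseString(SAMPLE)
post = doc.getElementsByTagName("wp_posts")[0]

# The text node holds the unescaped content, ready to post as HTML.
content = post.getElementsByTagName("post_content")[0].firstChild.data
title = post.getElementsByTagName("post_title")[0].firstChild.data

print(title)
print(content)
doc.unlink()
```

So no manual unescaping step should be needed between reading the file and posting the content.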

XML-RPC Interface to Wordpress

Wordpress supports the Blogger, metaWeblog and Movable Type APIs for posting to and reading from a blog. The XML-RPC interface is provided by the PHP file xmlrpc.php in the root directory of the Wordpress installation. For this task, all I needed was the metaWeblog.newPost() function call. I chose metaWeblog.newPost() over the simpler blogger.newPost() because metaWeblog.newPost() supports a "date published" field for a post, whereas blogger.newPost() does not. In hindsight, though, I should have used the simpler Blogger API, since the date published feature never worked anyway! It could be because of limitations in Wordpress's XML-RPC implementation or something else, but the posts that I exported to my new blog all carried the timestamp of the date they were exported rather than the dates I extracted from the XML file.
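One possible explanation for the lost dates, offered as an assumption rather than something tested against the Wordpress release used here: Wordpress's metaWeblog implementation is generally understood to read the date from a struct member named dateCreated, sent as an XML-RPC dateTime.iso8601 value, not from a plain string under pubDate. The sketch below uses Python 3 (where xmlrpclib became xmlrpc.client); build_post_struct is a hypothetical helper, and nothing is sent over the network. It only shows how such a struct would be marshalled.

```python
import datetime
import xmlrpc.client


def build_post_struct(title, content, post_date):
    """Build a metaWeblog post struct from the exported fields.

    post_date is a string such as '2004-12-12 17:59:51', the format
    used by <post_date> in the export.
    """
    when = datetime.datetime.strptime(post_date, "%Y-%m-%d %H:%M:%S")
    return {
        "title": title,
        "description": content,
        # Assumption: the server expects 'dateCreated' as a real date value.
        "dateCreated": xmlrpc.client.DateTime(when),
    }


struct = build_post_struct(
    "Welcome to FarhanAhmed.net", "Welcome!", "2004-12-12 17:59:51"
)

# Marshal the call locally to inspect what would go over the wire.
payload = xmlrpc.client.dumps(
    ("1", "admin", "secret", struct, True), "metaWeblog.newPost"
)
print(payload)
```

In the printed request the date appears inside a dateTime.iso8601 element instead of a string element, which is the difference this sketch is meant to highlight.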

The Program

Here is the Python program that did it all. I will explain each part of the program in subsequent sections.

# Import needed libraries:
import xmlrpclib
from xml.dom import minidom
from pprint import pprint

# Define the blog publishing function
def blogPost( server, username, password, date, title, content ):
   
    datastruct = {'pubDate': date, 'description':content, 'title':title}
    returncode = server.metaWeblog.newPost('1',username,password,datastruct,1)
    print returncode

servname = xmlrpclib.Server("http://www.farhanahmed.net/blog/xmlrpc.php")

# Initialize the settings
d_username = 'admin'
d_password = 'd85a3f'
pprint(servname.system.listMethods())

# Open main XML file for parsing
maindoc = minidom.parse('wp_posts.xml')

# Get the list of <wp_posts> items
posts = maindoc.getElementsByTagName('wp_posts')

#  Initialize index
index = 0

# For each post, retrieve <post_date>, <post_content>, and <post_title>
for ipost in posts:

    n_date = ''
    n_title = ''
    n_content = ''
   
    index = index + 1

    print "Post : ", index
    print
   
    # Get the published date
    ldate = ipost.getElementsByTagName('post_date')
    for node in ldate:
        n_date = node.firstChild.data

    # Get the title of the post
    ltitle = ipost.getElementsByTagName('post_title')
    for node in ltitle:
        n_title = node.firstChild.data

    # Get the content of the post
    lcontent = ipost.getElementsByTagName('post_content')
    for node in lcontent:
        n_content = node.firstChild.data

    blogPost(servname, d_username, d_password, n_date, n_title, n_content)

maindoc.unlink()

The Code

Let's start from the beginning:

# Import needed libraries:
import xmlrpclib
from xml.dom import minidom
from pprint import pprint


These three lines of code import the necessary libraries from the default installation of Python. For this project, I used Python 2.4, which is available from http://www.python.org/ as a convenient MSI package for Win32. As is evident from the names, xmlrpclib provides the XML-RPC functions and xml.dom.minidom provides the XML parsing routines, while the not so obvious pprint library provides the "pretty print" interface that I used only to debug the script.

# Define the blog publishing function
def blogPost( server, username, password, date, title, content ):
   
    datastruct = {'pubDate': date, 'description':content, 'title':title}
    returncode = server.metaWeblog.newPost('1',username,password,datastruct,1)
    print returncode

This is the crux of the publishing part of the script. This function takes the server (an xmlrpclib.Server object), the username, the password, and the date, title and content to publish, and then makes a simple RPC (remote procedure call) to metaWeblog.newPost() to make the post. Before I can do that, though, I create a struct holding the data to be published, in the form of key:value pairs where the keys are pre-defined in the API documentation. The remaining arguments are the blog ID ('1') and a publish flag (1, meaning publish right away rather than save as a draft). Of course, the return value is then printed out to make sure the post was successful; it is the ID of the post just made.
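The function above prints whatever comes back, but a failed call surfaces as an exception instead of a return value. Here is a minimal sketch of the same call with fault handling, written for Python 3 (xmlrpc.client instead of xmlrpclib). FakeServer and FakeMetaWeblog are hypothetical stand-ins so the example runs without a real blog, and the fault code and message they raise are invented for the demonstration.

```python
import xmlrpc.client


def blog_post(server, username, password, date, title, content):
    """Publish one post; return the new post ID, or None if the server faults."""
    datastruct = {"pubDate": date, "description": content, "title": title}
    try:
        # Arguments: blog ID, username, password, post struct, publish flag.
        return server.metaWeblog.newPost("1", username, password, datastruct, True)
    except xmlrpc.client.Fault as fault:
        print("Post %r failed: %s (code %s)"
              % (title, fault.faultString, fault.faultCode))
        return None


class FakeMetaWeblog:
    """Hypothetical stand-in for the remote metaWeblog namespace."""

    def newPost(self, blog_id, username, password, struct, publish):
        if not struct["title"]:
            # Invented fault, only to exercise the error path.
            raise xmlrpc.client.Fault(500, "empty title")
        return "42"


class FakeServer:
    """Hypothetical stand-in for an XML-RPC server proxy."""
    metaWeblog = FakeMetaWeblog()


ok = blog_post(FakeServer(), "admin", "secret",
               "2004-12-12 17:59:51", "Hello", "Body")
failed = blog_post(FakeServer(), "admin", "secret",
                   "2004-12-12 17:59:51", "", "Body")
print(ok, failed)
```

With a real server, passing an xmlrpc.client.ServerProxy in place of FakeServer would make the same call over HTTP; returning None on a fault lets a long import carry on with the next post.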

servname = xmlrpclib.Server("http://www.farhanahmed.net/blog/xmlrpc.php")

# Initialize the settings
d_username = 'admin'
d_password = 'd85a3f'

This is where the initialization happens. xmlrpclib.Server creates an XML-RPC connection object and takes just the server URL as its parameter. In this case, the server is the XML-RPC interface provided by Wordpress. The username and password variables are then initialized to the known values (don't worry, I've changed the password :)).

pprint(servname.system.listMethods())

To test the server connection, I then call system.listMethods(), which returns a list of all the functions the server supports. Of course, I check that metaWeblog.newPost() is one of them. pprint() just makes the output prettier by formatting the list so it is easier to read.
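The same check can be done in code instead of by eye. The sketch below, in Python 3, starts a throwaway XML-RPC server on the local machine purely so the example is self-contained; the new_post stub and the has_method helper are hypothetical names, and a real run would point the proxy at the blog's xmlrpc.php instead.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def new_post(blog_id, username, password, struct, publish):
    """Stub standing in for the blog's metaWeblog.newPost."""
    return "1"


def has_method(proxy, name):
    """Return True if the server advertises the given method name."""
    return name in proxy.system.listMethods()


# Local test server on a free port; it exists only for this demonstration.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_introspection_functions()  # enables system.listMethods
server.register_function(new_post, "metaWeblog.newPost")
threading.Thread(target=server.serve_forever, daemon=True).start()

host, port = server.server_address
proxy = xmlrpc.client.ServerProxy("http://%s:%d/" % (host, port))

found = has_method(proxy, "metaWeblog.newPost")
missing = has_method(proxy, "metaWeblog.noSuchCall")
print(found, missing)

server.shutdown()
server.server_close()
```

Checking for the method up front means the script can stop with a clear message before it starts posting, rather than failing on the first post.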

# Open main XML file for parsing
maindoc = minidom.parse('wp_posts.xml')

# Get the list of <wp_posts> items
posts = maindoc.getElementsByTagName('wp_posts')

#  Initialize index
index = 0

Okay, so this is where the XML parsing begins. I used the minidom.parse() function, which takes the name of the file to parse (here, the XML file I exported from phpMyAdmin) and returns a document object. Then a call to getElementsByTagName() returns a list of all elements that match the given name. Since I want to extract all the posts, I pass in wp_posts, because each post is embedded within one of these items.

for ipost in posts:

    n_date = ''
    n_title = ''
    n_content = ''
   
    index = index + 1

    print "Post : ", index
    print
   
    # Get the published date
    ldate = ipost.getElementsByTagName('post_date')
    for node in ldate:
        n_date = node.firstChild.data

    # Get the title of the post
    ltitle = ipost.getElementsByTagName('post_title')
    for node in ltitle:
        n_title = node.firstChild.data

    # Get the content of the post
    lcontent = ipost.getElementsByTagName('post_content')
    for node in lcontent:
        n_content = node.firstChild.data

    blogPost(servname, d_username, d_password, n_date, n_title, n_content)

This is the meat of the parsing. For each post in the list, I use the same function, getElementsByTagName(), to extract the fields post_date, post_title and post_content. I then use the firstChild attribute to reach the text node inside each item and read its data. The code is pretty self-explanatory. The extracted data is stored in variables and passed to the blogPost() function that I defined earlier. This happens for every post in the XML file. Of course, the index is incremented on each cycle and printed out so that I can see the progress of the operation.
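One caveat with reading firstChild.data directly: an empty element such as <post_title></post_title> has no child node, so firstChild is None and the lookup raises AttributeError. A slightly more defensive way to read a field is sketched below in Python 3; field_text is a hypothetical helper, and the two-post SAMPLE is made up to show the empty case.

```python
from xml.dom import minidom

# Hypothetical export with one normal post and one with an empty title.
SAMPLE = """<farhanah_wordpress>
    <wp_posts>
        <post_date>2004-12-12 17:59:51</post_date>
        <post_title>Welcome to FarhanAhmed.net</post_title>
    </wp_posts>
    <wp_posts>
        <post_date>2004-12-20 09:00:00</post_date>
        <post_title></post_title>
    </wp_posts>
</farhanah_wordpress>"""


def field_text(post, tag, default=""):
    """Return the text inside the first <tag> under post, or default."""
    nodes = post.getElementsByTagName(tag)
    if not nodes:
        return default  # the element is missing altogether
    text_types = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    # Join every text child; an empty element simply yields ''.
    return "".join(
        child.data for child in nodes[0].childNodes
        if child.nodeType in text_types
    )


doc = minidom.parseString(SAMPLE)
rows = [
    (field_text(p, "post_date"), field_text(p, "post_title"),
     field_text(p, "post_content"))
    for p in doc.getElementsByTagName("wp_posts")
]
doc.unlink()

for row in rows:
    print(row)
```

The second post comes back with an empty title and empty content instead of stopping the run halfway through the import.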

maindoc.unlink()

The last line of the program breaks the internal references within the parsed document so that the memory it occupies can be freed.
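For readers on a newer Python, the explicit unlink() can be left to a with statement. This is a sketch assuming Python 3.2 or later, where minidom documents work as context managers and are unlinked when the block exits; the SAMPLE string is invented for the example.

```python
from xml.dom import minidom

# Hypothetical two-post document, just enough to iterate over.
SAMPLE = (
    "<farhanah_wordpress>"
    "<wp_posts><post_title>First</post_title></wp_posts>"
    "<wp_posts><post_title>Second</post_title></wp_posts>"
    "</farhanah_wordpress>"
)

# The document is unlinked automatically when the with block exits.
with minidom.parseString(SAMPLE) as doc:
    titles = [
        node.firstChild.data
        for node in doc.getElementsByTagName("post_title")
    ]

print(titles)
```

Anything needed from the document has to be copied out inside the block, as titles is here, because the nodes are no longer connected afterwards.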

That was easy, wasn't it? This script took me a couple of hours to conjure up, without having any knowledge of XML, XML-RPC, or Python when I began. A lot more could have been done here, but this did help me achieve my goal of exporting all my blog entries to the new blog without having to go through the trouble of manually copying and pasting the entries.

Any comments or suggestions are welcome.


Farhan Ahmed. 2005.