Python, XML and XML-RPC: A perfect combination
The Problem
I used to host this website at Startlogic.com, and since my account
was expiring soon, I wanted to take the opportunity to try out a new
web hosting service. Moving the site content itself wasn't that bad;
the problem was moving the hundreds of posts I had made on my blog.
I wasn't ready to just delete all that hard work and move on, and I
had to come up with a solution fast, before they restricted my account.
This is where a little bit of Python hacking came in handy, along with
the ability to export my Wordpress SQL tables into an XML file and the
XML-RPC Blogger API that Wordpress supports. More on that later.
Wordpress Posts
Wordpress stores all the posts and comments in a MySQL database, in
tables that are intuitively named. So I went into phpMyAdmin and ran a
simple SQL statement:
select * from wp_posts
This gave me a list of all my blog postings, which I exported to an XML
file using phpMyAdmin's built-in export feature. Now I had a copy of
all the posts in one conveniently formatted file. The remaining problem
was to parse this file, extract the relevant data, and then somehow
post it to my new blog, all automatically.
The XML file
The XML file that was created by phpMyAdmin
looked somewhat like this:
<farhanah_wordpress>
  <!-- Table wp_posts -->
  <wp_posts>
    <ID>1</ID>
    <post_author>1</post_author>
    <post_date>2004-12-12 17:59:51</post_date>
    <post_date_gmt>2004-12-13 01:59:51</post_date_gmt>
    <post_content>Welcome to FarhanAhmed.net!
I am
sorry about the downtime for the last couple of weeks. I realized that
the service I should except from my hosting service provider should be
directly proportional to the amount they charge me - unfortunately they
turned out to be really unreliable and so I had to switch. Let's see
how these people at <a
href="http://www.startlogic.com">StartLogic</a>
turn out to be.
</post_content>
    <post_title>Welcome to FarhanAhmed.net</post_title>
    <post_category>1</post_category>
    <post_excerpt></post_excerpt>
    <post_status>publish</post_status>
    <comment_status>open</comment_status>
    <ping_status>open</ping_status>
    <post_password></post_password>
    <post_name>hello-world</post_name>
    <to_ping></to_ping>
    <pinged></pinged>
    <post_modified>2004-12-12 19:15:10</post_modified>
    <post_modified_gmt>2004-12-13 03:15:10</post_modified_gmt>
    <post_content_filtered></post_content_filtered>
    <post_parent>0</post_parent>
    <guid>http://farhanahmed.net/?p=1</guid>
    <menu_order>0</menu_order>
  </wp_posts>
As you can see, each post is embedded in a <wp_posts> item, and within
each item the relevant data is conveniently stored as <post_content>,
<post_title> and <post_date>.
These three pieces of data are what I wanted to export to my new
blog; I couldn't care less about the others.
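As a rough illustration of that structure, the sketch below parses a shortened, made-up sample in the same shape and pulls out just those three fields. It is written for Python 3 rather than the Python 2.4 used for the script in this post, and the sample data and the names SAMPLE, wanted and extracted are assumptions for the example only.

```python
from xml.dom import minidom

# Hypothetical, shortened sample in the same shape as the exported file.
SAMPLE = """<farhanah_wordpress>
  <wp_posts>
    <ID>1</ID>
    <post_date>2004-12-12 17:59:51</post_date>
    <post_content>Welcome to FarhanAhmed.net!</post_content>
    <post_title>Welcome to FarhanAhmed.net</post_title>
  </wp_posts>
</farhanah_wordpress>"""

doc = minidom.parseString(SAMPLE)
wanted = ("post_date", "post_title", "post_content")

extracted = []
for post in doc.getElementsByTagName("wp_posts"):
    # For each wanted field, read the text of the first matching element.
    fields = {}
    for tag in wanted:
        element = post.getElementsByTagName(tag)[0]
        fields[tag] = element.firstChild.data
    extracted.append(fields)

doc.unlink()
print(extracted)
```

Every other element in the item (ID, post_status and so on) is simply never looked up, so it is ignored.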
XML-RPC Interface to Wordpress
Wordpress supports the Blogger, MetaWeblog and Movable Type APIs for
posting to and reading from a blog. The XML-RPC interface is provided
by the PHP file xmlrpc.php in the root directory of the Wordpress
installation. For this task, all I needed was the metaWeblog.newPost()
function call. I chose metaWeblog.newPost() over the simpler
blogger.newPost() because metaWeblog.newPost() supports the
"date published" feature for a post, whereas blogger.newPost() does
not. In hindsight, though, I should have used the simpler Blogger API,
since the date published feature never worked anyway! It could be
because of limitations of the XML-RPC implementation in Wordpress
or something else, but the posts that I exported to my new
blog all had the timestamp of the date they were exported,
rather than the dates that I extracted from the XML file.
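One possible explanation, offered here as an untested assumption: the MetaWeblog API names the date member dateCreated and expects an XML-RPC dateTime.iso8601 value, whereas the script in this post sends a plain string under the key pubDate. The Python 3 sketch below builds a struct that way and marshals it locally, without contacting any server, so the resulting request can be inspected. The helper name build_post_struct and the credentials are hypothetical.

```python
import xmlrpc.client
from datetime import datetime


def build_post_struct(date_text, title, content):
    """Build a post struct with a typed dateCreated member."""
    # The export stores dates as 'YYYY-MM-DD HH:MM:SS'.
    created = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S")
    return {
        "title": title,
        "description": content,
        # xmlrpc.client.DateTime is marshalled as <dateTime.iso8601>.
        "dateCreated": xmlrpc.client.DateTime(created),
    }


struct = build_post_struct("2004-12-12 17:59:51", "Welcome", "Hello")

# Marshal the call locally (no network) to see what would be sent.
payload = xmlrpc.client.dumps(
    ("1", "admin", "secret", struct, True),
    methodname="metaWeblog.newPost",
)
print(payload)
```

Whether this would have preserved the original dates on that Wordpress installation is not something this post can confirm.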
The Program
Here is the Python program that did it all.
I will explain each part of the program in the following sections.
# Import needed libraries
import xmlrpclib
from xml.dom import minidom
from pprint import pprint

# Define the blog publishing function
def blogPost(server, username, password, date, title, content):
    datastruct = {'pubDate': date, 'description': content, 'title': title}
    returncode = server.metaWeblog.newPost('1', username, password, datastruct, 1)
    print returncode

servname = xmlrpclib.Server("http://www.farhanahmed.net/blog/xmlrpc.php")

# Initialize the settings
d_username = 'admin'
d_password = 'd85a3f'

pprint(servname.system.listMethods())

# Open main XML file for parsing
maindoc = minidom.parse('wp_posts.xml')

# Get the list of <wp_posts> items
posts = maindoc.getElementsByTagName('wp_posts')

# Initialize index
index = 0

# For each post, retrieve <post_date>, <post_content>, and <post_title>
for ipost in posts:
    n_date = ''
    n_title = ''
    n_content = ''
    index = index + 1
    print "Post : ", index
    print
    # Get the published date
    ldate = ipost.getElementsByTagName('post_date')
    for node in ldate:
        n_date = node.firstChild.data
    # Get the title of the post
    ltitle = ipost.getElementsByTagName('post_title')
    for node in ltitle:
        n_title = node.firstChild.data
    # Get the content of the post
    lcontent = ipost.getElementsByTagName('post_content')
    for node in lcontent:
        n_content = node.firstChild.data
    blogPost(servname, d_username, d_password, n_date, n_title, n_content)

maindoc.unlink()
The Code
Let's start from the beginning:
# Import needed libraries
import xmlrpclib
from xml.dom import minidom
from pprint import pprint
These three lines of code import the necessary libraries from the
default installation of Python. For this project, I used Python 2.4,
which is available from http://www.python.org/ as a convenient MSI
package for Win32. As is evident from the names, xmlrpclib provides the
XML-RPC functions and xml.dom.minidom provides the XML parsing
routines; the not-so-obvious pprint library provides the "pretty print"
interface, which I used only to debug the script.
# Define the blog publishing function
def blogPost(server, username, password, date, title, content):
    datastruct = {'pubDate': date, 'description': content, 'title': title}
    returncode = server.metaWeblog.newPost('1', username, password, datastruct, 1)
    print returncode
This is the crux of the publishing part of the script. The function
takes the server (an xmlrpclib.Server object), the username, the
password, and the date, title and content to publish, and then makes a
simple RPC (remote procedure call) to metaWeblog.newPost() to make the
post. Before it does that, it builds a struct holding the data to be
published, in the form of key:value pairs whose keys are pre-defined in
the API documentation. The other arguments in the call are the blog ID
('1') and a publish flag (1). The return code, an integer denoting the
ID of the post just made, is then printed so I can check that the post
was successful.
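For readers on Python 3, where xmlrpclib is named xmlrpc.client, a minimal sketch of the same function might look like the following. It returns the post ID instead of printing it and catches XML-RPC faults; those two changes, the name blog_post, and the FakeServer stand-in (used only so the sketch runs without a real blog) are assumptions of this example, not part of the original script.

```python
import xmlrpc.client


def blog_post(server, username, password, date, title, content):
    """Publish one post; return the new post ID, or None on an XML-RPC fault."""
    datastruct = {"pubDate": date, "description": content, "title": title}
    try:
        # Arguments: blog ID, username, password, post struct, publish flag.
        return server.metaWeblog.newPost("1", username, password, datastruct, True)
    except xmlrpc.client.Fault as fault:
        print("Post failed:", fault.faultCode, fault.faultString)
        return None


# Hypothetical stand-in for xmlrpc.client.ServerProxy so the sketch runs offline.
class _FakeMetaWeblog:
    def newPost(self, blog_id, username, password, struct, publish):
        if not struct["title"]:
            raise xmlrpc.client.Fault(500, "Empty title")
        return 42  # made-up post ID


class FakeServer:
    metaWeblog = _FakeMetaWeblog()


ok = blog_post(FakeServer(), "admin", "secret", "2004-12-12 17:59:51", "Hi", "Body")
failed = blog_post(FakeServer(), "admin", "secret", "2004-12-12 17:59:51", "", "Body")
```

Against a real blog, the first argument would be an xmlrpc.client.ServerProxy created from the xmlrpc.php URL.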
servname = xmlrpclib.Server("http://www.farhanahmed.net/blog/xmlrpc.php")

# Initialize the settings
d_username = 'admin'
d_password = 'd85a3f'
This is where the initialization happens.
xmlrpclib.Server creates an XML-RPC connection object and
takes just the server URL as its parameter. In this case, the server
is the XML-RPC interface provided by Wordpress. The username and
password variables are then initialized to the known values (don't
worry, I've changed the password :)).
pprint(servname.system.listMethods())
To test the server connection, I then make a call to the
system.listMethods() function, which returns a list of all functions
supported by the server. Of course, I check that metaWeblog.newPost()
is one of them. pprint() just makes the output prettier by formatting
it in a table-like form.
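Checking the list by eye works for a one-off script; the same check can also be done in code. The sketch below is a Python 3 illustration under the assumption that the server exposes system.listMethods() as described above. The helper name supports_method and the FakeServer stand-in with its method list are hypothetical and exist only so the example runs offline.

```python
def supports_method(server, method_name):
    """Return True if system.listMethods() on the server includes method_name."""
    return method_name in server.system.listMethods()


# Hypothetical stand-in for an XML-RPC server proxy; the method list is made up.
class _FakeSystem:
    def listMethods(self):
        return ["system.listMethods", "blogger.newPost", "metaWeblog.newPost"]


class FakeServer:
    system = _FakeSystem()


has_new_post = supports_method(FakeServer(), "metaWeblog.newPost")
has_other = supports_method(FakeServer(), "example.notAMethod")
print(has_new_post, has_other)
```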
# Open main XML file for parsing
maindoc = minidom.parse('wp_posts.xml')
# Get the list of <wp_posts> items
posts = maindoc.getElementsByTagName('wp_posts')
# Initialize index
index = 0
Okay, so this is where the XML parsing begins. I used the
minidom.parse() function, which takes the name of the file to parse
(here, the XML file I exported from phpMyAdmin) and returns a document
object. Then a call to getElementsByTagName() returns a list of all
elements that match the given name. Since I want to extract all the
posts, I pass in wp_posts, because every post is embedded in one of
these items.
for ipost in posts:
    n_date = ''
    n_title = ''
    n_content = ''
    index = index + 1
    print "Post : ", index
    print
    # Get the published date
    ldate = ipost.getElementsByTagName('post_date')
    for node in ldate:
        n_date = node.firstChild.data
    # Get the title of the post
    ltitle = ipost.getElementsByTagName('post_title')
    for node in ltitle:
        n_title = node.firstChild.data
    # Get the content of the post
    lcontent = ipost.getElementsByTagName('post_content')
    for node in lcontent:
        n_content = node.firstChild.data
    blogPost(servname, d_username, d_password, n_date, n_title, n_content)
This is the meat of the parsing. For each of the posts in the list, I
use the same function, getElementsByTagName(), to find the post_content,
post_title and post_date fields. I then use the firstChild pointer to
extract the data within these items. The code is pretty
self-explanatory. The extracted data is stored in variables and passed
to the blogPost() function that I defined earlier. This happens for
all the posts in the XML file. Of course, the index is incremented each
cycle and printed out for me to see the progress of the operation.
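One thing to watch with firstChild: an empty element such as <post_excerpt></post_excerpt> has no child node, so firstChild is None and reading .data from it raises an error. That did not matter for the three fields used here, but a slightly more defensive lookup could be sketched as follows. This is a Python 3 illustration; the helper name element_text and the sample XML are hypothetical.

```python
from xml.dom import minidom


def element_text(parent, tag):
    """Return the concatenated text of the first <tag> under parent, or ''."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        return ""
    # Join every text/CDATA child; an empty element simply yields ''.
    return "".join(
        child.data
        for child in matches[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


doc = minidom.parseString(
    "<wp_posts><post_title>Hello</post_title><post_excerpt></post_excerpt></wp_posts>"
)
post = doc.documentElement
title = element_text(post, "post_title")      # element with text
excerpt = element_text(post, "post_excerpt")  # empty element
missing = element_text(post, "post_date")     # element not present
doc.unlink()
print(repr(title), repr(excerpt), repr(missing))
```

Joining all text children also covers the case where the parser splits long content into more than one text node.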
maindoc.unlink()
The last line of the program breaks the internal references within the
parsed document so that the memory it occupies can be freed.
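In Python 3, minidom documents can also be used in a with statement, which calls unlink() automatically when the block ends. A minimal sketch, with a made-up SAMPLE string standing in for the exported file:

```python
from xml.dom import minidom

# Hypothetical one-post sample standing in for wp_posts.xml.
SAMPLE = "<root><wp_posts><post_title>First</post_title></wp_posts></root>"

# Leaving the with block calls unlink() on the document automatically.
with minidom.parseString(SAMPLE) as doc:
    titles = [
        node.firstChild.data
        for node in doc.getElementsByTagName("post_title")
    ]

print(titles)
```

The extracted strings remain usable after the block, since they are ordinary Python strings rather than DOM nodes.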
That was easy, wasn't it? This script took me a couple of hours to
conjure up, without having any knowledge of XML, XML-RPC, or Python
when I began. A lot more could have been done here, but this did help
me achieve my goal of exporting all my blog entries to the new blog
without having to go through the trouble of manually copying and
pasting the entries.
Any comments or suggestions are welcome.