Friday, February 18, 2011

Parsing XML from SharePoint lists using Perl


I was tasked with generating flat-files from Microsoft SharePoint
lists recently, for loading into a database, and wanted to share one way of
doing this, which involves downloading the XML with a tool like wget or curl,
and then parsing it using a Perl script.
Unfortunately, I did not have access to the back-end, or I might have looked
for a way of running a query on the SQL Server database behind the SharePoint
There are also other ways of doing this that involve the use of SOAP::Lite.
Have look, for example, at this great post.
For this particular exercise I had access to a SharePoint 2007 site. Your
mileage may vary with a more recent version.

The SharePoint list URL

Before we can pull the XML from SharePoint, we’ll need the URL for the XML
version of the list we’ll be extracting.
From the SharePoint site for the list, export the list as an Excel
Spreadsheet. This will cause the site to push an file called owssvr.iqy as
a download to your browser. Save this file and open it with a text editor.

The owssvr.iqy file

Here’s what the downloaded web query file (owssvr.iqy) might look like:

RootFolder=/path/to/list/My List 
The URL listed on the third line of the file will point directly to the XML for the list. The list ID itself is shown as the attribute SharePointListName, and is also part of the URL.

Downloading the XML

With wget

For the SharePoint site I’m querying I have to use HTTPS, and it requires
authentication. Using the link obtained above, I can download the XML for the
list as follows (type the whole command into a single line):
wget --no-check-certificate --user=username --password=password -O test_list.xml
Note: Even though the curly braces ({, and }) in the URL are not valid
characters for a link, wget is able to use them, as it is encoding them behind
the scenes into %7B and %7D respectively. BTW if you need to encode
strings into valid URL format, have a look at URI::Escape.

With curl

Curl requires more strict URL encoding, but downloads the XML just as
well with the below command (again, put it into a single line).
curl --insecure --ntlm --user username:password -o test_list.xml

Structure of the XML

Here’s a sample of the XML obtained with either of the commands above. In this
example I’m querying a list of tasks.
<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='ows_Attachments' rs:name='Attachments' rs:number='1'>
         <s:datatype dt:type='boolean' dt:maxLength='1' />
      <s:AttributeType name='ows_LinkTitle' rs:name='Title' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      <s:AttributeType name='ows_AssignedTo' rs:name='Assigned To' rs:number='3'>
         <s:datatype dt:type='string' dt:lookup='true' dt:maxLength='512' />
      <s:AttributeType name='ows_Status' rs:name='Status' rs:number='4'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      <s:AttributeType name='ows_Priority' rs:name='Priority' rs:number='5'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      <s:AttributeType name='ows_DueDate' rs:name='Due Date' rs:number='6'>
         <s:datatype dt:type='datetime' dt:maxLength='8' />
      <s:AttributeType name='ows_PercentComplete' rs:name='% Complete' rs:number='7'>
         <s:datatype dt:type='float' dt:maxLength='8' />
   <z:row ows_Attachments='0' ows_LinkTitle='Task #1' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
   <z:row ows_Attachments='0' ows_LinkTitle='Task #2' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
   <z:row ows_Attachments='0' ows_LinkTitle='Task #3' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
There are three main sections
(1) The XML header, enclosed in the topmost <xml> tag.
(2) The schema or data description. Surrounded by the <s:Schema> tags.
(3) The data. Enclosed by the <rs:data> tags.
The strategy then, is to parse the schema first, and after having all
available attributes, we can then parse the data section to output
delimited columns.

The Perl code

I’m using Perl 5.12.1 with the below modules:
XML::Simple allows you to choose a preferred parser, and I chose
XML::LibXML::SAX::Parser, since it supports name spaces, which are used in
the XML.
# --------------------------------------------------------------
my $VERSION = '$Id$';
# --------------------------------------------------------------
# Script Name:
# Purpose:      To parse an XML file from a SharePoint list.
# --------------------------------------------------------------
use warnings; use strict;
use Getopt::Std; $Getopt::Std::STANDARD_HELP_VERSION = 1;
use XML::Simple;
$XML::Simple::PREFERRED_PARSER = "XML::LibXML::SAX::Parser";
# --------------------------------------------------------------

# --------------------------------------------------------------
# Print usage:
# --------------------------------------------------------------
sub print_usage {
    (my $usage = <<END_OF_HELP) =~ s/^ {4}//gm;

    Parse an XML file from a Sharepoint list and output a
    delimted version of the data.

    $0 [-d DELIMITER] file1 [file2 file3 file4 ... filen]
    $0 -h prints this help message.

    Available options:

        -o output file (optional -- STDOUT is the default).
        -k the output delimiter (TAB is the default)
        -H print a header with the output.
        -M print meta-data (schema) only.
        -R remove extraneous IDs or data-types pre-pended to
           data elements

    $0 -o my_output.txt my_input.xml
    $0 my_input.xml >my_output.txt


    print $usage;
sub HELP_MESSAGE { print_usage; }
# --------------------------------------------------------------

# --------------------------------------------------------------
# Parse the command-line options:
# --------------------------------------------------------------
my %option; getopts("k:o:hHMR",\%option);
if ($option{h}) { print_usage; exit(0); }
my $odelm = "\t"; if ($option{k}) { $odelm = $option{k}; }
my $ofile = "-"; if ($option{o}) { $ofile = $option{o}; }
my $header  = 0; if ($option{H}) { $header  = 1; }
my $meta = 0; if ($option{M}) { $meta = 1; }
my $rcln = 0; if ($option{R}) { $rcln  = 1; }
if ($#ARGV < 0) { $ARGV[0] = "-"; }
# --------------------------------------------------------------

# --------------------------------------------------------------
# The main part of the script:
# --------------------------------------------------------------
my $xml = new XML::Simple;
open my $OFILE, "> $ofile" or die "Couldn't open file $ofile: $!";
for my $ifile (@ARGV) {

    my $sp_list = $xml->XMLin("$ifile");
    my @attr;

    # First, parse the Schema.
    # We are setting up a description hash (%desc) for every column in the
    # schema and putting all descriptions in an attributes array (@attr),
    # which is indexed by the "rs:number" attribute.
    for my $col (keys %{$sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}}) {
        my %desc;
        $desc{"name"} = $col;
        $desc{"rs_name"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"rs:name"};
        $desc{"type"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"s:datatype"}->{"dt:type"};
        $desc{"maxl"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"s:datatype"}->{"dt:maxLength"};
        $attr[$sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"rs:number"} - 1] = \%desc;

    # Print the meta-data if needed:
    # This includes every attribute for every column.
    # (Note that if we output the schema, we don't go into the data)
    if ($meta) {
        my @header = ("COL","FIELD NAME","VISIBLE NAME","DATA TYPE","LEN");
        print join("$odelm",@header),"\n";
        for my $i (0..$#attr) {
            my @record;
            push @record, $i+1;
            push @record, $attr[$i]->{"name"};
            push @record, $attr[$i]->{"rs_name"};
            push @record, $attr[$i]->{"type"};
            push @record, $attr[$i]->{"maxl"};
            print join("$odelm",@record),"\n";

    # Print the header if needed.
    # (This is the header for the SharePoint list.)
    if ($header and not $meta) {
        my @header;
        for my $i (0..$#attr) {
            $header[$i] = $attr[$i]->{"rs_name"};
        print join("$odelm",@header),"\n";
    # Get the SharePoint list data and print it.
    unless($meta) {
        for my $row (@{$sp_list->{"rs:data"}->{"z:row"}}) {
            my @record;
            for my $i (0..$#attr) {
                if (exists($row->{$attr[$i]->{"name"}})) {
                    my $token = $row->{$attr[$i]->{"name"}};
                    if ($rcln) {
                        $token =~ s/^([0-9]+|float|string)\;\#//go;
                    push @record, $token;
                } else {
                    push @record, qq{};
            print {$OFILE} join("$odelm",@record),"\n";
# --------------------------------------------------------------

Running the script

If we’re just interested in the data, we can call the script as follows: -RH test_list.xml > test_list.txt
And it will give a tab-delimited file.

You can also output the schema of the list with the following command: -M test_list.xml
I’ve reformatted it into an ASCII table below:
|   1 | ows_Attachments     | Attachments  | boolean   |   1 |
|   2 | ows_LinkTitle       | Title        | string    | 512 |
|   3 | ows_AssignedTo      | Assigned To  | string    | 512 |
|   4 | ows_Status          | Status       | string    | 512 |
|   5 | ows_Priority        | Priority     | string    | 512 |
|   6 | ows_DueDate         | Due Date     | datetime  |   8 |
|   7 | ows_PercentComplete | % Complete   | float     |   8 |

Additional comments

Overall this code is fairly reusable as the structure of the XML is consistent
across various SharePoint lists. And since it is scriptable, it is easy
automate the update of a database table from a SharePoint list.

You can download the files mentioned above, here:,
It is possible to do the download step in Perl as well by using LWP::UserAgent,
but this is left as an exercise.