Thursday, June 20, 2013

Two helpful uses of transposing text

From algebra we know that the transpose of a matrix is obtained by changing rows to columns and columns to rows.  This simple idea of rearranging a set of objects can help us solve two common problems in writing SQL code and in the ETL processing of flat files.


Building IN-clauses for SQL queries (converting columns to rows)

In multi-query research of a database, I've often encountered situations where there are a number of values, anywhere from 2 to 1000 (if you have more than this, you should consider loading them into their own temporary table), that I need to research further with a query.  This leads to the inevitable retyping of these values to get them into a query, or the use of MS Excel to manipulate a relatively small set of strings that could easily be handled with a script.  If I happen to be working with SQL*Plus, for example, why switch to Excel when I could do the whole thing from the command line?

My examples were run from Cygwin on Windows, but they can easily be adapted to Linux or Mac OS X.

Let's say I've isolated a list of troublesome records to the five (5) record IDs shown below (they would come out of a SQL client in column form):
162692
573677
742713
817851
994363

Now we build the IN-clause by processing these values through col2row.pl.  Assuming the values are in the clipboard, in Cygwin we can use the getclip command and pipe its output through col2row.pl.

getclip |d2u |col2row.pl -f1 -k,

The d2u command is optional, and I'm using it in Cygwin to remove any extra carriage-returns in the incoming clipboard data.  On Mac OS X we would use pbpaste in place of getclip, and on Linux we could use the xclip -o or xsel -o commands.

The '-f1' makes it process the first (and only) column, and the '-k,' sets the output delimiter to a comma.  Here's the output:
162692,573677,742713,817851,994363
If we want the values single-quoted, we simply use the '-s' option:
getclip |d2u |col2row.pl -f1 -s -k,
Here is the output: 
'162692','573677','742713','817851','994363'
Depending on your environment, you can pipe these directly into the clipboard:

getclip |d2u |col2row.pl -f1 -s -k, |putclip

As stated earlier, you can use pbcopy/pbpaste on Mac OS X, or xclip/xsel on Linux.
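If you don't have col2row.pl at hand, here is a minimal sketch of what such a script might look like (an assumption on my part; the real col2row.pl may differ in its details):

#!/usr/bin/perl
# A minimal col2row sketch:
#   -f N  take the Nth whitespace-separated field (default 1)
#   -k X  use X as the output delimiter (default comma)
#   -s    single-quote each value
use warnings; use strict;
use Getopt::Std;
my %opt; getopts('f:k:s', \%opt);
my $field = ($opt{f} || 1) - 1;
my $delim = defined $opt{k} ? $opt{k} : ',';
my @vals;
while (<>) {
    s/^\s+|\s+$//g;            # trim spaces and line endings
    next unless length;        # skip blank lines
    my $v = (split)[$field];   # pick the requested column
    push @vals, $opt{s} ? "'$v'" : $v;
}
print join($delim, @vals), "\n";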

A short-cut for users of AutoHotkey on Windows:

The following code will output a comma-separated row from a single column stored in your clipboard when you use the Win-c shortcut:
;-----------------------------------------------------------
; Win-c - Paste a row-version of the column in the clipboard
; CR+LF are replaced with commas, and entries are not quoted
;-----------------------------------------------------------
#c::
outstring := RegExReplace(clipboard, " *\R *$", "")
outstring := RegExReplace(outstring, " *\R *", "`,")
outstring := RegExReplace(outstring, "^ +", "")
outstring := RegExReplace(outstring, " +$", "")
clipboard := outstring
Send +{Ins}
return
;-----------------------------------------------------------
You can also build a shortcut to paste single quoted values, but that is left as an exercise.


Comparing two file layouts (converting rows to columns)

In loading data files into databases, one often encounters wide layouts with dozens or even hundreds of columns.  If a layout changes, detecting the change can be time-consuming if the source has not documented it exactly.  Also, if you have the headers in two separate files, running a diff will simply tell you that the two header lines differ, but won't tell you where the difference is.

What we need is a way to turn the header row into a column of values, so that if we have two such columns from two headers, the diff command will give us useful information about how the two headers differ.
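A minimal sketch of such a row-to-column script is shown below; this is my assumption about its workings, since the examples that follow suggest that the real row2col.pl transposes the first line of its input:

#!/usr/bin/perl
# A minimal row2col sketch: print each tab-separated field of the
# first (header) row on its own line.
use warnings; use strict;
my $header = <>;     # read only the first line
chomp $header;
print "$_\n" for split /\t/, $header;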

In our trivial example, we have some tab-delimited contact data:
$ cat file1.txt
record_id first_name last_name email company_name address_1 city state zip
10001 John Smith jsmith@email.zz Acme Inc. 123 Elm Street Anytown AZ 50001
$  cat file2.txt
record_id first_name last_name email company_name address_1 address_2 city state zip
10001 John Smith jsmith@email.zz Acme Inc. 123 Elm Street Mail Stop 1 Anytown AZ 50001

To properly compare the headers, we run row2col.pl on each file:
$ row2col.pl file1.txt
record_id
first_name
last_name
email
company_name
address_1
city
state
zip
$ row2col.pl file2.txt
record_id
first_name
last_name
email
company_name
address_1
address_2
city
state
zip


Putting each of these in a separate "layout" file:
$ row2col.pl file1.txt >layout1.txt
$ row2col.pl file2.txt >layout2.txt
We can diff them and get a clearer picture of the layout change:
$ diff layout1.txt layout2.txt
6a7
> address_2

A new column seven (7) called "address_2" was added in file2.


Displaying sample records

When combining the output of row2col.pl with tab2tbl.pl (from a previous post), you can get a very clear view of the layout of the file alongside sample values.
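One hedged way to build such a view from the shell (assuming bash for process substitution, with the scripts behaving as in the examples above):

$ (printf 'FIELD\tSAMPLE\n'; paste <(row2col.pl file2.txt) <(sed -n 2p file2.txt | tr '\t' '\n')) | tab2tbl.pl

Here row2col.pl supplies the field names, sed and tr turn the first data record into a matching column, and paste pairs them up before tab2tbl.pl renders the table.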


The Scripts

You can download the Perl scripts used for the above examples here: col2row.pl, row2col.pl, tab2tbl.pl.





Friday, April 20, 2012

Creating text tables with Text::ASCIITable

I work a lot with delimited flat files, and a good way to display them in plain text is to create an ASCII table.  The attached script, tab2tbl.pl, uses the Text::ASCIITable module and can convert an incoming tab-delimited file into a nice-looking table.
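The attached script has more options, but the core tab-to-table step might look roughly like this minimal sketch (my simplification, not the full tab2tbl.pl):

#!/usr/bin/perl
# Read a tab-delimited file and render it with Text::ASCIITable.
use warnings; use strict;
use Text::ASCIITable;
my $tbl = Text::ASCIITable->new();
my $header = <>;
chomp $header;
$tbl->setCols(split /\t/, $header);    # first row becomes the headings
while (<>) {
    chomp;
    $tbl->addRow(split /\t/, $_, -1);  # -1 keeps trailing empty fields
}
print $tbl;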

Here's an example of a simple input file:


Below is what it looks like after formatting is applied:


Adjusting for proper width when formatting UTF-8 files

If we are formatting UTF-8 data, not only are some of the characters encoded in more than one byte, they can also take up one or two spaces on the screen.  The latter are referred to as double-width characters, and many CJKV glyphs fall into this category.  See this proposal for a more detailed description.

If we use Text::ASCIITable without modification, it is not able to distinguish between single and double-width characters, so the output doesn't always line up.  Here's a sample:


However, the module gives us the option of defining our own count callback function, so we can correct for this by making use of the mbswidth() function from the Text::CharWidth module:

use Text::CharWidth ();   # provides mbswidth()

sub count_utf8_cb {
    my $input = shift;
    # Strip ANSI escape sequences before measuring the width:
    $input =~ s/\33\[(\d+(;\d+)?)?[musfwhojBCDHRJK]//g;
    $input =~ s/\33\([0B]//g;
    return Text::CharWidth::mbswidth($input);
}
$tbl->setOptions('cb_count', \&count_utf8_cb);

Once this is in place, we get the following:



ANSI character output

The script can also produce output using ANSI color codes (if your terminal supports them) via the -a option.

Download Links

You can download the files mentioned above here: tab2tbl.pl, chinese_test_file_1.txt.

Wednesday, June 8, 2011

Using Salesforce.com's REST API from Perl

Introduction

Here's a simple way to access Salesforce data even from behind a firewall, using Perl and the REST API.

If you're not familiar with the details of the Salesforce REST API, have a look at the developer docs.

I'm using Perl 5.12.1 with the WWW::Mechanize and WWW::Salesforce modules.

The firewall settings (optional)

Since I'm behind a firewall, I have to define the necessary variables so the various Perl packages can route the requests properly.  Here's what's needed:
#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize;
use WWW::Salesforce;

# The firewall proxy variables:
$ENV{HTTPS_PROXY} = 'http://proxy.org.com:80';
$ENV{HTTPS_PROXY_USERNAME} = "proxy_user";
$ENV{HTTPS_PROXY_PASSWORD} = "proxy_pass";
my $mech = WWW::Mechanize->new();
$mech->agent('Mozilla/5.0');
$mech->proxy('http', 'http://proxy_user:proxy_pass@proxy.org.com:80');
$mech->proxy('https', undef);

If you don't need to go through a firewall, you can skip all of these proxy settings, but you'll still need the $mech definition, as it will be used to make the REST requests later.

Authorization

The most difficult part (and it wasn't difficult) was setting up the authorization.  Salesforce offers two options:
  1. OAuth 2.0
  2. Session ID
The Session ID is far easier than the first option, so I'm using it here.

The first step is to log in via the SOAP API (since that is a way to get a session ID):

# Authenticate first via SOAP interface to get a session ID:
my $sforce = eval { WWW::Salesforce->login(
                    username => 'sfdc_user@org.com',
                    password => 'sfdc_pass' ); };
die "Could not login to SFDC: $@" if $@;

Session ID

Now that we've logged in via the SOAP API, we have a session ID as part of the $sforce variable, which we can get as follows:
# Get the session ID:
my $hdr = $sforce->get_session_header();
my $sid = ${$hdr->{_value}->[0]}->{_value}->[0];

This SID will need to be sent as part of our request header when using the REST API.

A REST Request

Here's a simple request for all available SObjects:
# Now get some data (using the session ID from above):
$mech->add_header( "Authorization" => "OAuth $sid" );
$mech->add_header( "X-PrettyPrint" => '1' );
$mech->get("https://na1.salesforce.com/services/data/v20.0/sobjects/");
# ...and write it out to the screen:
print $mech->content;

Going further

The developer doc mentioned above lists many examples of requests that can be made this way, including SOQL queries.
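For example, here is a hedged sketch of running a SOQL query through the same $mech handle (the query endpoint follows the same versioned URL pattern as above, and URI::Escape handles the encoding; the object and field names are just placeholders):

# Run a SOQL query via the REST query endpoint:
use URI::Escape;
my $soql = uri_escape('SELECT Id, Name FROM Account LIMIT 5');
$mech->get("https://na1.salesforce.com/services/data/v20.0/query/?q=$soql");
print $mech->content;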

Friday, February 18, 2011

Parsing XML from SharePoint lists using Perl

Introduction

I was tasked with generating flat-files from Microsoft SharePoint
lists recently, for loading into a database, and wanted to share one way of
doing this, which involves downloading the XML with a tool like wget or curl,
and then parsing it using a Perl script.
Unfortunately, I did not have access to the back-end, or I might have looked
for a way of running a query on the SQL Server database behind the SharePoint
application.
There are also other ways of doing this that involve the use of SOAP::Lite.
Have a look, for example, at this great post.
For this particular exercise I had access to a SharePoint 2007 site. Your
mileage may vary with a more recent version.
 

The SharePoint list URL

Before we can pull the XML from SharePoint, we’ll need the URL for the XML
version of the list we’ll be extracting.
From the SharePoint site for the list, export the list as an Excel
Spreadsheet. This will cause the site to push a file called owssvr.iqy as
a download to your browser. Save this file and open it with a text editor.
  

The owssvr.iqy file

Here’s what the downloaded web query file (owssvr.iqy) might look like:
WEB
1
https://sharepoint.domain.org/path/to/list/_vti_bin/owssvr.dll?XMLDATA=1&List={921ACBDB-5060-4B35-A126-19E9A060513D}&View={F1275A60-A531-4F88-882E-E096721E53D0}

Selection={921ACBDB-5060-4B35-A126-19E9A060513D}-{F1275A60-A531-4F88-882E-E096721E53D0}
EditWebPage=
Formatting=None
PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=True
SingleBlockTextImport=False
DisableDateRecognition=False
DisableRedirections=False
SharePointApplication=https://sharepoint.domain.org/path/to/list/_vti_bin
SharePointListView={F1275A60-A531-4F88-882E-E096721E53D0}
SharePointListName={921ACBDB-5060-4B35-A126-19E9A060513D}
RootFolder=/path/to/list/My List

The URL on the third line of the file points directly to the XML for the list. The list ID itself appears as the SharePointListName attribute, and is also part of the URL.

Downloading the XML

With wget

For the SharePoint site I’m querying I have to use HTTPS, and it requires
authentication. Using the link obtained above, I can download the XML for the
list as follows (type the whole command into a single line):
wget --no-check-certificate --user=username --password=password -O test_list.xml
'https://sharepoint.domain.org/path/to/list/_vti_bin/owssvr.dll?XMLDATA=1&List={921ACBDB-5060-4B35-A126-19E9A060513D}&View={F1275A60-A531-4F88-882E-E096721E53D0}'
Note: Even though the curly braces ({ and }) are not valid characters in a
URL, wget is able to use them, because it encodes them behind the scenes
as %7B and %7D respectively. By the way, if you need to encode
strings into valid URL form, have a look at URI::Escape.
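For instance, a quick sketch of encoding the list ID from above:

use URI::Escape;
my $list = uri_escape('{921ACBDB-5060-4B35-A126-19E9A060513D}');
print "$list\n";   # %7B921ACBDB-5060-4B35-A126-19E9A060513D%7D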
 

With curl

Curl requires stricter URL encoding, but downloads the XML just as
well with the command below (again, type it as a single line).
curl --insecure --ntlm --user username:password -o test_list.xml
'https://sharepoint.domain.org/path/to/list/_vti_bin/owssvr.dll?XMLDATA=1&List=%7B921ACBDB-5060-4B35-A126-19E9A060513D%7D&View=%7BF1275A60-A531-4F88-882E-E096721E53D0%7D'

Structure of the XML

Here’s a sample of the XML obtained with either of the commands above. In this
example I’m querying a list of tasks.
<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
     xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='ows_Attachments' rs:name='Attachments' rs:number='1'>
         <s:datatype dt:type='boolean' dt:maxLength='1' />
      </s:AttributeType>
      <s:AttributeType name='ows_LinkTitle' rs:name='Title' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      </s:AttributeType>
      <s:AttributeType name='ows_AssignedTo' rs:name='Assigned To' rs:number='3'>
         <s:datatype dt:type='string' dt:lookup='true' dt:maxLength='512' />
      </s:AttributeType>
      <s:AttributeType name='ows_Status' rs:name='Status' rs:number='4'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      </s:AttributeType>
      <s:AttributeType name='ows_Priority' rs:name='Priority' rs:number='5'>
         <s:datatype dt:type='string' dt:maxLength='512' />
      </s:AttributeType>
      <s:AttributeType name='ows_DueDate' rs:name='Due Date' rs:number='6'>
         <s:datatype dt:type='datetime' dt:maxLength='8' />
      </s:AttributeType>
      <s:AttributeType name='ows_PercentComplete' rs:name='% Complete' rs:number='7'>
         <s:datatype dt:type='float' dt:maxLength='8' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:row ows_Attachments='0' ows_LinkTitle='Task #1' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
   <z:row ows_Attachments='0' ows_LinkTitle='Task #2' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
   <z:row ows_Attachments='0' ows_LinkTitle='Task #3' ows_AssignedTo='1;#Smith, John' ows_Status='Not Started' ows_Priority='(2) Normal' ows_DueDate='2011-02-28 00:00:00' ows_PercentComplete='0' />
</rs:data>
</xml> 
There are three main sections:
(1) The XML header, enclosed in the topmost <xml> tag.
(2) The schema, or data description, surrounded by the <s:Schema> tags.
(3) The data, enclosed by the <rs:data> tags.
The strategy, then, is to parse the schema first; once we have all the
available attributes, we can parse the data section and output
delimited columns.

The Perl code

I’m using Perl 5.12.1 with the following modules:
XML::Simple
XML::LibXML
XML::Simple allows you to choose a preferred parser, and I chose
XML::LibXML::SAX::Parser, since it supports namespaces, which are used in
the XML.
#!/usr/bin/perl
# --------------------------------------------------------------
my $VERSION = '$Id$';
# --------------------------------------------------------------
# Script Name:  xmlparse.pl
# Purpose:      To parse an XML file from a SharePoint list.
# --------------------------------------------------------------
use warnings; use strict;
use Getopt::Std; $Getopt::Std::STANDARD_HELP_VERSION = 1;
use XML::Simple;
$XML::Simple::PREFERRED_PARSER = "XML::LibXML::SAX::Parser";
# --------------------------------------------------------------


# --------------------------------------------------------------
# Print usage:
# --------------------------------------------------------------
sub print_usage {
    (my $usage = <<END_OF_HELP) =~ s/^ {4}//gm;

    Parse an XML file from a SharePoint list and output a
    delimited version of the data.

    USAGE:
    $0 [-k DELIMITER] file1 [file2 file3 file4 ... filen]
    $0 -h prints this help message.

    Available options:

        -o output file (optional -- STDOUT is the default).
        -k the output delimiter (TAB is the default)
        -H print a header with the output.
        -M print meta-data (schema) only.
        -R remove extraneous IDs or data types prepended to
           data elements

    EXAMPLE:
    $0 -o my_output.txt my_input.xml
    $0 my_input.xml >my_output.txt

END_OF_HELP

    print $usage;
}
sub HELP_MESSAGE { print_usage; }
sub VERSION_MESSAGE { print $VERSION; }
# --------------------------------------------------------------


# --------------------------------------------------------------
# Parse the command-line options:
# --------------------------------------------------------------
my %option; getopts("k:o:hHMR",\%option);
if ($option{h}) { print_usage; exit(0); }
my $odelm = "\t"; if ($option{k}) { $odelm = $option{k}; }
my $ofile = "-"; if ($option{o}) { $ofile = $option{o}; }
my $header  = 0; if ($option{H}) { $header  = 1; }
my $meta = 0; if ($option{M}) { $meta = 1; }
my $rcln = 0; if ($option{R}) { $rcln  = 1; }
if ($#ARGV < 0) { $ARGV[0] = "-"; }
# --------------------------------------------------------------


# --------------------------------------------------------------
# The main part of the script:
# --------------------------------------------------------------
my $xml = XML::Simple->new;
open my $OFILE, "> $ofile" or die "Couldn't open file $ofile: $!";
for my $ifile (@ARGV) {

    my $sp_list = $xml->XMLin("$ifile");
    my @attr;

    #
    # First, parse the Schema.
    # We are setting up a description hash (%desc) for every column in the
    # schema and putting all descriptions in an attributes array (@attr),
    # which is indexed by the "rs:number" attribute.
    for my $col (keys %{$sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}}) {
        my %desc;
        $desc{"name"} = $col;
        $desc{"rs_name"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"rs:name"};
        $desc{"type"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"s:datatype"}->{"dt:type"};
        $desc{"maxl"} = $sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"s:datatype"}->{"dt:maxLength"};
        $attr[$sp_list->{"s:Schema"}->{"s:ElementType"}->{"s:AttributeType"}->{$col}->{"rs:number"} - 1] = \%desc;
    }

    #
    # Print the meta-data if needed:
    # This includes every attribute for every column.
    # (Note that if we output the schema, we don't go into the data)
    if ($meta) {
        my @header = ("COL","FIELD NAME","VISIBLE NAME","DATA TYPE","LEN");
        print {$OFILE} join("$odelm",@header),"\n";
        for my $i (0..$#attr) {
            my @record;
            push @record, $i+1;
            push @record, $attr[$i]->{"name"};
            push @record, $attr[$i]->{"rs_name"};
            push @record, $attr[$i]->{"type"};
            push @record, $attr[$i]->{"maxl"};
            print {$OFILE} join("$odelm",@record),"\n";
        }
        next;
    }

    #
    # Print the header if needed.
    # (This is the header for the SharePoint list.)
    if ($header and not $meta) {
        my @header;
        for my $i (0..$#attr) {
            $header[$i] = $attr[$i]->{"rs_name"};
        }
        print {$OFILE} join("$odelm",@header),"\n";
    }
    #
    # Get the SharePoint list data and print it.
    unless($meta) {
        for my $row (@{$sp_list->{"rs:data"}->{"z:row"}}) {
            my @record;
            for my $i (0..$#attr) {
                if (exists($row->{$attr[$i]->{"name"}})) {
                    my $token = $row->{$attr[$i]->{"name"}};
                    if ($rcln) {
                        $token =~ s/^([0-9]+|float|string)\;\#//go;
                    }
                    push @record, $token;
                } else {
                    push @record, qq{};
                }
            }
            print {$OFILE} join("$odelm",@record),"\n";
        }
    }
}
close($OFILE);
# --------------------------------------------------------------

Running the script

If we’re just interested in the data, we can call the script as follows:
xmlparse.pl -RH test_list.xml > test_list.txt
This will give us a tab-delimited file.
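Since the output is tab-delimited, it can also be piped straight into tab2tbl.pl (described in the April 2012 post above) for a quick visual check; a hedged example:

xmlparse.pl -RH test_list.xml | tab2tbl.pl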

You can also output the schema of the list with the following command:
xmlparse.pl -M test_list.xml
I’ve reformatted it into an ASCII table below:
.------------------------------------------------------------.
| COL | FIELD NAME          | VISIBLE NAME | DATA TYPE | LEN |
+-----+---------------------+--------------+-----------+-----+
|   1 | ows_Attachments     | Attachments  | boolean   |   1 |
|   2 | ows_LinkTitle       | Title        | string    | 512 |
|   3 | ows_AssignedTo      | Assigned To  | string    | 512 |
|   4 | ows_Status          | Status       | string    | 512 |
|   5 | ows_Priority        | Priority     | string    | 512 |
|   6 | ows_DueDate         | Due Date     | datetime  |   8 |
|   7 | ows_PercentComplete | % Complete   | float     |   8 |
'-----+---------------------+--------------+-----------+-----'

Additional comments

Overall this code is fairly reusable, as the structure of the XML is consistent
across various SharePoint lists. And since the whole process is scriptable, it is easy
to automate updating a database table from a SharePoint list.

You can download the files mentioned above, here: xmlparse.pl,
test_list.xml.
It is possible to do the download step in Perl as well by using LWP::UserAgent,
but this is left as an exercise.