Thursday, June 20, 2013

Two helpful uses of transposing text

From algebra we know that the transpose of a matrix is obtained by changing rows to columns and columns to rows.  This simple idea, of changing the arrangement of a set of objects can help us solve two common problems in writing SQL code and ETL-processing of flat files.


Building IN-clauses for SQL queries (converting columns to rows)

In a multi-query search of a database I've often encountered situations where there are a number of values, anywhere from 2 to 1000 (if you have more than this, you should consider loading them into their own temporary table), that I need to further research with a query.  This leads to the inevitable retyping of these values to get them into a query, or the use of MS Excel to manipulate a relatively small set of strings, which could be easily handled with a script.  If I happen to be working with SQL*Plus, for example, why switch to Excel, when I could do the whole thing from the command-line?

My examples were run from Cygwin running on Windows, but these can be easily adapted to Linux, or MacOS.

Let's say I've isolated a list of troublesome records to the five (5) record IDs shown below (they would come out of a SQL client in column form):
162692
573677
742713
817851
 994363 

Now we build the IN-clause by processing these through col2row.pl.  Let's say we have these values in the clipboard, then in Cygwin we can use the getclip command and pipe it's output through col2row.pl.

getclip |d2u |col2row.pl -f1 k,

The d2u command is optional, and I'm using it in Cygwin to remove any extra carriage-returns in the incoming clipboard data.  In MacOS X we would use pbpaste in place of getclip, and in Linux we could use the xclip -o or xsel -o commands.

The '-f1' makes it process the first (and only) column, and the '-k,' sets the output delimiter to a comma.  Here's the output:
162692,573677,742713,817851,994363
 If we want the values single-quoted, we simply use the '-s' option:
getclip |d2u |col2row.pl -f1 -s -k,
Here is the output: 
'162692','573677','742713','817851','994363'
Depending on your environment, you can pipe these directly into the clipboard:

getclip |d2u |col2row.pl -f1 -s -k, |putclip

 As stated earlier, you can use pbcopy/pbpaste, on Mac OS X or xclip/xsel on Linux.

A short-cut for users of AutoHotkey on Windows:

The following code will output comma-separated row from a single column stored in your clipboard, when you use the Win-c shortcut:
;-----------------------------------------------------------
; Win-c - Paste a row-version of the column in the clipboard
; CR+LF are replaced with commas, and entries are not quoted
;-----------------------------------------------------------
#c::
outstring := RegExReplace(clipboard, " *\R *quot;, "")
outstring := RegExReplace(outstring, " *\R *", "`,")
outstring := RegExReplace(outstring, "^ +", "")
outstring := RegExReplace(outstring, " +quot;, "")
clipboard := outstring
Send +{Ins}
return
;-----------------------------------------------------------
You can also build a shortcut to paste single quoted values, but that is left as an exercise.


Comparing two file layouts (converting rows to columns)

In loading data files into databases, one often encounters wide layouts with dozens, or even hundreds of columns.  If there is a change in a layout, it can be time-consuming to detect it, if the source has not supplied the exact information.  Also, if you have the headers in two separate files, running a diff will simply tell you that the two header lines differ, but won't tell you where the difference is.

What we need is a way to turn the header row into a column of values, so that if we have two such columns from two headers, the diff command will give us useful information about how the two headers differ.

In our trivial example, we have some tab-delimited contact data:
$ cat file1.txt
record_id first_name last_name email company_name address_1 city state zip
10001 John Smith jsmith@email.zz Acme Inc. 123 Elm Street Anytown AZ 50001
$  cat file2.txt
record_id first_name last_name email company_name address_1 address_2 city state zip
10001 John Smith jsmith@email.zz Acme Inc. 123 Elm Street Mail Stop 1 Anytown AZ 50001

To properly compare the headers, we run row2col.pl on each file:
$ row2col.pl file1.txt
record_id
first_name
last_name
email
company_name
address_1
city
state
zip
$ row2col.pl file2.txt
record_id
first_name
last_name
email
company_name
address_1
address_2
city
state
zip


Putting each of these in a separate "layout" file:
$ row2col.pl file1.txt >layout1.txt
$ row2col.pl file2.txt >layout2.txt
We can diff them and get a clearer picture of the layout change:
$ diff layout1.txt layout2.txt 
6a7> address_2
 A new column seven (7) called "address_2" was added in file2. 


Displaying sample records

When combining the output of row2col.pl with tab2tbl.pl (from a previous post), you can get a very clear view of the layout of the file with sample values:


The Scripts

You can download the Perl scripts used for the above examples here: col2row.plrow2col.pl, tab2tbl.pl.