### Powershell Performance

Recently, I have grown fond of Powershell. As someone who is also responsible for a little administration from time to time, I quickly saw it as a language that could solve many mundane problems quickly and succinctly. Having originally come from a UNIX background, I could see the Powershell development team had taken the best features of Korn Shell and Perl and combined them with the .NET framework to produce a very powerful tool.

However, it's not all a bed of roses, as will become clear as you read on.

I have recently been analysing various files containing financial tick data. Typically there are around two million lines in a file, and each contains a comma-delimited string with a date, time, price and amount traded of a particular stock. For this analysis I needed to extract a single column from this file and save it in a new file, which is then used as input for GNU Octave.

## The Problem

This task is typical and can easily be done with a tool such as Perl or Awk. However, since I am trying to use more Powershell recently, I felt obliged to see how well it could do the job.

To start, I created a sample data file, testdata.csv, by using the following: -

PS> Set-Content -Path testdata.csv -Value ("1/1/2008, 09:00:00, 100, 1`n" * 2000000)

What I need to extract is the third column. In each case, I used Measure-Command to also measure script execution time: -

PS> Get-Content testdata.csv | % { ($_.Split(","))[2].Trim(" ") } > testdata.out

TotalSeconds      : 1896.8136657

Clearly, I kept busy whilst the command ran, but the result was initially quite a surprise. One should ask how much of this time is spent simply sending the data down Powershell's object pipeline?

PS> Get-Content testdata.csv > testdata.out

TotalSeconds      : 1292.2079663

So perhaps this is a deficiency of Powershell; one they might improve in V2. Fortunately, in this case we are dealing with a CSV file, so we can improve performance using Import-Csv. Here is another attempt: -

PS> Set-Content testdata.out (Import-Csv .\testdata.csv -Header D,T,P,V | % { $_.P })
TotalSeconds      : 333.3965842

## The Perl Way

Slightly better, but it still seems poor. And what do you do if your input file is delimited by more than a single character? I thought I should test the same problem using Perl.

PS> Copy-Item testdata.csv testdata; perl -nibak -e 's/[ \t]+//g; print \"\".(split(/,/, $_))[2].\"\n\"' testdata.out

TotalSeconds      : 42.8753894

A considerable improvement! Then I began to wonder whether it would be possible to include Perl within the Powershell pipeline. Unfortunately, this is not a simple case of placing Perl after the pipe character '|', since Powershell will not connect to the STDIN and STDOUT of a normal process. One needs to open a stream to the STDIN of a Perl process and feed it Powershell's pipeline '$_' object. Furthermore, the STDOUT of the Perl process needs to be collected and sent back to Powershell as pipeline objects.
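The plumbing described above is easier to see in a language that wraps it for you. Here is a minimal Python sketch of the same idea: spawn a child with its STDIN and STDOUT redirected, feed it the data, and collect what it writes back. The child here is a small Python one-liner standing in for perl.exe, so the sketch is self-contained.

```python
import subprocess
import sys

# Spawn a child process with STDIN/STDOUT redirected -- the plumbing
# the Powershell function has to do by hand with ProcessStartInfo.
# A Python one-liner stands in for perl.exe here.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    print(line.split(',')[2].strip())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

# communicate() writes to STDIN, closes it, and gathers all of STDOUT.
out, _ = child.communicate("1/1/2008, 09:00:00, 100, 1\n")
print(out)   # 100
```

Note that `communicate()` buffers the entire output in memory, which is exactly the limitation the function below runs into.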

## A Powershell Function

My first attempt produced the following Powershell function: -

Function Perl-Filter() {
    BEGIN {
        $si = New-Object System.Diagnostics.ProcessStartInfo
        $si.FileName = "C:\perl\bin\perl.exe"
        $si.Arguments = @'
-ne "s/[ \t]+//g; print ''.(split(/,/,$_))[2].\"\n\""
'@
        $si.UseShellExecute = $false
        $si.RedirectStandardOutput = $true
        $si.RedirectStandardInput = $true
        $p = [System.Diagnostics.Process]::Start($si)
    }
    PROCESS {
        $p.StandardInput.WriteLine($_)
        $p.StandardInput.Flush()
    }
    END {
        $p.StandardInput.Close()
        Write-Output $p.StandardOutput.ReadToEnd()
        $p.WaitForExit()
    }
}

This Powershell function starts by spawning a Perl process and redirecting its STDIN and STDOUT streams. During the processing stage, data is flushed into Perl's STDIN, and finally all data from the STDOUT stream is sent back down Powershell's pipeline via Write-Output. Note the following will not work: -

PS> Get-Content .\testdata.csv | Perl-Filter > testdata.out

One problem with this function is that all of Perl's output is kept in memory until the process ends. It would be nice to read data from the Perl process during the processing stage and send it down Powershell's pipeline as soon as it is ready. Sadly, due to a known problem with .Net's StreamReader implementation, a Read or Peek will block if the stream has not had any data sent through it. The only workaround I know of is to start a separate thread to manage the stream, and this is where Powershell V1 has its limitations.

Another problem is that the function simply hangs on a large file: Perl is blocked from writing to its STDOUT once the pipe buffer between the two processes is full, and this buffer is usually only around 8KB.
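The separate reader thread mentioned above is the standard fix for both problems. A minimal Python sketch of the pattern, with a Python one-liner again standing in for perl.exe: one thread drains the child's STDOUT into a queue while the parent keeps writing to its STDIN, so neither side ever blocks on a full pipe.

```python
import queue
import subprocess
import sys
import threading

# The child simply echoes its input; it stands in for perl.exe.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    sys.stdout.write(line)"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

q = queue.Queue()

def drain(stream, out_queue):
    # Blocking reads are safe here: only this thread touches STDOUT,
    # so the child is never stalled on a full pipe buffer.
    for line in stream:
        out_queue.put(line.rstrip("\n"))

reader = threading.Thread(target=drain, args=(child.stdout, q))
reader.start()

# Write far more than one pipe buffer's worth of data; without the
# reader thread this loop would eventually deadlock.
for i in range(50_000):
    child.stdin.write(f"line {i}\n")
child.stdin.close()

reader.join()
child.wait()
print(q.qsize())   # 50000
```

The queue here plays the same role as the ObjectQueue in the Cmdlet described next.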

## A Powershell Cmdlet

So, how does one allow a pipeline to access another process's STDIN and STDOUT streams? Well the answer appears to be that one has to write a Cmdlet using C# or VB.Net.

The general layout is similar to that of the function above. One must implement three main methods, BeginProcessing, ProcessRecord and EndProcessing, each corresponding to the BEGIN, PROCESS and END blocks above. Building a new Powershell Cmdlet is made very simple by using David Aiken's Visual Studio template.

I chose to follow a similar structure to the Powershell function above, spawning my process in the begin block but also starting a special thread that monitors the STDOUT of this process. The thread looks for a line-delimiter sequence in the stream; as these are discovered, records are broken off and pushed into an ObjectQueue. Here is an excerpt from the thread: -

    List<char> dataQueue = new List<char>();
    char[] separatorCharArray = Separator.ToCharArray();
    char[] buffer = new char[16 * 1024];
    int totalObjectsQueued = 0;
    int bytesRead;

    // keep draining the child process's redirected StandardOutput
    while ((bytesRead = stdoutReader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // append the newly read chars to the working queue
        for (int i = 0; i <= bytesRead - 1; i++)
            dataQueue.Add(buffer[i]);

        // pump out any complete objects
        int searchFrom = 0;
        int index;
        while ((index = dataQueue.IndexOf(separatorCharArray[0], searchFrom)) >= 0)
        {
            // stop if not enough chars remain to complete a match
            if (dataQueue.Count < index + separatorCharArray.Length)
                break;

            // check it's a complete match, enqueue the record if it is
            bool completeMatch = true;
            for (int i = 1; i <= separatorCharArray.Length - 1; i++)
                completeMatch &= dataQueue[index + i] == separatorCharArray[i];

            if (completeMatch)
            {
                oq.Enqueue(new string(dataQueue.GetRange(0, index).ToArray()));
                totalObjectsQueued++;
                dataQueue.RemoveRange(0, index + separatorCharArray.Length);
                searchFrom = 0;
            }
            else
            {
                searchFrom = index + 1;  // false start; keep scanning
            }
        }
    }

    // enqueue any remaining chars as the final record
    if (dataQueue.Count > 0)
    {
        oq.Enqueue(new string(dataQueue.ToArray()));
        dataQueue.Clear();
    }


The ObjectQueue is then de-queued to the Powershell pipeline during the ProcessRecord and EndProcessing stages, quite simply as: -

    while (ObjectQueue.Count != 0)
WriteObject(ObjectQueue.Dequeue());

The advantage of this structure is that one can choose processes that output data in other forms; perhaps even binary files.
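The subtle part of the scanning logic above is a separator that straddles a read boundary. The same idea can be sketched more compactly in Python; this is an illustration of the technique, not the Cmdlet's actual code, with a two-character separator deliberately split across chunks:

```python
def split_stream(chunks, separator):
    """Incrementally split streamed text on a separator that may
    straddle a chunk boundary -- the same job the Cmdlet's monitor
    thread performs on the child's STDOUT."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        while True:
            index = pending.find(separator)
            if index < 0:
                break                     # no complete record yet
            yield pending[:index]         # a full record is ready
            pending = pending[index + len(separator):]
    if pending:
        yield pending                     # flush the final partial record

# The separator "\r\n" arrives split across two chunks.
records = list(split_stream(["alpha\r", "\nbeta\r\ngam", "ma"], "\r\n"))
print(records)   # ['alpha', 'beta', 'gamma']
```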

Finally measuring the commands output: -

PS> Get-Content .\testdata.csv | Get-ProcessPipe -ProcessPath perl.exe -Arguments '-ne "s/\s//g; print q().(split(/,/, \$_))[2].qq(\n)" ' > testdata.out
TotalSeconds      : 1683.0998587

This is only a small improvement on the original .Net string-split attempt above.

## Conclusion

There are many advantages to using Powershell for administration tasks. However, when processing more than 100,000 objects through its pipeline, your task's performance will take a big hit. This is where it is worth falling back on more tried and tested tools such as Perl or Python and using traditional techniques. Powershell's object layer is very powerful; however, it would be nice if it could detect whether it was processing an object stream or a plain text stream and behave accordingly.
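For reference, the whole extraction task in Python follows the same streaming pattern as the Perl one-liner. This sketch uses in-memory stand-ins for testdata.csv and testdata.out so it is self-contained:

```python
import io

# Extract the third comma-delimited column, stripping whitespace,
# streaming line by line so memory stays flat even at two million rows.
# io.StringIO stands in for the real input and output files.
src = io.StringIO("1/1/2008, 09:00:00, 100, 1\n" * 1000)
dst = io.StringIO()

for line in src:
    dst.write(line.split(",")[2].strip() + "\n")

print(dst.getvalue().splitlines()[0])   # 100
```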

I felt having access to a non-Powershell process as a Cmdlet might be useful so I have posted my code here on Google if you would like to play with it. Please note it has not been tested thoroughly and is bound to have many bugs.