Text file processing with LINQ

After working on this problem the other day I started Googling looking for posts written about using LINQ for text file processing. I found the post Parsing textfiles with LINQ (or LINQ-to-TextReader) by Arjan Einbu.

LINQ shows us alternate ways to write code, introducing a more declarative coding paradigm. To use LINQ over the lines of a file, we can read all the lines in the file into a collection, and use LINQ over that collection. There’s some overhead to this; the need to read the entire file upfront and to fit the entire file in memory at once.

The solution was to create an extension method on TextReader for IEnumerable<string>. That post was followed up by another post, rather unfortunately titled, improving upon the solution using  TextFieldParser class in the Microsoft.VisualBasic.FileIO namespace, something I wasn’t aware existed and now find it odd this class is stuck well off in left field.

One of the reasons this subject interests me is I’ve been working with EDI files for awhile now and querying data directly from this file format would be really nice. For example, given a PO with line item segments like this:

PO1*1*36*CA*11.15*PE***VP*RRSKRC85*PI*0001111091127~
PID*F****PRSL ROMNE BBY TRAY ORGNC~
PO1*2*84*CA*11.15*PE***VP*RSMKRC85*PI*0001111091131~
PID*F****PRSL SPRG BBY TRAY ORGNC~
PO1*3*84*CA*11.15*PE***VP*RBSKRC85*PI*0001111091128~
PID*F****PRSL SPNCH BBY TRAY ORGNC~
PO1*4*72*CA*11.15*PE***VP*RHEKRC85*PI*0001111091126~
PID*F****PRSL SPRG W/HRB CLM ORGNC~

You can calculate the total quantity, highlighted in yellow, of all line items using LINQ like this:

using (var reader = new StreamReader("c:\\edi\\inbound\\850_09022009_1311_89.txt"))
{
    var query = (from line in reader.GetSplittedLines("*")
                            where line[0].Equals("PO1") && line[2].Length > 0
                            select Convert.ToInt32(line[2])).Sum();
... }

Using Arjan’s implementation of GetSpittedLines, that’s his name not mine for the extension method he wrote, you can apply logic to any of the columns from the file which is pretty cool.

Of course, there are a myriad of ways of doing the same thing but it’s interesting to have access to the columns allowing for calculations and querying. For my EDI work I’m using FileHelpers which works well though I really like this LINQ option. That said, I haven’t done any benchmarking so I’m not sure about the performance but most of the PO’s I’m working with are less than 4KB and the volume isn’t so great that this would be a major factor. At any rate, I hope you find useful for you too.

Btw, if you’re looking for custom EDI implementations feel free to contact me.