Counting blank lines in a large text file with C# and .NET
For my test, I generated a file containing 250,000 blank lines scattered among roughly 25 million lines in total (about 1% of lines are blank), which comes to just over 250 MB. This should be big enough to expose the differences between counting methods.
private static void BuildFileWithBlankLines(string path, uint blanks)
{
    var random = new Random();
    uint counter = 0; // blank lines written so far
    using (var sw = File.CreateText(path))
    {
        // Keep writing until the requested number of blank lines has been written
        while (counter < blanks)
        {
            var isBlank = (random.Next(100) == 0); // ~1% of lines blank
            sw.WriteLine(isBlank ? "" : "NOT BLANK");
            if (isBlank) { counter++; }
        }
    }
}
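The post doesn't show how the durations quoted below were measured. One plausible way, assumed here rather than taken from the original test, is to wrap each counting method in a System.Diagnostics.Stopwatch. The Timing class and Measure method are hypothetical names used only for illustration.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the supplied function once and returns how long it took (in
    // milliseconds) together with whatever the function returned.
    // Assumption: a single run per method; no warm-up or averaging.
    public static (long ElapsedMs, T Result) Measure<T>(Func<T> action)
    {
        var stopwatch = Stopwatch.StartNew();
        var result = action();
        stopwatch.Stop();
        return (stopwatch.ElapsedMilliseconds, result);
    }
}
```

A call such as Timing.Measure(() => File.ReadAllLines(path).Count(s => s.Length == 0)) would then return both the elapsed time and the count. Single-run figures like these are sensitive to disk caching, so treat them as rough comparisons.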
1. Regular expression matching in memory
var content = File.ReadAllText(path);
var re = new Regex(@"^\r?\n", RegexOptions.Multiline);
var count = re.Matches(content).Count;
This technique loads the whole text file into a single string using File.ReadAllText, which is really just a wrapper for StreamReader.ReadToEnd. It decodes the bytes from a file stream into characters and accumulates them in a StringBuilder. The load is relatively quick, taking about 8% of the total processing time. Once the file is in memory, a regular expression finds and counts every line that consists of nothing but its line terminator. This is, not surprisingly, glacially slow. Total duration was 16,531 ms.
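To see what the pattern actually matches, it can help to run it against a small in-memory string rather than a file. This is a minimal sketch; the RegexBlankLineDemo class and CountBlankLines method are illustrative names, not part of the original test code.

```csharp
using System.Text.RegularExpressions;

public static class RegexBlankLineDemo
{
    // Same pattern as above. With RegexOptions.Multiline, ^ matches at the
    // start of the input and immediately after every \n, so the pattern only
    // matches a line that holds nothing except its terminator (\n or \r\n).
    private static readonly Regex BlankLine =
        new Regex(@"^\r?\n", RegexOptions.Multiline);

    public static int CountBlankLines(string content)
    {
        return BlankLine.Matches(content).Count;
    }
}
```

For instance, CountBlankLines("a\n\nb\n") returns 1, and the Windows-style "a\r\n\r\nb\r\n" also returns 1. A final empty line with no terminator is not matched, and neither is a line containing only spaces or tabs.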
2. Counting line by line through a string array
var count = File.ReadAllLines(path).Count(s => s.Length == 0);
I like this approach because it’s simple and readable. The first step is similar to the above, but this time uses a StreamReader (via File.ReadAllLines) to load the file into a string array, rather than into a single string. The LINQ extension method Enumerable.Count then loops through the array and tests each line against a predicate that checks for empty strings. This takes less than half the time of the regex approach, only 7,006 ms.
3. Streaming through the file
uint count = 0;
using (var sr = File.OpenText(path))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (line.Length == 0) { count++; }
    }
}
Unlike the first two, this last approach doesn’t load the entire file into memory. It uses File.OpenText to create a new StreamReader, then reads each line from the FileStream counting blank lines as it goes along. This is blazing fast compared to the other two, taking just 2,108 ms.
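If you prefer the readability of the LINQ one-liner but want streaming behaviour, File.ReadLines (as opposed to File.ReadAllLines) returns a lazy sequence of lines. The sketch below was not part of the timed test, so no duration is claimed for it; the LazyBlankLineCounter class name is illustrative.

```csharp
using System.IO;
using System.Linq;

public static class LazyBlankLineCounter
{
    // File.ReadLines returns a lazy IEnumerable<string>: each line is read from
    // disk as Count enumerates it, so the file is never held in memory in full.
    public static int CountBlankLines(string path)
    {
        return File.ReadLines(path).Count(line => line.Length == 0);
    }
}
```

Count returns an int, so for files that could contain more than about two billion blank lines, LongCount would be the safer choice.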
Conclusion
There’s a nearly eightfold difference between the best and worst methods (2,108 ms versus 16,531 ms). So if you’re working with large files, try not to load them into memory first, but stream directly from disk instead.
This applies especially to parsing large data files before loading them into a database. In some cases, it’s not just slow to load into memory first, but impossible.
For example, a 500 GB XML file wouldn’t load easily into an XDocument, but it could be streamed with an XmlReader, with individual elements parsed by XNode.ReadFrom. Similarly, a huge CSV file couldn’t easily be loaded into one string and then broken up with String.Split in a single pass. Instead, you’d probably parse it line by line with a StreamReader, or use a streaming CsvReader library.
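The XmlReader plus XNode.ReadFrom combination can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the XmlStreaming class, the StreamElements method and the elementName parameter are hypothetical names, and the caller is assumed to create and dispose the XmlReader.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Yields each element with the given name, one at a time, so only the
    // current element's subtree is materialised as an XElement.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string elementName)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // XNode.ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so Read() must not be called again
                // here or an adjacent matching element would be skipped.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

With a file, the reader would typically come from XmlReader.Create(path) inside a using block, and the returned sequence would be consumed with foreach before the reader is disposed.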