Combining two IRC logs into one with PowerShell: A Cautionary Tale
03 February 2016
I have a reliable Unix shell that I use for running irssi (an IRC client) over SSH. The caveat is that the server the shell is on uses Very Expensive SCSI hard drives, which is why they can only afford to give users 512MB of disk. Or that’s what the sysadmin tells me, at least. Either way, it’s 20 eurobucks a year and very stable so I’m content.
This small quota tends to fill up rather quickly with logs from the 20 or so channels I’m on. Thus I must periodically delete the logs or else I’ll run out of disk space. Whenever I do this, I end up with two separate log files: the old one which I’ll download and store locally, and the new one which starts slowly rebuilding on the server.
Because I’d prefer to have one uniform logfile that contains everything that ever happened on that channel when I was on it, every time I reset the logs I need to append the new logfile to the old one. A very simple task, but strenuous to do by hand for dozens of channels and queries.
So for the longest time I’ve had a task on the backburner: write a script that automatically appends new logs to old ones. This, too, is a simple task. Or should be. One line in bash and you’re done.
But I’m on Windows. I don’t have Bash, and I don’t want to install Cygwin because I’ve been told it’s shit and don’t care to find out if this claim is true.
Windows is in Microsoftland, and in Anno Domini 2016 inhabitants of Microsoftland use Microsoft(r) PowerShell(tm). Off I went to figure out how to write an extremely simple little script in PowerShell, with which I have zero prior experience.
An hour and dozens of DuckDuckGo searches later, I emerge from underneath a million StackExchange tabs, carrying with me the scripture:
The best part: it actually works. Kinda. Except it doesn’t. It ruins character encoding, and I can’t for the life of me figure out why.
My IRC logs are encoded in UTF-8. If I tell the script to use UTF-8 encoding, special characters like ä and ö come out garbled in the UTF-8 encoded log file that irssi created on the Unix server. However, in a locally created test file that also uses UTF-8, they’re completely fine.
On the other hand, if I tell the script to use the “default” encoding (which “uses the encoding of the system’s current ANSI code page”, whatever that is), the IRC log file turns out fine while the test file becomes garbled.
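One plausible explanation for this mirror-image behaviour, offered as a guess rather than a diagnosis: the script may have read the files with something that treats a file without a byte order mark (BOM) as ANSI and only recognises UTF-8 when a BOM is present. As far as I know, that is how Get-Content behaves in the Windows PowerShell of this era. If the irssi log has no BOM and the local test file has one (both assumptions), the symptoms line up. The sketch below is Python, not PowerShell. The sample line is made up and cp1252 stands in for "the ANSI code page".

```python
# Hypothetical sample text; the real log lines are not part of this post.
line = "päivää, öljy"

# Assumption: irssi wrote plain UTF-8 without a BOM.
utf8_bytes = line.encode("utf-8")

# Reader sees no BOM and decodes as ANSI (cp1252 assumed here).
misread = utf8_bytes.decode("cp1252")

# Writer told to use UTF-8: the misread text gets encoded a second time.
double_encoded = misread.encode("utf-8")
print(double_encoded.decode("utf-8"))   # garbled: the damage is now permanent

# Writer told to use "default" (ANSI): the misreading is undone by accident.
round_trip = misread.encode("cp1252")
print(round_trip == utf8_bytes)         # bytes identical to the original

# Assumption: the local test file starts with a UTF-8 BOM, so it is read correctly.
bom_bytes = b"\xef\xbb\xbf" + utf8_bytes
read_ok = bom_bytes.decode("utf-8-sig")

# Written back as ANSI, it is no longer valid UTF-8 and looks garbled in a UTF-8 viewer.
ansi_bytes = read_ok.encode("cp1252")
print(ansi_bytes.decode("utf-8", errors="replace"))
```

In other words, under these assumptions the "default" setting only appears to work on the real log because the wrong decoding and the wrong encoding cancel each other out.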
After some research I discover an alternate command for appending to files. This command doesn’t care about character encoding – I don’t know how it can do this when encoding is such a big deal to the other one. I replace the Out-File line with this:
Miraculously, both the real logfile and the test logfile pass through unscathed! It seems that the script is actually working now. Further testing is required, but I’ll leave that for tomorrow.
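A command can afford not to care about encoding if it copies raw bytes and never decodes them into text. Whether the replacement command works exactly this way is an assumption on my part. Here is a minimal Python sketch of the idea (again, not PowerShell), with made-up file names and contents:

```python
import os
import shutil
import tempfile


def append_bytes(old_path, new_path):
    """Append the contents of new_path to old_path byte for byte."""
    # Binary mode on both sides: nothing is decoded, so nothing can be mangled.
    with open(new_path, "rb") as src, open(old_path, "ab") as dst:
        shutil.copyfileobj(src, dst)


# Hypothetical demo files standing in for the downloaded and the fresh log.
workdir = tempfile.mkdtemp()
old_log = os.path.join(workdir, "old.log")
new_log = os.path.join(workdir, "new.log")

with open(old_log, "wb") as f:
    f.write("12:00 <a> päivää\n".encode("utf-8"))
with open(new_log, "wb") as f:
    f.write("12:01 <b> öitä\n".encode("utf-8"))

append_bytes(old_log, new_log)

with open(old_log, "rb") as f:
    combined = f.read()
print(combined.decode("utf-8"))
```

Because the bytes are never interpreted, this kind of append works the same whether the files are UTF-8, ANSI, or a mix. It also means a mix stays a mix, and a BOM at the start of the appended file would end up in the middle of the combined one.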
P.S. Note-to-self: get a blog theme with better looking code tags.