The iconv slurp misfeature
Since I got bitten by this recently, let me blog a quick warning here:
glibc iconv
- a utility for character set conversions, like iso8859-1 or
windows-1252 to utf-8 - has a nasty misfeature/bug: if you give it data on
stdin it will slurp the entire file into memory before it does a single
character conversion.
Which is fine if you're running small input files. If you're trying to convert a 10G file on a VPS with 2G of RAM, however ... not so good!
This looks to be a known issue, with patches submitted to fix it in August 2015, but I'm not sure if they've been merged, or into which version of glibc. Certainly RHEL/CentOS 7 (with glibc 2.17) and Ubuntu 14.04 (with glibc 2.19) are both affected.
Once you know about the issue, it's easy enough to workaround - there's an iconv-chunks wrapper on github that breaks the input into chunks before feeding it to iconv, or you can do much the same thing using the lovely GNU parallel e.g.
gunzip -c monster.csv.gz | parallel --pipe -k iconv -f windows-1252 -t utf8
Nasty OOM avoided!
blog comments powered by Disqus