Tuesday, July 29, 2014

OmegaT will fail to open a project if a git repo is in the project's parent directory or current directory

A few weeks ago, I ran into an issue in which OmegaT hangs when trying to open a project. The log stops right after the git status check starts:

15256: Info: Docking Framework version: 2.1.4 
15256: Info: Event: application startup (LOG_INFO_EVENT_APPLICATION_STARTUP)
15256: Info: Click on 'projectMenu' menu item (LOG_MENU_CLICK)
15256: Info: Click on 'projectMenu' menu item (LOG_MENU_CLICK)
15256: Info: Click on 'projectOpenMenuItem' menu item (LOG_MENU_CLICK)
15256: Info: Git 'status' execution start (GIT_START)

This happens because OmegaT checks for team translation projects managed with git. The issue is discussed on the OmegaT support forums in this thread.

The workaround is to remove the offending .git folder from the translation project's parent directory or from the translation project's own folder.
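
Before removing anything, it can help to see which .git folders are actually in the way. The following is a minimal sketch, assuming only the two locations named above (the project folder and its parent) matter; the function name and its argument are made up for illustration, and nothing is deleted.

```shell
#!/bin/sh
# Print every .git entry found in an OmegaT project folder or its parent.
# Nothing is deleted; the function only reports what it finds.
find_offending_git() {
    # $1: path to the translation project (hypothetical argument)
    project_dir=$(cd "$1" && pwd) || return 1
    parent_dir=$(dirname "$project_dir")
    for dir in "$project_dir" "$parent_dir"; do
        if [ -e "$dir/.git" ]; then
            printf '%s\n' "$dir/.git"
        fi
    done
}
```

Call it with the path to your project, for example `find_offending_git ~/path/to/project` (a placeholder path). If the .git folder it reports belongs to a repo you still need, moving the translation project elsewhere is a safer fix than deleting the repo.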

Tuesday, July 22, 2014

How to fix broken CJK filenames extracted from zip archive created in MS Windows environment

Most computers run some form of MS Windows, which doesn't natively use UTF-8 for character encoding. This can cause problems for Linux users who have to work with filenames that contain CJK (Chinese, Japanese, Korean) characters and come from a Windows environment. For single files sent as email attachments through Gmail, Google is smart enough to detect what code page the filename is encoded in and convert it to UTF-8 when the attachment is downloaded to a POSIX environment.

For archive files like .zip, however, compressed files named using CJK characters in a MS Windows environment will appear as gibberish in a UTF-8 locale.

Rather than booting up a Windows VM just to extract files from an archive, a faster method is to extract the compressed files while maintaining their original filename character encoding.

The following example uses this .zip file, which was created on a Korean version of MS Windows. Korean characters on Windows are encoded using Code Page 949, a superset of EUC-KR, which is the most widely used character encoding in Korea.

First I will extract the file using 7z from the CLI, but in a modified environment with a different locale, created by running the command through env with a LANG=... variable assignment. This method was first described by developer Allen Choong in this post from 2013, in which he details converting filenames encoded in MS Windows GBK (Code Page 936) Simplified Chinese to UTF-8 after extraction from an archive file.

[archjun@arch Downloads]$ env LANG=C 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/������(����).doc.docx
Extracting  Korean Writers(2009)/������.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578

You can see that the extracted filenames look mangled: they still carry their original non-UTF-8 encoding, which a UTF-8 terminal cannot display.

Note that the LANG variable can also be set to euc-kr or cp949 instead of C, which will also maintain the original filename character encoding for archive files created in Korean Windows.

Next we need to convert the gibberish filenames from EUC-KR/CP949 to UTF-8 using convmv, which according to its man page "converts filenames from one encoding to another":

[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
mv "/home/archjun/Downloads/Korean Writers(2009)/������(����).doc.docx" "/home/archjun/Downloads/Korean Writers(2009)/편혜영(영문).doc.docx"
mv "/home/archjun/Downloads/Korean Writers(2009)/������.hwp" "/home/archjun/Downloads/Korean Writers(2009)/편혜영.hwp"
Ready!

For the -f (from encoding) flag, you can also use euc-kr and the filename conversion will work just fine. The -r flag tells convmv to convert filenames recursively (all files in the directory and its subdirectories).

The --notest flag must be added for convmv to actually rename the files; without it, convmv only prints what it would do. As you can see above, the � gibberish characters have been converted to readable Korean.

In Allen's original post referred to above, he makes the important point that if you just naively extract an archive that contains filenames encoded in non-UTF8 characters onto a system with a UTF-8 locale, the gibberish filenames will automatically be encoded as UTF-8 but still be unreadable. If this happens, you will not be able to convert the mangled filenames to UTF-8 because they are in UTF-8 already!

For example,

[archjun@arch Downloads]$ 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Extracting  Korean Writers(2009)/ÆíÇý¿µ.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578
[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ.hwp
Ready!

In the case above, we didn't specify a character encoding for the extracted filenames, so 7z defaults to the character encoding in our locale, which is en_US.UTF-8.

Because of this, when we try to use convmv to convert from a MS Windows character encoding to UTF-8, convmv tells us that the filenames are already in UTF-8 and therefore cannot be converted!
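
The mangled names in the transcript above are what you get when each EUC-KR byte is read as a separate Latin-1 character and the result is stored as UTF-8. That reading is inferred from the output, not from 7z's source code. The sketch below reproduces the effect on a plain string with iconv, which is why convmv sees valid UTF-8 and skips the files.

```shell
#!/bin/sh
# Reproduce the mangled name from the transcript using iconv only.
name='편혜영'

# Encode as EUC-KR (the bytes stored in the archive), then misread each
# byte as a Latin-1 character and store the result as UTF-8.
mangled=$(printf '%s' "$name" | iconv -f UTF-8 -t EUC-KR | iconv -f ISO-8859-1 -t UTF-8)

# Reverse the two steps: back to single bytes, then decode them as EUC-KR.
restored=$(printf '%s' "$mangled" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f EUC-KR -t UTF-8)

printf '%s -> %s -> %s\n' "$name" "$mangled" "$restored"
```

The middle value should match the ÆíÇý¿µ seen in the 7z output. The reverse step works on this string because every byte has a Latin-1 counterpart; treat it as a string-level illustration rather than a tested recipe for repairing files that were already extracted the naive way.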

The same holds true for other archive extractors like unzip, file-roller, etc. So don't forget to prefix the archive extraction command with env and a LANG setting (euc-jp, euc-kr, shift_jis, gbk, etc.) to create a modified environment. That way the extracted filenames keep their original character encoding, which is what makes conversion with convmv possible!

Postscript 2014-12-21:

Once you have converted filenames from a Windows text encoding like euc-kr to UTF-8, you may also need to convert the text inside a plain text file (not a binary like .doc, .hwp, etc.) created in a Windows environment into UTF-8.

The Linux command for converting text within a file to another encoding is iconv. Let's assume we have a file, someText.txt, that was created in Windows and that contains Korean characters encoded in euc-kr. To convert to UTF-8 you can invoke iconv with the following flags:

iconv -c -f euc-kr -t utf8 someText.txt > someTextUTF-8.txt

-c  Silently discard characters that cannot be converted instead of
      terminating when encountering such characters.

-f  from-encoding (input characters)

-t  to-encoding (output characters)

The invocation above reads in someText.txt in euc-kr encoding and redirects output to someTextUTF-8.txt in UTF-8 encoding.
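
As with filenames, converting a file that is already UTF-8 a second time produces garbage. The following is a minimal sketch of a guard, assuming the input is either valid UTF-8 or euc-kr; the function name to_utf8 is made up for illustration. It leaves out -c on purpose so that unconvertible characters cause an error instead of being dropped silently.

```shell
#!/bin/sh
# Convert a EUC-KR text file to UTF-8, unless it is already valid UTF-8.
# Usage: to_utf8 input.txt output.txt   (hypothetical function name)
to_utf8() {
    src=$1
    dst=$2
    # iconv exits non-zero when the input is not valid in the -f encoding,
    # so a UTF-8 to UTF-8 pass doubles as a validity check.
    if iconv -f UTF-8 -t UTF-8 "$src" >/dev/null 2>&1; then
        cp "$src" "$dst"
    else
        iconv -f EUC-KR -t UTF-8 "$src" > "$dst"
    fi
}
```

Pure ASCII files pass the validity check and are copied unchanged, which is harmless because ASCII is identical in both encodings.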

Tuesday, July 15, 2014

smtp-cli - send email from the Command Line without a mail server

smtp-cli is a Perl script that enables you to send mail from the command line without having to set up a local mail server. It can interface with webmail smtp servers like smtp.gmail.com and supports TLS.

A list of optional flags including --user --pass --from --to --subject can be seen at the bottom of the smtp-cli Perl script on GitHub.

Some caveats when trying to send mail through Gmail:

1. When you use the flags --user username --pass foo, you cannot use your regular Gmail password; you must generate an application-specific password instead.

If you don't specify a password beforehand with --pass, you will be interactively prompted to enter your password as soon as smtp-cli runs:

Enter password for username@smtp.gmail.com :

2. If you want to set up a cron job to run smtp-cli, make sure you specify your app-specific password after the --pass flag; otherwise cron will not be able to handle the interactive prompt.

3. To be able to send multilingual emails, be sure to pass the flag
--charset=UTF-8
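
For the cron case, one option is a small wrapper that reads the app-specific password from a file and always passes --pass and --charset=UTF-8. This is only a sketch: SMTP_CLI, PASS_FILE, GMAIL_USER and send_mail are names invented here, and the only smtp-cli flags used are the ones already mentioned in this post.

```shell
#!/bin/sh
# Hypothetical wrapper for cron. SMTP_CLI, PASS_FILE and GMAIL_USER are
# names made up for this sketch; only the smtp-cli flags come from the post.
SMTP_CLI=${SMTP_CLI:-smtp-cli}
PASS_FILE=${PASS_FILE:-$HOME/.smtp-cli-pass}

send_mail() {
    # $1: subject, $2: plain-text body
    app_pass=$(cat "$PASS_FILE") || return 1
    "$SMTP_CLI" --host=smtp.gmail.com --enable-auth \
        --user "$GMAIL_USER" --pass "$app_pass" \
        --from "$GMAIL_USER@gmail.com" --to "$GMAIL_USER@gmail.com" \
        --subject "$1" --body-plain="$2" --charset=UTF-8
}
```

A script run from cron could define this function and end with a send_mail call. Keep the password file readable only by your own user, since it holds a working credential.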


Here is an example invocation of smtp-cli that will send an email to my personal Gmail account:

[archjun@arch smtp-cli]$ smtp-cli --verbose --host=smtp.gmail.com --enable-auth --user gojun077 --pass 123456789abcdefg --from gojun077@gmail.com --to gojun077@gmail.com --subject "testing smtp-cli script #2" --body-plain="This is the body. 이건 바디다." --charset=UTF-8

When the email arrives in my Gmail inbox, it contains the following UTF-8 encoded text in the body:

This is the body. 이건 바디다.


I originally heard about a similar Perl script called sendemail (not to be confused with the mail server sendmail), but it hasn't been updated since 2009 and currently doesn't work due to an SSL bug.

smtp-cli, on the other hand, is comparatively recent with version 3.6 updated July 11, 2013 and even has a package available through AUR. It works great!

Tuesday, July 8, 2014

MyRepos, aka 'mr' -- tool for managing multiple VCS repos

We have 10+ repos under ~/ and it is a real PITA to push all changes manually to our different remote repositories (some on Bitbucket, others on Github).

There is a great tool called MyRepos written by Debian Developer Joey Hess (well-known for debian-installer, debhelper, git-annex and lots of other popular packages) that allows you to manage multiple VCS repositories with a single command.

To set up mr to work with your repo, first cd into the directory containing your repo and type:

mr register

so myrepos can make a note of the remote repo target.
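
With 10+ repos, even the registration step gets tedious. The helper below is a minimal sketch that only lists the git working directories under a folder, so you can review the list before registering anything; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the working directory of every git repo found below a directory.
list_git_repos() {
    # $1: directory to search; .git directories are reported, not entered
    find "$1" -type d -name .git -prune | while IFS= read -r git_dir; do
        dirname "$git_dir"
    done
}
```

Once the list looks right, the registration described above can be looped over it: `list_git_repos ~/Documents | while IFS= read -r repo; do (cd "$repo" && mr register); done`. The subshell keeps each cd from affecting the next iteration.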

Once that's done for all the repos you want to manage with mr, exit out of the directories under version control and run your VCS command prefaced by mr to update multiple repos at once. Here's an example of pushing local changes to 7 different remote git repositories:

[archjun@arch ~]$ mr push -u origin master
mr push: /home/archjun/Documents/MyProjects/KTokenizer
Counting objects: 3, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 475 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
   3986b86..541569e  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/WhatYoureReallyMeantToDo
Counting objects: 15, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 4.52 KiB | 0 bytes/s, done.
Total 15 (delta 6), reused 0 (delta 0)
   89e72c6..a55889f  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/cchallenges
Counting objects: 28, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (27/27), done.
Writing objects: 100% (28/28), 5.37 KiB | 0 bytes/s, done.
Total 28 (delta 10), reused 0 (delta 0)
   3158800..d810c82  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/coursera_IMT
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 953 bytes | 0 bytes/s, done.
Total 6 (delta 4), reused 0 (delta 0)
   60e0628..987c4de  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/cs50x_2014
Counting objects: 44, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (42/42), done.
Writing objects: 100% (44/44), 4.75 KiB | 0 bytes/s, done.
Total 44 (delta 24), reused 0 (delta 0)
   03f6060..9cf44c8  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/idb_2014
Counting objects: 10, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 5.97 KiB | 0 bytes/s, done.
Total 10 (delta 5), reused 0 (delta 0)
   212016d..bf033fc  master -> master
Branch master set up to track remote branch master from origin.

mr push: /home/archjun/Documents/jun-project-euler
Username for 'https://github.com': gojun077
Password for 'https://gojun077@github.com': 
Branch master set up to track remote branch master from origin.
Everything up-to-date

mr push: finished (7 ok)

A real time-saver!

Tuesday, July 1, 2014

How to track Brainworkshop and Mnemosyne repetitions in Beeminder (using Dropbox and IFTTT)

I'm a huge fan of Beeminder as it has helped me quantify progress on personal goals. Before Beeminder, I just wrote down my goals on a To-Do list and then felt bad when I failed to achieve them.

With Beeminder, however, any quantifiable goal can be automatically tracked, and if I go off-track, my credit card is automatically charged a penalty that increases exponentially with every failure.

Although some users are conscientious enough to manually enter data points for their Beeminder goals, I am just too lazy to enter data every day. Also, I have found that manual data entry poses a greater temptation to falsify data points (especially if you have lots of money on the line). I find StickK along with a human referee much better for goals that cannot be automatically tracked (e.g. x pushups* per day, etc. -- for this type of goal, I take a video selfie and share it on social media with my referee).

*Postscript 2015.12.04
It actually is possible to track pushups through the Beeminder Android app using a method described on the Beeminder blog (Nov. 21) and in a post by the Beeminder CEO. Details:

By swiping at the data entry box, you can use different methods to submit data to Beeminder...
Tally entry
Swipe the data entry area right (see the narrow yellow rectangles?) and then tap the counter. Now you can increment the count by touching anywhere on the screen. I use this to do pushups by putting my phone on the floor and touching my nose to it on each pushup. Use the back button to submit the data.

Brainworkshop is a working memory trainer using Dual N-Back. Mnemosyne is a Spaced Repetition Software (SRS) program similar to Anki but more streamlined. Both are Free and Open Source Software.

First make sure that both Brainworkshop and Mnemosyne are installed into folders within Dropbox.

In Arch Linux, Brainworkshop installs to ~/.brainworkshop by default, but you can create a symlink from this directory pointing to the actual Dropbox installation location (I go into detail on how to do this in this post).

Now create an IFTTT rule that uses RSS as the trigger. Dropbox conveniently provides each user with an RSS events feed that updates whenever a file is created or modified. Tell IFTTT to search this feed for the keyword .brainworkshop.stats

When this keyword is detected, make IFTTT send an email to bot@beeminder.com with your Beeminder ID and your goal name in the subject line in the following format:

User123/Mygoal555

In the body of the email, you want to increment your graph by one point, so the body of the email should read:

^ 1 "comment blah blah blah"

Note that there is a space between the caret and the 1. (I failed to notice this, and ended up spamming the Beeminder bot with a bunch of invalid emails; oops!)
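
To avoid the missing-space mistake, the subject and body can be generated rather than typed by hand. This is a small sketch of the format described above; both function names are made up for illustration.

```shell
#!/bin/sh
# Build the subject and body that the Beeminder mailbot expects,
# following the format described above. Function names are hypothetical.
beeminder_subject() {
    # $1: Beeminder user ID, $2: goal name
    printf '%s/%s\n' "$1" "$2"
}

beeminder_body() {
    # $1: amount to add, $2: comment; note the space after the caret
    printf '^ %s "%s"\n' "$1" "$2"
}

beeminder_subject User123 Mygoal555
beeminder_body 1 "comment blah blah blah"
```

The two calls at the end print the same subject and body lines shown in the example above.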

*Edit May 2015:

IFTTT will soon add an official channel for Beeminder. Once you activate it, you no longer need to send an email to the Beeminder mailbot as in the example above; instead, set your custom IFTTT recipe to increment a Beeminder goal by a fixed amount every time your trigger condition is satisfied.

The IFTTT rule for Mnemosyne is much the same, except the keyword you want to search for in the Dropbox RSS Events feed is default.db-journal

--------------------------------------------
Bug Alert: Sometimes this doesn't work as intended, because if multiple files are changed at the same time, the Dropbox events feed will record the event as:

You edited filename.foo and n more files

instead of indicating a change in a single file like default.db-journal or brainworkshop.stats which the IFTTT trigger is looking for in the RSS feed.
--------------------------------------------
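
The failure is easy to see with a plain substring match, which is roughly what a keyword trigger does. In this sketch the first title is an assumption modeled on the single-file case, the second follows the multi-file format quoted above, and title_matches is a made-up helper, not part of IFTTT.

```shell
#!/bin/sh
# Return success when a feed entry title contains the keyword.
title_matches() {
    # $1: entry title, $2: keyword the trigger searches for
    case "$1" in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

for title in \
    "You edited default.db-journal" \
    "You edited filename.foo and 3 more files"
do
    if title_matches "$title" "default.db-journal"; then
        echo "trigger fires: $title"
    else
        echo "trigger missed: $title"
    fi
done
```

When the journal file is folded into "n more files", its name never appears in the title, so no keyword can catch it.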

Here are the links to my shared IFTTT rules for Brainworkshop and Mnemosyne, respectively:

https://ifttt.com/recipes/159615-dropbox-brainworkshop-beeminder

https://ifttt.com/recipes/159617-dropbox-mnemosyne-beeminder

Feel free to edit these rules and replace the Dropbox RSS feed with your own. As of July 1st, 2014 I have been diligently doing my Brainworkshop and Mnemosyne reviews almost every day for 6 months straight!