Thursday, June 19, 2014

As of Jun 19, 2014, the Project Euler site is down :(

When I visited Project Euler on Thursday, I got the following page:

Apparently the site operators suspect their database was hacked. I hope they come back soon!

Update 2014-06-22: The site is back up, but logins are disabled and Project Euler banners listing how many problems a user has solved are no longer provided.

Tuesday, June 17, 2014

Google Code Jam 2014: First-ever Korean finalist Kim Chan-min (김찬민)!!!

Google Code Jam 2014 Round 3 was held last weekend (Jun 14), and the top 25 scorers will be going to LA to duke it out at the world finals along with the GCJ 2013 champion from Belarus.

Looking at the statistics of past Code Jams (previous years can be selected at the bottom of the page), there have been many finalists from Russia, Eastern Europe, China, Japan, America, etc., but no one from Korea (North or South) has ever made it to the onsite Final Round.

GCJ 2014 is thus a milestone of sorts for Korea, as Kim Chan-min (kcm1700) placed 18th in the world in Round 3. Like most of his fellow qualifiers for the final round, he used C++ in his solutions. One reason for the popularity of C++ in algorithmic coding competitions is that ACM-ICPC, the collegiate training arena for competitive coders, only accepts solutions in one of three languages (C, C++, or Java), and the International Olympiad in Informatics for high schoolers only accepts solutions in C, C++, or Pascal. This trend is also evident among the top competitors at elite algorithmic programming challenge sites like CodeForces and TopCoder.

So just who is kcm1700? He is a student at Korea's most prestigious college, Seoul National University, and is a CodeForces grandmaster. According to his Twitter blurb, he participated in IOI 2008 and made it to the ACM-ICPC World Finals in 2010.

This new generation of programmers has a chance to change Korea's software development culture, which lags 15 to 20 years behind that of Japan and America. I won't go into the problems of Korea's software industry in this post, but I believe that the culture must change before Korea has even a remote chance of becoming a good environment for hackers (some good articles in Korean about the myriad problems faced by local software engineers: 1 2 3).

Kudos to kcm1700 for making it to the GCJ 2014 World Finals and Good Luck!

Postscript 2014.09.25
kcm1700 scored 7th in the finals. A very respectable showing!

Tuesday, June 10, 2014

Handy ImageMagick tools for cropping book images prior to OCR

It's annoying to edit out page headers and footers that have been automatically OCR'd. One way to avoid this problem is to manually specify the OCR area, but this quickly becomes tedious if you have hundreds of pages to process.

Enter ImageMagick's convert command. Today I will talk about the -chop and -shave options for convert.

These options are a lifesaver when we want to remove unneeded areas (e.g. footers/headers, page numbers, etc.) that appear in a constant location across multiple pages.

A useful resource for the various image-cropping options available in convert can be found here:


Consider the following scan from a book:

Every page contains a page number at the bottom, and every other page has the book's title written vertically along the right margin. We want to crop all the book's pages so that the page number and vertical title won't appear in the final image. This will make OCR go much faster, as we won't have to manually select which area needs to be OCR'ed.
In this particular example, we can use

convert "스님의_주례사 - 0011.png" -gravity SouthEast -chop 250x200 SE_chopped250x200.png

This removes 250 pixels from the right (East) and 200 pixels from the bottom (South). In the case of the -crop flag, the reference point (0, 0) for all pixel calculations is the top-leftmost corner of the canvas. The -gravity flag, however, changes this (quote from the ImageMagick docs):

The direction you choose specifies where to position text or subimages. For example, a gravity of Center forces the text to be centered within the image. By default, the image gravity is NorthWest.

So -gravity SouthEast will make the reference point (0, 0) the bottom-rightmost corner of the canvas. Now the resulting chopped image looks like:

As you can see in the above image, the extraneous text from the right and bottom margins has been cropped out!
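To sanity-check a chop before running it on a whole book, it can help to work out what size should be left over. The sketch below assumes a hypothetical 1000x1400 pixel scan (the real dimensions of the example page aren't stated in this post) and uses a made-up helper name, chop_se_to_crop. It prints the -crop geometry that should keep the same region that -gravity SouthEast -chop leaves behind.

```shell
# Hypothetical helper (not part of ImageMagick): given an image's width and
# height and a chop of CWxCH taken from the SouthEast corner, print the
# -crop geometry that keeps the same top-left region.
chop_se_to_crop() {
    # $1 = image width, $2 = image height, $3 = chop width, $4 = chop height
    echo "$(( $1 - $3 ))x$(( $2 - $4 ))+0+0"
}

# Assumed page size of 1000x1400 pixels, chopped by 250x200 as in the example.
chop_se_to_crop 1000 1400 250 200    # prints 750x1200+0+0
```

Under those assumed dimensions, passing -crop 750x1200+0+0 +repage to convert should keep the same area as the -chop command above. The actual width and height of a scan can be read with ImageMagick's identify command before plugging numbers in.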

In other books, however, the location of extraneous text might be different. Let's say you want to remove text from both the top and bottom (or left and right) of the following screencap:

Eliminating the viewing frame can be accomplished with the convert flag -shave, which shaves pixels from the edges of an image (top & bottom, left & right). The arguments of -shave are:

... -shave [numPixelsLeftRightEdges]x[numPixelsTopBottomEdges]

Note that the brackets above should not actually be typed out. So if you wanted to remove 100 pixels from the left and right edges, you would pass the following arguments to -shave:

... -shave 100x0

If you want to remove 100 pixels from the top and bottom edges:

... -shave 0x100

To remove 100 pixels from both the top & bottom as well as left & right edges:

... -shave 100x100
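Because -shave takes pixels off both opposing edges, the image shrinks by twice the number given in each direction. A minimal sketch of that arithmetic, assuming a hypothetical 1000x800 image and a made-up helper name (shaved_size):

```shell
# Hypothetical helper: print the size an image should have after -shave.
shaved_size() {
    # $1 = width, $2 = height,
    # $3 = pixels shaved from EACH of the left and right edges,
    # $4 = pixels shaved from EACH of the top and bottom edges
    echo "$(( $1 - 2 * $3 ))x$(( $2 - 2 * $4 ))"
}

shaved_size 1000 800 100 0      # -shave 100x0   -> prints 800x800
shaved_size 1000 800 0 100      # -shave 0x100   -> prints 1000x600
shaved_size 1000 800 100 100    # -shave 100x100 -> prints 800x600
```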

The single page with its top and bottom edges removed:

convert escape_from_evil_frame_ex.png -shave 0x50 shaved_0x50.png

Finally, one more example. Say we have the following screencap containing two facing pages:

Let's use the -shave flag to remove extraneous areas from both the top and bottom, left and right to make the image more amenable to OCR.

convert two_page.png -shave 80x50 two_page_shaved_80x50.png

This command shaves 80 pixels from both the left and right as well as 50 pixels from the top and bottom leaving us with the following image:

To run any of the above commands on all the images in a directory, simply invoke convert with a wildcard. For example:

convert *.png -configFlag outputFilename.png

ImageMagick will automatically number the output files: outputFilename-0.png, outputFilename-1.png...
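The wildcard form numbers the outputs rather than keeping the original file names. If you want each page to keep its name, one alternative is a plain shell loop that calls convert once per file. The following is only a sketch: the function name shave_all and the directory names in the usage note are my own placeholders, and it uses -shave where any of the options above could go.

```shell
# Hypothetical helper: shave every PNG in a directory, one file at a time,
# writing each result under its original name into a separate output directory.
# The body runs in a subshell ( ... ) so its variables don't leak out.
shave_all() (
    # $1 = input directory, $2 = output directory, $3 = shave geometry (e.g. 80x50)
    mkdir -p "$2" || exit 1
    for f in "$1"/*.png; do
        [ -e "$f" ] || continue    # skip the literal pattern if nothing matched
        convert "$f" -shave "$3" "$2/$(basename "$f")"
    done
)
```

For example, shave_all scans shaved 80x50 would read every PNG in a directory named scans and write the shaved copies to a directory named shaved. The variables are quoted so that file names containing spaces, like the Korean book title above, survive intact.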

Sunday, June 1, 2014

Overuse of English terminology in English-to-Korean technical translations

Recently I've been doing a technical translation project in the civil engineering and light rail sector. Those of you who have experience doing technical translation will know that the biggest challenge in such projects is dealing with specialized terminology. In the engineering field, common terminology might include terms such as

torque, leverage, pivot...

while in light rail projects common terms include

guideway, M&SF (Maintenance and Storage Facility), PSD (Platform Screen Door)...

and so on.

One problem I have noticed is that field engineers oftentimes transliterate these English terms into Korean, i.e. torque would become 토크 (which sounds like toe-keu for those of you who can't read Hangul). If you are an engineer who is familiar with the English term torque, then maybe transliteration (phonetic decomposition into another language) isn't such a big deal. But the problem is that engineering spec sheets and design review documents contain hundreds of such terms, and in many cases translators simply transliterate these into Korean without trying to find an appropriate term in Sino-Korean. As the final end-users of technical specification sheets include non-engineering personnel, naive transliteration that fails to convey meaning creates the risk of misunderstanding.

Looking up torque in Google using the search term: "define torque" returns:

1. MECHANICS - a twisting force that tends to cause rotation.

The Sino-Korean for this term is 회전력(回轉力) which literally means rotation(回轉) power(力).

In fact, most science and engineering terminology in Asia has Traditional Chinese character (aka hanja in Korean, kanji in Japanese) equivalents. The advantage of Chinese ideographs is that the actual principle explaining a word is contained in its symbolic representation. Even if you have no idea what torque means, if you look at its ideograph 回轉力 you can at least know that it has something to do with rotation. The ideograph is certainly more informative than the transliteration toe-keu (토크).

In Korea, there has been a systematic move away from the use of Chinese ideographs in favor of Hangul (native Korean script) over the past century, and over the past 50 years or so, English transliteration into Hangul has become rampant.

But first, a historical detour into the background of Sino-Korean's decline in the Korean language. In 1894~1895, Korea's teetering Chosun dynasty passed the Second Gabo Reform (갑오개혁 甲午改革) which eliminated the civil service exam system (과거제도 科擧制度 ) based on the Chinese Classics while mandating the use of Hangul in lieu of Chinese characters in official documents. Fast forward 15 years to 1910, when Japan annexed Korea -- according to Japanese colonial gov't statistics (link in Korean), more than 90% of the Korean population could read neither Chinese nor Korean characters. (Note: the statistics from the colonial Japanese gov't should be taken with a grain of salt, as it was in their interest to discredit the natives as much as possible to justify their annexation) After Japan's defeat in WWII and Syngman Rhee's rise to power in South Korea, the illiteracy rate stood at 78% (link in Korean). Let me note that high illiteracy rates were common even in Western pre-industrial societies.

With the rise of nationalism in Korea's North and South, there was a movement to use pure Hangul free of Chinese characters. One problem with this plan, however, is that 70% of Korean originates from Chinese characters. The DPRK's solution was to remove many Sino-Korean words from the popular lexicon and replace them with pure Korean, a task at which they've been quite successful (one drawback, however, is that after 60 years of national division, people from the North and South sometimes have trouble communicating). South Korea, in contrast, pushed ahead with the Hangul-ization of school curricula, government publications, newspapers, etc. without removing Sino-Korean vocabulary from wide use.

This is problematic for one big reason -- different Chinese characters have lots of homophones in Korean. For example, the Chinese characters 鎭痛 (zhèn tòng, pain relief) and 陣痛 (zhèntòng, labor pains) have slightly different pronunciations in Chinese:

If you listen carefully to the Chinese pronunciation, you will notice that 鎭 has more stress on the first syllable than does 陣. Also the waveform display in the Soundcloud link above shows that the two are distinct. In Korean, however, the pronunciation for both 鎭痛 and 陣痛 is 진통 (jin-tong). Because these two different words are homophones, the only way you can know they mean different things is by using Chinese characters to disambiguate or by guessing from the context. But guessing from context is not foolproof. If you are a doctor in a busy hospital and you hear 'jin-tong' (진통), you might be confused as to whether it's referring to labor pains or the need for painkiller.

There are tens of thousands of such homophones in Korean and the meaning often cannot be gleaned from context alone. In conservative circles, there is much lament (link in Korean) about the Korean language becoming a kind of Ebonics in which contemporary meaning and usage have been divorced from their linguistic roots.

I am no stickler for Korean tradition, but these critics have a point. Chinese ideographs can coexist with Hangul script for disambiguation purposes and provide more meaning than simple transliteration alone. Because of the growing role of English loan-words in Korean, reading comprehension is taking a hit. If Koreans were fluent English-speakers, then using such loan-words wouldn't matter, but I've noticed that my fellow translators often have no understanding of the technical material they're translating from English. This is exacerbated by the fact that most interpreters/translators in Korea are women, the vast majority of whom major in the humanities.

I believe this problem could be ameliorated somewhat if English-to-Korean technical translators used more Sino-Korean terminology, which includes semantics in their symbolic representation. Then translators could at least get a glimpse at the principles behind the terms they're translating.

Some English engineering terms, their naive transliterations into Korean, and Sino-Korean equivalents:

English | Transliterated | Sino-Korean
torque | 토크 | 회전력(回轉力), 선전력(旋轉力)
actuator | 엑추에이터 | 작동기(作動機)
leverage | 레버리지 | 공간력(槓杆力)
buffer | 버퍼 | 완충장치(緩衝裝置)
detailing | 디테일 | 세부장식(細部裝飾)