Wednesday, June 2, 2010

Google Languages expands again

As long-time readers will certainly know by now, I keep rather a close eye on the offerings at Google Language Tools. Well, those industrious little devils are at it again, with several new languages offered at the “alpha” stage (i.e., likely to be riddled with risible errors). You won’t see these on the main Language Tools page yet; you have to reach the translation results page to get these. I’m not sure why, unless it’s to keep them a bit lower-profile. Anyway, here’s what’s new:
  • Armenian
  • Azerbaijani
  • Basque
  • Georgian
  • Urdu
For a total of 56 languages now available! India and Africa are still grossly underrepresented. For heaven’s sake, Basque is spoken by just a little more than half a million people, whereas Punjabi is spoken by somewhere between 50 and 100 million. Punjabi is spoken by so many people that estimates can’t keep up*; the Basques can probably count just about every speaker individually! In Africa, Yoruba, Igbo, and Zulu together account for perhaps 60 million speakers, outnumbering speakers of Irish by about 100 to 1. And never mind that adding Urdu is sort of cheating, since they already have Hindi. :)

Ah, well, progress is progress. And of course, when they do finally add Yoruba or Gujarati or even Kyrgyz, I’ll probably be one of the first to know, which means you’ll be among the first to know!

* Of course, most people in India also speak English, which I imagine to be one reason Google is in no particular hurry to accommodate them in their cradle-tongues.

27 comments:

  1. When they add Quenya and Sindarin, it will be great!

  2. Bilingual corpora, it's all about bilingual corpora. Find 'em and tell Google about 'em, that's what it takes.

  3. EuroFox, I don’t know about translation, but it’s probably not long before the Google website itself is available in one or both. It’s already available in Klingon. :)

    John, yes, that’s how they do it. But it’s hard to imagine that there are insufficient corpora for languages spoken by 100–200x as many people as some of the languages they offer. Google hardly needs you or me to find them a set of parallel English-Punjabi texts. There must be other reasons they’ve put that off.

    They do offer a useful transliteration tool, which supports an impressive array of scripts, covering Amharic, Arabic, Bengali, Greek, Gujarati, Hebrew, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Persian, Punjabi, Russian, Sanskrit, Serbian, Sinhalese, Tamil, Telugu, Tigrinya, and Urdu. But of course, this isn’t a translation tool.

  4. David Doughan 6/03/2010 3:58 PM

    That transliteration tool is quite impressive - I've tried it on (Modern) Greek, Russian and Urdu, and it appears to work quite well, once you get used to its ways. Now what about Serbian? Not to mention Ancient Greek ...

    Quenya / Sindarin: nothing! I'll be impressed when Google does Danian and Doriathrin.

  5. They could probably support polytonic Greek, but it requires a particularly rich character set which not all users would have installed. But take a look at this tool, created by my friend Randy Hoyt.

    And personally, I’d like to see them support Tibetan and the Nasta’līq style of Arabic — to me, two of the most beautiful writing systems ever devised. They could stand to offer Thai as well.

  6. David Doughan 6/03/2010 4:43 PM

    Well, most Urdu (and Farsi) is written in a variety of Nasta'liq - but I do agree that it's very beautiful.

    I notice that the transliteration tool does not include Turkish - and this is a difficulty with varieties of a script peculiar to one language (there's a similar, but lesser, problem with Hungarian). OK, nothing's perfect ....

  7. Well, most Urdu (and Farsi) is written in a variety of Nasta’liq [...]

    Yes, but not online. Online, it’s normally standard Arabic. That’s like Century Gothic compared to Fraktur. Pity.

    I notice that the transliteration tool does not include Turkish [...]

    Hasn’t Turkish been written in the Latin alphabet (with just a few special characters) since the interbellum period? For the Turkish characters in the Latin alphabet (as with Hungarian), most of these are available in the better Unicode character sets. That goes for the double-acute accent, dotless i, and so on.
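A minimal sketch of that last point, assuming Python's standard `unicodedata` module. The sample letters (dotless i, dotted capital I, and the Hungarian double-acute vowels) are chosen here purely for illustration, and the helper name `describe` is hypothetical. It only confirms that these letters have their own Unicode code points; it says nothing about what any transliteration tool supports.

```python
import unicodedata

# Sample letters picked for illustration (written as escapes to avoid
# any ambiguity about composed vs. decomposed forms):
# Turkish dotless i and dotted capital I, Hungarian o/u with double acute.
SAMPLE_LETTERS = ["\u0131", "\u0130", "\u0151", "\u0171"]

def describe(letters):
    """Return (letter, code point, official Unicode name) for each letter."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in letters]

for ch, code_point, name in describe(SAMPLE_LETTERS):
    print(ch, code_point, name)
```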

  8. David Doughan 6/04/2010 2:07 AM

    Point taken about nasta'liq. As for Turkish, etc, my trouble is that I'm lazy. I was spoiled 20-odd years ago by Locoscript 2, which not only provided a very wide array of (European) characters, including such really ancient Greek letters as digamma and koppa, but had floating diacritics which could be attached to any letter. Unfortunately it was based on a pretty unsophisticated OS, and (among much else) any sort of internet access was out.

  9. I've tried it on Modern Greek as well. Overall, it seems like a good and useful tool, but I think it could be a little more intuitive and flexible.
    If you write, for instance, petheno instead of pethainw, it comes out as πεθενο. In this respect, All Greek to Me is better. :)

  10. Hi, Eva. I’m not familiar with that one. What’s the URL for the tool? (That’s if it’s an online tool at all.) And David, I’m not familiar with Locoscript 2, but it sounds like it was very useful in its day. With its flexibility, I bet it was good for conlanging too! I did a lot of that 20 years ago; wish I’d known about this!

  11. It is.
    http://speech.ilsp.gr/greeklish/greeklishdemo.asp

    You can type the words in any fashion, the system copes with (almost) any type of Greeklish. :)

  12. Personally, I'd like to see a Glagolitic converter. :)

  13. The Spanish on Google Translate is horrible. I was trying to translate hymns into Spanish, typed in "As the deer pants for the water", and it came back with "As the deer PANTS (trousers) for the water". Similar things have happened with other phrases and the like too; it just has no recognition of context... grrr. I just don't trust Google Translate.

  14. Hi, Lilly. I’d have expected better for Spanish! Most of the times I’ve used it, I’ve gotten decent results, but that’s a pretty funny error you mention. Totally understandable how it occurs, of course.

    Speaking as a computer programmer for a moment, I can say it’s quite challenging to analyze textual input in multiple languages (especially single, short sentences) in order to correctly parse its grammatical structure. Looking at the sample you gave, how would Google “know” that pants is a verb here, not a noun? Come to that, can you answer how you know it’s a verb? :) We have all the rules internalized, but for Google to translate properly, it has to build those rules formally — not such an easy thing to do — and the problem is compounded by the fact of having to “understand” the grammatical structure of more than 50 languages.

    But yes, you can’t trust it for anything serious, like translating for publication. I’m amazed it works as well as it does, to be honest. :)

  15. Yeah, I mean the only reason I knew it was wrong was because I have studied Spanish a bit, and when I saw the word PANTALONES I knew it was having issues. In order to get the context right I had to put in "As the deer thirsts for the water", but then the song is not the same... ah well. I know pants is a verb because the noun after it (water) indicates that it is. I mean, you could go into formal sentence structure and everything, but the root of the problem is that the Google Translate program doesn't have a human brain: it saw the word "pants" and came up with "pantalones". There are just too many words in English that have multiple meanings depending on function.

  16. There are just too many words in English that have multiple meanings depending on function.

    Yes, the problem is especially thorny with English. I imagine it could be equally (maybe even more) difficult with Chinese, a predominantly monosyllabic language where tone carries a lot of the meaning.

  17. I'm not a specialist, but I'm positive that the issue goes far beyond formalized rules. You know 'pants' is a verb or a noun or whatever because, after you've exhausted all the grammar rules at your disposal, one of the possible constructions makes some kind of sense to you - so you unconsciously discard the rest, even if they were grammatical. If more than one sense is possible, you take the one that comes first to your mind (depending on a variety of factors), and maybe you catch more than one and appreciate the pun. You can even prefer one particular ungrammatical construction because the sense is better than the one you get going by the rules. But the translation engine can only resort to quantifiable criteria (such as word order, frequency of some combinations, explicit agreement of accidence markers, etc.), and you can't measure sense a priori.

    I am looking at the Spanish translation of this blog, and the thing that will give you (Jason) a laugh is that, according to Google, you appear to be a developer of reluctant software ("desarrollador de software reacios" - I'm not sure why it uses the plural). You may agree with them or not; you may wish to add a hyphen to help these guys and avoid misunderstandings. However, AFAIK that was a perfectly grammatical construction in English that can only be interpreted if you know what the author is talking about (or even if you know the author).

  18. Hlaford, good points. Yes, it’s more than just formalized rules. So far as I know, we still lack a complete picture of how languages “work” in the human brain. But having said that, in spite of its sometimes humorous errors, Google translation is still better than anything I’ve seen before.

    As for why Google used the plural adjective reacios, I’d guess it’s because it is treating the loan-word software as a plural. Even in English, software is one of those nouns of indeterminate number; it’s not really singular, not really plural.

    For a real laugh, check out this Tolkien blog, which is apparently machine-translated from Québécois. You can sort of follow the intended meaning, but it reads like English As She Is Spoke. :)

  19. Ah Jason, there is no problem with written Chinese. The characters portray the different meanings... there is only a problem with spoken Chinese. Therefore Google Translate should hopefully not have trouble with Chinese (I don't read Chinese, so I can't really test it). As I understand it, the characters are always put together differently depending on context; they are made up of simple strokes which are then put together with other strokes to make different characters. So you can have the same strokes and put them together differently to get different meanings. I think that should be relatively easy for Google Translate.

  20. Simplified or traditional Chinese ideographs are unambiguous, yes, but what about Chinese written in Pinyin? Perhaps that’s unambiguous too, since the tone can be marked with diacriticals, yes? I’m not particularly knowledgeable about Chinese.

  21. Harm J. Schelhaas 6/13/2010 3:08 PM

    Well, not that I'm well up in Chinese, but I have once made some study of its romanisations and its phonology (since the two are closely related), and there are many words, or syllables, that would be written the same in pinyin (or romanisations in general) — and would be pronounced the same, since there is a clear relation between romanisation and pronunciation — but would be written with different characters — either in simplified or in traditional or both. Even so, there are many characters that have more than one meaning (sometimes related, sometimes not at all), which depend upon context to be correctly interpreted.

    Oh, and technically, only a small number of Chinese characters are ideographs. Logographs seems to be the more accurate general term (I know Unicode calls them ideographs; Unicode is wrong).

  22. Modern Mandarin is no longer even approximately monosyllabic, due to the wholesale loss of distinctions between Middle and Modern Chinese, and written Chinese has been based on the modern Mandarin standard for about a century now. The characters are really neither ideographic nor logographic, but morpho-syllabic; each character represents a single syllable and a single morpheme, all three concepts being called 字 (zì) in Chinese. (There is no ordinary word for 'word' in Chinese, only the technical term 词 (cí).)

    Standard Pinyin writing includes the tone marks, but the ambiguity resulting from omitting them isn't actually very large at the level of sentences, other than ones intentionally constructed to expose an ambiguity, like "time flies like an arrow/banana" in English. For one thing, Mandarin songs discard all tone information, and they are still straightforward to understand.
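To make the tone-mark point concrete, here is a minimal sketch, assuming Python's standard `unicodedata` module; the function name `strip_tones` and the sample syllables are chosen here for illustration only. It shows how four differently toned syllables collapse to one spelling once the marks are dropped. It does not measure how much ambiguity that causes in whole sentences, which is the claim above.

```python
import unicodedata

# Combining marks used for the four Mandarin tones in Pinyin:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030C", "\u0300"}

def strip_tones(pinyin: str) -> str:
    """Remove Pinyin tone marks, keeping other diacritics such as the dots on u-umlaut."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recombines whatever marks remain (e.g. u + diaeresis).
    return unicodedata.normalize("NFC", kept)

# "ma" in the first, second, third, and fourth tone, written as escapes.
syllables = ["m\u0101", "m\u00e1", "m\u01ce", "m\u00e0"]
print([strip_tones(s) for s in syllables])  # ['ma', 'ma', 'ma', 'ma']
```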

  23. Thanks, Harm and John, for your further thoughts on the subject — all new information to me. I didn’t know that the tones in Mandarin are dropped in song. It makes sense, though, since it’s hard to imagine how tone wouldn’t interfere with melody.

  24. Cantonese pop, though, does require the melodic contour to match the tone contour, which as you can imagine makes it hard to write: no arm's-length lyricists and composers, like Gilbert and Sullivan. Cantonese has underlyingly 9-11 tones, but the collapse of old distinctions means they are realized as only 6 distinct tones on the surface.

  25. Interesting! I could see it being difficult, but not impossible, to write original songs in Cantonese; to translate existing songs into the language, however, must be practically impossible. At least, if the goal is an accurate translation and retention of the original melody. Know anything about this, John? Are Western songs translated into Cantonese at all? And if they are, how do they go about it?

  26. http://Glagolitic.com/ has a Glagolitic converter
