Google’s translation service is in the news in India for the wrong reasons. The Union Public Service Commission (UPSC), which conducts the civil services examinations, apparently uses the free Google Translate service to translate most of the questions in the Civil Services Aptitude Test (CSAT), part of the preliminary exam. Many exam takers blame the poor English-to-Hindi translation for making the CSAT insurmountable for them.
Obviously, UPSC needs to fix the translation. It could consider using professional translators instead of an algorithm-based service such as Google’s. That said, Google has, on the whole, improved considerably on the translation front from where it began. In his book Planet Google, Randall Stross gives a fascinating account of how Google cracked the machine translation problem, long a bugbear of computing.

Stross begins by saying that machine translation has a long tradition of overpromising and underdelivering. Given Cold War priorities, Russian-to-English translation of documents was the researchers’ initial focus. But word-for-word matching had its limitations, including the famous ‘water goat’ problem, a reference to how computers frequently rendered the term ‘hydraulic ram’. Researchers thought all they had to do was add syntactical rules to word-for-word matching and refine the process until translation was fixed. This certainly improved the quality of translations, and commercial providers of such services, including Systran, soon entered the field.

But Stross notes that this rules-based methodology was only one approach to machine translation. An alternative, known as statistical machine translation, was advanced by researchers at IBM in the 1970s. It was based not on linguistic rules drawn up manually by humans, but on a translation model that the software develops on its own as it is fed millions of paired documents: an original and a translation done by a human translator.

GOOGLE MADE USE OF IBM RESEARCH

Historically, IBM has been known as a company with such a vast bureaucracy that many divisions are unaware of the findings and research advances of other divisions in the same organization. It often falls to others to make the most of the research advances made at IBM.
For instance, Oracle was founded after Larry Ellison was alerted to the potential of an obscure IBM research paper on relational databases.

Google made its tentative foray into translation in 2003 by hiring a small group of researchers and giving them free rein to tackle the problem. As might be expected, they soon saw the potential of statistical machine translation. In this model, says Stross, “the software looks for patterns, comparing the words and phrases, beginning with the first sentence in the first page of Language A, and its corresponding sentence in Language B. Nothing much can be deduced by comparing a single pair of documents. But compare millions of paired documents, and highly predictable patterns can be discerned…”

So the task before the Google team was one of scale. To fix the translation problem, they needed millions of paired documents. Stross says the Google engineers solved it by obtaining a corpus of 200 billion words from the United Nations, where every speech made in the General Assembly, as well as every document produced, is translated into five other languages. “The results were revelatory,” says Stross. “Without being able to read Chinese characters or Arabic script, without knowing anything at all about Chinese or Arabic morphology, semantics, or syntax, Google’s English-language programmers came up with a self-teaching algorithm that could produce accurate, and sometimes astoundingly fluid, translations.”

Google soon went to town with its achievement. At a briefing in May 2005, it displayed two translations of a headline from an Arabic newspaper side by side: Systran’s and its own. The Systran translation read ‘Alpine white new presence tape registered for coffee confirms Laden’. It was sheer nonsense. Google rendered the same headline as ‘The White House confirmed the existence of a new Bin Laden Tape’. Pretty impressive!

Google didn’t stop there.
It entered its translation service in the annual competition for machine-translation software run by the National Institute of Standards and Technology in the United States. Google came first in both Arabic-to-English and Chinese-to-English, leaving Systran far behind. Google repeated its feat in 2006, coming first in Arabic and second in Chinese.

Stross says a stupefied Dimitris Sabatakakis, the CEO of Systran, could not grasp how Google’s statistical approach could outsmart his company, which had been in the machine translation business since 1968 and had initially even powered Google’s translation efforts. At Systran, “if we don’t have some Chinese guys, our system may contain some enormous mistakes”, he was quoted as saying. Stross says he could not understand how Google, without Chinese speakers double-checking the translation, had beaten Systran so soundly. Incidentally, Google hasn’t taken part in the competition since 2008, perhaps because it found there was nothing left to prove.

FROM MONOLINGUAL TO BILINGUAL

Stross’ description of how Google built a monolingual language model is also a fascinating read. While the bilingual translation model maps one language to another, the monolingual language model uses software to fluently rephrase whatever the translation model produces. In other words, it polishes the language after the text has already been translated from another. How did Google manage this? Randall Stross has an answer. “The algorithm taught itself to recognize what was the natural phrasing in English by looking for patterns in large quantities of professionally written and edited documents. Google happened to have ready access to one such collection on its servers: the stories indexed by Google News.” Stross says that “even though Google News users were directed to the Web sites of news organizations, Google stored copies of the stories to feed its news algorithm.
Serendipitously, this repository of professionally polished text, 50 billion words that Google had collected by April 2007, was a handy training corpus perfectly suited to teach the machine translation algorithm how to render English smoothly.”

So Google Translate may not be perfect. But it is constantly getting better, using software that teaches itself to read patterns by looking at a large volume of data. “Google did not claim to have the most sophisticated translation algorithms, but it did have something that other machine-translation teams lacked: the largest body of training data.” As Franz Och, the engineer who led (and still leads) Google Translate, said, “There’s a famous saying in the natural [language] processing field, ‘More data is better data’.” Indeed. Data has helped Google to prevail as the leader in yet another segment of search.
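To make the two models described above concrete, here is a minimal Python sketch under loudly stated assumptions. It is not Google’s code, nor anything taken from Stross’s book: the toy German-English sentence pairs, the tiny English corpus, and the function names (`train_word_translation`, `best_translation`, `train_bigram_model`, `fluency`) are all invented for illustration. The first part learns word translation probabilities from paired sentences with a few rounds of expectation-maximization, in the style of IBM’s Model 1. The second part builds an add-one-smoothed bigram language model from monolingual text and uses it to score how natural a candidate phrasing sounds.

```python
import math
from collections import defaultdict

# Hypothetical toy data, invented for this sketch (real systems use millions of pairs).
PAIRS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
ENGLISH = [
    "the white house confirmed the tape".split(),
    "a new tape".split(),
    "the white house".split(),
]


def train_word_translation(pairs, iterations=10):
    """Learn P(target_word | source_word) from sentence pairs via EM.

    A simplified IBM Model 1 (no NULL word); a toy, not a production system.
    """
    source_vocab = {s for src, _ in pairs for s in src}
    target_vocab = {t for _, tgt in pairs for t in tgt}
    # Start with no knowledge: every target word is equally likely.
    prob = {(t, s): 1.0 / len(target_vocab)
            for s in source_vocab for t in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words in its sentence.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # M-step: turn the expected counts back into probabilities.
        for (t, s) in prob:
            prob[(t, s)] = count[(t, s)] / total[s]
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


def train_bigram_model(sentences):
    """Count word pairs in monolingual text; returns (unigrams, bigrams, vocab_size)."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {w for words in sentences for w in words} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def fluency(words, model):
    """Log-probability of a phrasing under the bigram model (add-one smoothing)."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bigrams.get((prev, cur), 0) + 1)
                 / (unigrams.get(prev, 0) + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


if __name__ == "__main__":
    table = train_word_translation(PAIRS)
    print(best_translation(table, "haus"))
    model = train_bigram_model(ENGLISH)
    for candidate in ("the white house", "the house white"):
        print(candidate, fluency(candidate.split(), model))
```

With only three sentence pairs the first model can already untangle which word goes with which, because ‘das’ turns up alongside ‘the’ in two different sentences. The second model prefers ‘the white house’ to ‘the house white’ simply because it has seen the former word order in its English text. Both effects depend entirely on how much text the models are fed, which is the point about scale that Stross makes.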
Author: I'm Georgy S. Thomas, the chief SEO architect of SEOsamraat. The Searchable site will track interesting developments in the world of Search Engine Optimization, both in India and abroad.