How Google Became a Translator Pro

4/8/2014

Google’s translation service is in the news in India for the wrong reasons. The Union Public Service Commission (UPSC), which conducts the civil services examinations, reportedly uses the free Google Translate service to translate most of the questions in the Civil Services Aptitude Test, or CSAT, for the preliminary exam. Many exam takers blame the poor English-to-Hindi translation for making the CSAT insurmountable for them.

Obviously, UPSC needs to fix its translations. It could use professional translators instead of an algorithm-based service like Google’s. That said, Google’s translation quality has improved considerably from where it began. In his book Planet Google, Randall Stross gives a fascinating account of how Google cracked machine translation, long a bugbear of computing.

Stross begins by noting that machine translation has a long tradition of overpromising and underdelivering. Given Cold War priorities, researchers initially focused on translating Russian documents into English. But word-for-word matching had its limitations, including the famous ‘water goat’ problem, named for how computers frequently translated the term ‘hydraulic ram’.

Researchers thought all they had to do was add syntactical rules to word-for-word matching and refine the process until translation was solved. The rules certainly improved the quality of translations, and commercial providers of such services, including Systran, soon entered the field. But Stross notes that this rules-based methodology was only one approach to machine translation. An alternative, known as statistical machine translation, was advanced by researchers at IBM in the 1970s. It was based not on linguistic rules drawn up manually by humans, but on a translation model that the software develops on its own as it is fed millions of paired documents: an original and a translation done by a human translator.

GOOGLE MADE USE OF IBM RESEARCH

Historically, IBM is known as a company with such a vast bureaucracy that many divisions do not know the findings and research advances of other divisions in the same organization. It often falls on others to make the most of the research advances made at IBM. For instance, Oracle was formed after Larry Ellison was alerted to the potential of an obscure research paper published at IBM about relational databases.

Google made its tentative foray into translation in 2003 by hiring a small group of researchers and giving them free rein to tackle the problem. Unsurprisingly, they soon saw the potential of statistical machine translation. In this model, says Stross, “the software looks for patterns, comparing the words and phrases, beginning with the first sentence in the first page of Language A, and its corresponding sentence in Language B. Nothing much can be deduced by comparing a single pair of documents. But compare millions of paired documents, and highly predictable patterns can be discerned…”
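To make the idea of learning from paired text concrete, here is a minimal sketch in Python. It uses an expectation-maximization procedure in the style of IBM Model 1, a classic technique from the statistical machine translation literature; it is not taken from Stross's book and is not a description of Google's actual system. The three-sentence German-English corpus and the names `PAIRS`, `train_model1` and `best_translation` are assumptions made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical data): each pair is a source sentence
# and its human translation.
PAIRS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def train_model1(pairs, iterations=20):
    """Estimate word-translation probabilities t(target | source) with EM."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in corpus for word in tgt}
    # Start from uniform probabilities: nothing is known about either language.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) pairings
        total = defaultdict(float)  # expected pairings per source word
        for src, tgt in corpus:
            for tw in tgt:
                # Share each target word among the source words of the same
                # sentence pair, in proportion to the current estimates.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # Re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return dict(t)


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_model1(PAIRS)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(model, word))
```

No dictionary or grammar rule appears anywhere in the code. The pairing of 'haus' with 'house' emerges only because 'das' is better explained by 'the', which shows up in both sentences containing 'das'. With three sentence pairs this is a toy; the same counting logic needs a very large corpus before the patterns become reliable.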

So the task before the Google team was one of scale: to fix the translation problem, they needed millions of paired documents. Stross says the Google engineers solved it by obtaining a corpus of 200 billion words from the United Nations, where every speech made in the General Assembly, as well as every document produced, is translated into five other languages. “The results were revelatory,” says Stross. “Without being able to read Chinese characters or Arabic script, without knowing anything at all about Chinese or Arabic morphology, semantics, or syntax, Google’s English-language programmers came up with a self-teaching algorithm that could produce accurate, and sometimes astoundingly fluid, translations.”

Google soon went to town with its achievement. At a briefing in May 2005, it showed two translations of a headline from an Arabic newspaper side by side: Systran’s and its own. The Systran translation read ‘Alpine white new presence tape registered for coffee confirms Laden’. It was sheer nonsense. The Google translation rendered it as ‘The White House confirmed the existence of a new Bin Laden Tape’. Pretty impressive!

Google didn’t stop there. It entered its translation service in the annual competition for machine-translation software run by the National Institute of Standards and Technology in the United States. Google came first in both Arabic-to-English and Chinese-to-English, leaving Systran far behind. Google repeated the feat in 2006, coming first in Arabic and second in Chinese. Stross says a stupefied Dimitris Sabatakakis, the CEO of Systran, could not grasp how Google’s statistical approach could outsmart his company, which had been in the machine translation business since 1968 and had initially even powered Google’s translation efforts.

At Systran, “if we don’t have some Chinese guys, our system may contain some enormous mistakes”, he was quoted as saying. Stross says Sabatakakis could not understand how Google, without Chinese speakers double-checking the translations, had beaten Systran so soundly. Incidentally, Google hasn’t taken part in the competition since 2008, perhaps because it felt there was nothing left to prove.

FROM MONOLINGUAL TO BILINGUAL

Stross’ description of how Google built a monolingual language model is also a fascinating read. The bilingual translation model handles the conversion from one language to another; the monolingual language model then uses software to fluently rephrase whatever the translation model produced. In other words, it polishes the output language after the text has already been translated.

How did Google manage this? Randall Stross has an answer. “The algorithm taught itself to recognize what was the natural phrasing in English by looking for patterns in large quantities of professionally written and edited documents. Google happened to have ready access to one such collection on its servers —the stories indexed by Google News.”

Stross says that “even though Google News users were directed to the Web sites of news organizations, Google stored copies of the stories to feed its news algorithm. Serendipitously, this repository of professionally polished text, 50 billion words that Google had collected by April 2007, was a handy training corpus perfectly suited to teach the machine translation algorithm how to render English smoothly.”
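The role of the monolingual model can be sketched in a few lines of Python. The example below assumes a simple bigram language model with add-one smoothing, a standard textbook technique chosen here for brevity; the source does not say which model Google used. The tiny `NEWS_TEXT` corpus stands in for the news archive, and the names `train_bigrams`, `log_score` and `most_fluent` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stand-in for a large corpus of professionally edited English.
NEWS_TEXT = [
    "the white house confirmed the report",
    "the white house confirmed the existence of a new tape",
    "officials confirmed the existence of a new recording",
]


def train_bigrams(sentences):
    """Count how often each word follows another in the training text."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])  # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)


def log_score(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    score = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen word pairs from scoring zero.
        score += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return score


def most_fluent(candidates, model):
    """Pick the candidate phrasing the language model finds most natural."""
    return max(candidates, key=lambda text: log_score(text, *model))


if __name__ == "__main__":
    model = train_bigrams(NEWS_TEXT)
    candidates = [
        "white the house confirmed existence the of new a tape",
        "the white house confirmed the existence of a new tape",
    ]
    print(most_fluent(candidates, model))
```

Both candidates contain exactly the same words, so only the word order separates them. The model prefers the second because its word sequences resemble the edited text it was trained on, which is the sense in which a monolingual corpus can teach software to render English smoothly.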

So Google Translate may not be perfect. But it is constantly getting better, using software that teaches itself to read patterns by looking at large volumes of data. As Stross puts it, “Google did not claim to have the most sophisticated translation algorithms, but it did have something that other machine-translation teams lacked: the largest body of training data.” As Franz Och, the engineer who led (and still leads) Google Translate, said, “There’s a famous saying in the natural language processing field, ‘More data is better data’.” Indeed. Data has helped Google prevail as the leader in yet another segment of search.


    Author

    I'm Georgy S. Thomas, the chief SEO architect of SEOsamraat. The Searchable site will track interesting developments in the world of Search Engine Optimization, both in India and abroad.

Copyright © 2022 Proseperity
Photos used under Creative Commons from futureshape, a4gpa, taymtaym, Esparta