How Translate.com’s Website Localizer Parses Text on a Web Page

The goal of text parsing in Website Localizer (WL) is to group as many related phrases as possible into a single phrase, while still detecting shared elements such as headers and footers granularly enough that they can be reused across multiple pages. To accomplish this, we look for text grouped within specific HTML tags that typically signal that everything inside belongs to the same phrase and translation. For example, if we see a <section> tag with many <p> tags inside it, we group all of those <p> tags as one phrase. When human translations are ordered, all of the text in the <section> is translated together, which increases context and accuracy. We parse everything inside the following HTML tags as one phrase:

 

  1. <table>
  2. <ul>
  3. <li>
  4. <article>
  5. <section>
  6. <summary>
  7. <figcaption>
  8. <p>
  9. <h1> through <h6>
  10. <a>
  11. Any other text not in one of the above tags

 

When one of these tags is nested inside another, we use the outermost tag when making the phrase.
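The outermost-tag rule can be sketched as a top-down tree walk. This is only an illustration, not WL's actual code: the node shape (`{ tag, children }` or `{ text }`) and the names `GROUPING_TAGS`, `textOf`, and `collectPhrases` are assumptions made for the example, and a plain object tree stands in for the DOM.

```javascript
// Tags whose contents are treated as a single phrase (taken from the list above).
const GROUPING_TAGS = new Set([
  "table", "ul", "li", "article", "section", "summary", "figcaption",
  "p", "h1", "h2", "h3", "h4", "h5", "h6", "a",
]);

// Hypothetical node shape for this sketch (not a real DOM node):
//   { tag: "section", children: [...] }  or  { text: "Hello" }
function textOf(node) {
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(textOf).join(" ");
}

// Walk the tree top-down. The first grouping tag reached on any path is the
// outermost one, so its whole subtree becomes one phrase and is not descended into.
function collectPhrases(node, phrases = []) {
  if (node.text !== undefined) {
    // Text that is not inside any grouping tag becomes its own phrase.
    const text = node.text.trim();
    if (text !== "") phrases.push({ tag: null, text });
  } else if (GROUPING_TAGS.has(node.tag)) {
    phrases.push({ tag: node.tag, text: textOf(node) });
  } else {
    for (const child of node.children || []) collectPhrases(child, phrases);
  }
  return phrases;
}

const page = {
  tag: "body",
  children: [
    { tag: "section", children: [
      { tag: "p", children: [{ text: "First paragraph." }] },
      { tag: "p", children: [{ text: "Second paragraph." }] },
    ] },
    { tag: "div", children: [
      { tag: "h2", children: [{ text: "A heading" }] },
      { text: "Loose text" },
    ] },
  ],
};

const phrases = collectPhrases(page);
console.log(phrases);
```

In this sketch the two <p> tags collapse into a single <section> phrase, while the <h2> and the loose text inside the <div> each become their own phrase, because <div> is not in the list.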

 

If any text or HTML within a parsed phrase changes on the web page, we detect it as a brand new phrase, which requires a new translation. This means that if one <p> tag within a <section> changes, the entire <section> requires a new translation, because the change in that <p> tag could affect the context of the whole <section>.
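One way to picture this behavior is a translation store keyed by a fingerprint of the whole phrase, markup included. The sketch below is an assumption about how such a lookup could work, not a description of WL's internals: `fingerprint`, `translations`, and `lookup` are hypothetical names, and the hash is an FNV-1a-style stand-in.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units, used only as a stand-in fingerprint.
function fingerprint(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical translation store: fingerprint of the full phrase -> translation.
const translations = new Map();

function lookup(phraseHtml) {
  const key = fingerprint(phraseHtml);
  return translations.has(key) ? translations.get(key) : null;
}

const original =
  "<section><p>Free shipping.</p><p>Returns within 30 days.</p></section>";
translations.set(
  fingerprint(original),
  "<section><p>Envío gratis.</p><p>Devoluciones en 30 días.</p></section>"
);

// Editing a single <p> changes the fingerprint of the whole <section>.
const edited = original.replace("30 days", "60 days");

console.log(lookup(original) !== null); // the stored translation is found
console.log(lookup(edited));            // no match: treated as a new phrase
```

Because the key covers the entire phrase, there is no partial match: the unchanged "Free shipping." paragraph is re-translated along with the edited one.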

 

This can also cause what appear to be “duplicate” phrases when invalid HTML is corrected by some web browsers but not others. Chrome is very good at correcting invalid HTML when it renders the page. Other browsers correct less, or do not correct invalid HTML at all. As a result, the same sentence can be rendered with one set of HTML tags in Chrome and another set in Safari. Because the HTML tags differ, it is picked up as two phrases even though the text is the same. An easy way to find instances of this is to compare the source of a page in Chrome (right-click any open area of the page and choose View Page Source) with the HTML in Chrome’s Inspector (right-click any open area of the page and choose Inspect). Common corrections include HTML tags that were not closed properly, special characters such as & or © that do not use proper HTML entities, and ® symbols that are not wrapped in a <sup> tag.
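If you have a list of detected phrases, a rough way to spot these apparent duplicates is to group them by visible text and flag any text that appears with more than one set of markup. The sketch below assumes the phrases are available as HTML strings. `visibleText` and `findMarkupVariants` are hypothetical helpers, and the regex tag-stripping is a simplification rather than a real HTML parser.

```javascript
// Strip tags and collapse whitespace to approximate the text a visitor sees.
// A regex is enough for this diagnostic; it is not a full HTML parser.
function visibleText(html) {
  return html.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Group phrases by visible text and keep only texts with several markup variants.
function findMarkupVariants(phrases) {
  const byText = new Map();
  for (const html of phrases) {
    const text = visibleText(html);
    if (!byText.has(text)) byText.set(text, new Set());
    byText.get(text).add(html);
  }
  return [...byText]
    .filter(([, variants]) => variants.size > 1)
    .map(([text, variants]) => ({ text, variants: [...variants] }));
}

// Example input: the same sentence captured with and without a <sup> wrapper.
const detected = [
  "<p>Acme® Widgets</p>",
  "<p>Acme<sup>®</sup> Widgets</p>",
  "<p>Contact us</p>",
];

const suspects = findMarkupVariants(detected);
console.log(suspects);
```

Each entry in `suspects` points at a sentence whose markup probably needs to be made consistent in the page source.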

 

In addition to looking at the source text on the page, we listen for any changes made to the page via JavaScript. When we detect changes, we automatically parse the text in those changes using the same process as above.
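Browsers expose this kind of change notification through the standard `MutationObserver` API. Whether WL uses it is an assumption; the sketch below only shows the general pattern. `changedText` is a hypothetical helper that pulls the affected text out of a batch of mutation records, and the observer wiring runs only where a DOM exists.

```javascript
// Collect the text touched by a batch of mutation records.
// Accepts real MutationRecord objects or plain objects of the same shape.
function changedText(records) {
  const texts = [];
  for (const record of records) {
    if (record.type === "childList") {
      // Nodes inserted into the page, e.g. a banner added by a script.
      for (const node of record.addedNodes) {
        const text = (node.textContent || "").trim();
        if (text) texts.push(text);
      }
    } else if (record.type === "characterData") {
      // An existing text node whose content was edited in place.
      const text = (record.target.textContent || "").trim();
      if (text) texts.push(text);
    }
  }
  return texts;
}

// Browser-only wiring; skipped where MutationObserver does not exist (e.g. Node).
if (typeof MutationObserver !== "undefined" &&
    typeof document !== "undefined" && document.body) {
  const observer = new MutationObserver((records) => {
    // A real integration would re-run phrase parsing on the changed content here.
    console.log(changedText(records));
  });
  observer.observe(document.body, {
    childList: true,
    subtree: true,
    characterData: true,
  });
}
```

Observing with `subtree: true` covers changes anywhere under <body>, which matches the idea of watching the whole page rather than a single element.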



Before embedding the Translate.js code on your web page, work through this checklist to ensure the highest quality translations and prevent easily correctable issues.

 

  1. Make sure your HTML is 100% valid, so that browsers like Chrome have no need to correct it.
  2. Apply our notranslate tag to any section on the site that should not be translated. More details here.
  3. Wrap logical sections of content that should be grouped together as one phrase in the appropriate tag from the list above. The most common examples are:
    1. Ensuring navigation bars use a <ul>
    2. Wrapping <p> tags from a single article in <article> or tags
    3. Wrapping similar phrases of text that are not considered an article in <section> tags
  4. Populate your Glossary within your Translate.com account with any proper nouns or specific translations you already have.
  5. Then, embed the JavaScript code.

 

A quick note regarding iframes: Website Localizer is written in JavaScript and runs only in the page that includes it, so it cannot parse text inside an iframe. If your site uses frames, be sure to include the WL JavaScript in the page loaded inside the iframe as well.
