Website Localizer (WL) uses text parsing to pull text into the Translate.com portal. The goal of our text parsing in WL is to group as many related phrases together into a single phrase, while still detecting shared elements like headers and footers granularly enough to be re-used (therefore translated only once) across multiple pages. This helps our translators with context for consistency and accuracy. To accomplish this, we look for text grouped within specific HTML tags that typically signify all the text inside should be part of the same phrase/translation. For example, if we see a <section> tag with many <p> tags inside of it, we will group all the <p> tags inside the <section> tag as one phrase. When human translations are ordered, all of the text in the <section> will be translated together for better, more accurate translations. Here are the HTML tags for which all the text will be considered as one phrase:
- <h1> through <h6>
- Any other text not in one of the above tags
When one of these tags is nested within another, we use the outermost tag to determine what makes up the phrase.
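As a rough illustration of the outermost-tag rule, the sketch below groups text with Python's standard `html.parser`. It is not WL's actual parser: the `GROUPING_TAGS` set, the `PhraseGrouper` class, and the `group_phrases` helper are hypothetical names chosen for this example, and the tag set is only an assumption based on the tags mentioned in this article.

```python
from html.parser import HTMLParser

# Assumed set of grouping tags, for illustration only; the real WL list may differ.
GROUPING_TAGS = {"section", "article", "ul", "p",
                 "h1", "h2", "h3", "h4", "h5", "h6"}


class PhraseGrouper(HTMLParser):
    """Collects text into phrases bounded by the outermost grouping tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of grouping tags currently open
        self.buffer = []   # text pieces of the phrase being built
        self.phrases = []  # completed phrases

    def handle_starttag(self, tag, attrs):
        if tag in GROUPING_TAGS:
            if self.depth == 0:
                self._flush()  # loose text before this tag is its own phrase
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in GROUPING_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self._flush()  # the outermost tag closed, so the phrase is complete

    def handle_data(self, data):
        if data.strip():
            self.buffer.append(data.strip())

    def _flush(self):
        if self.buffer:
            self.phrases.append(" ".join(self.buffer))
            self.buffer = []

    def close(self):
        super().close()
        self._flush()  # any trailing loose text becomes a final phrase


def group_phrases(html):
    parser = PhraseGrouper()
    parser.feed(html)
    parser.close()
    return parser.phrases


page = "<h1>Title</h1><section><p>One.</p><p>Two.</p></section>Loose text"
phrases = group_phrases(page)
```

Here the two `<p>` tags do not produce phrases of their own because they sit inside a `<section>`; their text is joined into the single phrase owned by the outermost tag.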
If any text or HTML within a parsed phrase changes on the web page, we will detect it as a brand-new phrase, which will require a new translation. This means that if one <p> tag within a <section> changes, the entire <section> will require a new translation, as the change in the <p> tag could affect the context of the entire <section>.
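One way to picture this behavior is to treat each phrase's full markup as its identity. The sketch below assumes a phrase is fingerprinted by hashing its HTML; the `phrase_key` helper and the use of SHA-256 are illustrative assumptions, not a description of how WL stores phrases.

```python
import hashlib


def phrase_key(phrase_html):
    """Fingerprint a phrase by its full markup (text and tags together)."""
    return hashlib.sha256(phrase_html.encode("utf-8")).hexdigest()


# Only the second <p> differs between the two versions of the page.
old = "<section><p>Free shipping.</p><p>Order today.</p></section>"
new = "<section><p>Free shipping.</p><p>Order now.</p></section>"

# The keys differ, so the whole <section> counts as a brand-new phrase.
needs_new_translation = phrase_key(old) != phrase_key(new)
```

Because the fingerprint covers the entire `<section>`, a change to a single `<p>` is enough to make the whole phrase new.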
This can also create what appear to be “duplicate” phrases when invalid HTML is corrected by certain web browsers. For example, Chrome is very good at correcting invalid HTML when it renders a page. Other browsers are not as good or do not correct invalid HTML at all. This can cause the same sentence to be rendered with one set of HTML tags in Chrome and another set of HTML tags in Safari. Because the HTML tags are different, this will be picked up as two phrases even though the text is the same. An easy way to find instances of this is to compare the source of a page within Chrome (right click on any open area of the page and choose View Page Source) to the HTML in Chrome’s Inspector (right click on any open area of the page and choose Inspect). Common causes of corrections are HTML tags that were not closed properly, special characters such as “&” or “©” that do not use proper HTML entities, and ® symbols that do not have a <sup> tag wrapped around them.
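To see why such phrases look like duplicates, it can help to compare the visible text of two fragments separately from their markup. The sketch below is a minimal example under that assumption; `TextOnly`, `visible_text`, and `looks_like_duplicate` are hypothetical helpers, and the two sample fragments are made up to stand in for View Page Source and Inspector output.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text content of a fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # Collapse whitespace so only the words themselves are compared.
    return " ".join("".join(parser.parts).split())


def looks_like_duplicate(a, b):
    """True when the markup differs but the visible text is identical."""
    return a != b and visible_text(a) == visible_text(b)


raw_source = "<p>Tools & Parts"              # unclosed <p>, bare ampersand
corrected = "<p>Tools &amp; Parts</p>"       # the same sentence after correction

is_duplicate = looks_like_duplicate(raw_source, corrected)
```

Both fragments read "Tools & Parts" to a visitor, yet their HTML differs, which is the situation that produces two phrases for one sentence.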
In addition to looking at the source text on the page, we monitor any changes to the page via our proprietary code. When we detect changes to the page, we automatically parse the text for those changes using the same process as above.
Before embedding the Translate.js code on your web page, we urge you to review the checklist below to ensure the highest quality translations and avoid easily correctable issues.
- Ensure your HTML is 100% valid so that browsers like Chrome do not make any corrections to it.
- Apply our "no-translate" tag to any section on the site that should not be translated. More details here.
- Wrap logical sections of content that should be grouped together as one phrase in the appropriate tag from the list above. The most common examples are:
  - Ensuring navigation bars are tagged using a <ul> tag
  - Wrapping related <p> tags from a single article in an <article> or tags
  - Wrapping similar phrases of text that are not considered an article in <section> tags
- Populate your Glossary within your Translate.com account with any proper nouns or specific translations you already have.
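For the first checklist item, a quick pre-flight check can catch the most common problem, tags that were never closed. The sketch below is a rough tag-balance check built on Python's standard `html.parser`, not a full HTML validator. It is stricter than the HTML specification, which allows some end tags (such as `</li>`) to be omitted, and `UnclosedTagFinder` and `find_problems` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Anything opened after the matching tag was never closed.
            while self.open_tags[-1] != tag:
                self.problems.append(f"<{self.open_tags.pop()}> was not closed")
            self.open_tags.pop()
        else:
            self.problems.append(f"</{tag}> has no opening tag")


def find_problems(html):
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    # Tags still open at the end of the document were never closed.
    leftovers = [f"<{tag}> was not closed" for tag in reversed(finder.open_tags)]
    return finder.problems + leftovers


nav_problems = find_problems("<ul><li>Home</li><li>About</ul>")
```

An empty list means the check found nothing to flag; a dedicated HTML validator remains the more thorough option.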