When a multinational organization states that its corporate language is English, it is deluding itself. And worse, it is doing an injustice to all of its non-native English speaking employees. The default language for cross-border communication may be English, but at a national level a significant amount of information will often be in a national language. Employees need to be able to find this information with the same ease as they can find information in English.
The problems of searching for non-English information are very complex, as complex as the variations between languages. It is impossible to do justice to multiple language search in a short column, so let's highlight just a few of the challenges.
Crawl and Index
The first challenge is to identify the language that is being used. A multilingual document may contain sections in different languages, or may mix languages within a single field (especially a title or summary field). The search application has to be able to identify each language and then index the text according to a set of rules for that language. That will involve breaking the text into tokens and applying a language-specific set of stop words before creating an index for each language that has been detected. Building a single index with a mix of languages is not a good approach.
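To make the detect-then-index idea concrete, here is a minimal sketch in Python. It assumes a deliberately crude detector that counts stop-word hits, with tiny stop-word lists and hypothetical names (STOP_WORDS, detect_language, build_indexes) invented for illustration. A production search application would use a proper language identifier and far larger word lists.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word lists (assumption: real lists are much larger).
STOP_WORDS = {
    "en": {"the", "and", "of", "is", "in", "to", "a"},
    "fr": {"le", "la", "les", "et", "de", "est", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "ein", "eine"},
}

def tokenize(text):
    """Lowercase the text and return runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def detect_language(text):
    """Guess the language whose stop words occur most often; None if none match."""
    tokens = tokenize(text)
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def build_indexes(docs):
    """Build one inverted index per language, detecting each field separately."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for text in fields.values():
            lang = detect_language(text)
            if lang is None:
                continue  # unknown language: skipped in this sketch
            for token in tokenize(text):
                if token not in STOP_WORDS[lang]:
                    indexes[lang][token].add(doc_id)
    return indexes

docs = {
    "doc1": {
        "title": "The history of the company",
        "summary": "Le rapport annuel de la société est un document",
    },
    "doc2": {"title": "Der Bericht und die Zahlen"},
}
indexes = build_indexes(docs)
```

Note that doc1 ends up in both the English and the French index because its fields are detected independently. German "die" is also dropped as a stop word only in the German index, which is one reason a single shared stop-word list does not work.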
Indexing challenges are difficult enough with languages using the Roman alphabet, but as you move east and run into Hebrew, Arabic and Thai, things become a lot more complex. Probably the ultimate challenges are Chinese, Japanese and Korean, often denoted by CJK, whose scripts include ideographs.
One often overlooked category is the document written in English by someone who is not a native speaker. It may use words and sentence constructions that defeat a search indexing application, which works from a fixed set of rules. A related issue is the way in which different countries present dates. Does the search application recognize that 5/4/2015 means May 4 in American usage but 5 April in British usage, and normalize each to the correct date?
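The date problem is easy to show in code. The sketch below assumes the search application already knows the locale of each document, which is the hard part in practice. The DATE_FORMATS table and the normalize_date function are hypothetical names for illustration.

```python
from datetime import datetime

# Hypothetical mapping from a document's locale to its numeric date convention.
DATE_FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}

def normalize_date(raw, locale):
    """Parse a numeric date using the locale's convention and return a date."""
    return datetime.strptime(raw, DATE_FORMATS[locale]).date()

us = normalize_date("5/4/2015", "en-US")
gb = normalize_date("5/4/2015", "en-GB")
print(us.isoformat())  # 2015-05-04
print(gb.isoformat())  # 2015-04-05
```

The same string yields two different dates, so storing an unambiguous form such as ISO 8601 in the index only helps if the locale is identified correctly first.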
If you are not fully conversant with a language, you may use a word that is difficult to match to the index. To take just one example, long compound words are a feature of both German and Dutch, and the search application needs to be able to break these down into the constituent word elements. These words also require a search box long enough to contain the compound word and any modifiers.
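As a rough illustration of breaking a compound into its elements, the sketch below uses a greedy, dictionary-based split. The LEXICON set and the decompound function are assumptions made for this example. A real decompounder would need a full dictionary and rules for linking elements (such as the German "s" between parts), which this sketch ignores.

```python
# Toy lexicon (assumption); real systems use a full dictionary.
LEXICON = {"kranken", "haus", "verwaltung", "versicherung", "daten", "bank"}

def decompound(word, lexicon=LEXICON, min_len=3):
    """Split a compound into lexicon words; return None if no full split exists."""
    word = word.lower()
    if not word:
        return []  # nothing left to split: the split so far is complete
    # Try the longest candidate prefix first, backtracking on failure.
    for end in range(len(word), min_len - 1, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = decompound(word[end:], lexicon, min_len)
            if rest is not None:
                return [prefix] + rest
    return None

print(decompound("Krankenhausverwaltung"))  # ['kranken', 'haus', 'verwaltung']
print(decompound("Datenbank"))              # ['daten', 'bank']
```

Indexing both the full compound and its parts lets a query for "Verwaltung" match a document that only contains the longer word.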
Or a word may have different meanings in different languages. A simple example from French would be “lard.” In English this is pig fat, but in French it means bacon. You can see both the linguistic similarity and difference at the same time. There are many other words in French that look English, and these are referred to by linguists as “false friends.”
Sorting out the inappropriate use of search terms in multiple languages also presents a problem for the search team as they work through search logs. If the organization has even a small quantity of information in languages other than English, then the team will need access to people who not only speak the languages but also understand the context of the way in which the word is being used.
The other major issue in query management is whether the search application is going to take, say, the English word “computer” and find references in French to “ordinateur.” The user may have to either conduct a search in multiple languages or use “computer OR ordinateur” and hope that the application can sort out the mess.
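One way an application could spare the user from typing "computer OR ordinateur" is to expand the query itself. The sketch below assumes a small hand-built term list. TRANSLATIONS and expand_query are hypothetical names, and joining the terms with AND is an assumption about the query syntax rather than a feature of any particular product.

```python
# Hypothetical bilingual term list; a real system would use a curated thesaurus.
TRANSLATIONS = {
    "computer": {"fr": ["ordinateur"], "de": ["Rechner"]},
}

def expand_query(query, target_langs):
    """Rewrite each query term as an OR group of its known translations."""
    parts = []
    for term in query.split():
        variants = [term]
        known = TRANSLATIONS.get(term.lower(), {})
        for lang in target_langs:
            for translation in known.get(lang, []):
                if translation not in variants:
                    variants.append(translation)
        if len(variants) > 1:
            parts.append("(" + " OR ".join(variants) + ")")
        else:
            parts.append(term)  # no translation known: leave the term alone
    return " AND ".join(parts)

print(expand_query("computer", ["fr"]))
# (computer OR ordinateur)
print(expand_query("computer policy", ["fr", "de"]))
# (computer OR ordinateur OR Rechner) AND policy
```

Terms without an entry pass through unchanged, so the quality of the results depends entirely on how complete the term list is.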
Will results from different language repositories be interleaved or presented as multiple result pages in a form of federated search? That only works when there are distinct repositories in each language. If there are instances of documents containing multiple languages in the same document, this approach is not effective. And no matter how difficult it is to present these results on a desktop page, the challenges on a mobile device are much more difficult to resolve successfully.
More Search Strategies
If you want to understand multilingual search in more detail, a good place to start is Part III of “Elasticsearch: The Definitive Guide.” The fact that this section is almost 100 pages long gives some indication of the complexity of the topic.
The Basis Technology website is a very good source of briefing papers on multilingual search. Certainly multilingual search needs a section all to itself in a search strategy, even if the decree has gone out to all employees that only English is an acceptable language of communication.
Title image by Georgie Pauwels