Simple methods for language detection

Let’s talk about some Python packages for detecting the language of a text.

Language detection is one of the core activities in natural language processing, and there are many Python packages that serve this purpose.

In this article we will discuss langid, langdetect, and TextBlob.

1. langid

The langid package comes pre-trained on a large number of languages (currently 97).

The langid package requires Python >= 2.7 and numpy. The main script langid/langid.py is cross-compatible with both Python 2 and Python 3, but the accompanying training tools are still Python 2 only.

langid.py is WSGI-compliant: it will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

You can use this GitHub link to explore more about the langid package.

```python
sample_data = ['Azaindole derivatives and their use as antithrombotic agents ',
               'Azaindol derivate und ihre Verwendung als antithrombotische Wirkstoffe ',
               "Dérivés de l'azaindole et leur utilisation comme agents antithrombotiques ",
               '']

import langid

for lang in sample_data:
    language = langid.classify(lang)
    print(f"{lang} is related to {language}")
```

And the output is like below:

```
Azaindole derivatives and their use as antithrombotic agents  is related to ('en', -173.22279596328735)
Azaindol derivate und ihre Verwendung als antithrombotische Wirkstoffe  is related to ('de', -239.93667697906494)
Dérivés de l'azaindole et leur utilisation comme agents antithrombotiques  is related to ('fr', -274.12151527404785)
 is related to ('en', 9.061840057373047)
```

2. langdetect

This module is a port of Google’s language-detection library and supports 55 languages. It does not ship with Python’s standard library, so it needs to be installed separately.

You can use this GitHub link to explore more about the langdetect package.

```python
sample_data = ['Azaindole derivatives and their use as antithrombotic agents ',
               'Azaindol derivate und ihre Verwendung als antithrombotische Wirkstoffe ',
               "Dérivés de l'azaindole et leur utilisation comme agents antithrombotiques ",
               '']

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

for lang in sample_data:
    try:
        print(detect(lang))
    except LangDetectException:
        # detect() raises on empty or purely non-linguistic input,
        # such as the empty string in sample_data
        pass
```

And, the output will be like below:

```
en
de
fr
```

NOTE

The language detection algorithm is non-deterministic, which means that if you run it on text that is either too short or too ambiguous, you might get different results each time.

To enforce consistent results, run the following code before the first language detection:

```python
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```

3. TextBlob

The TextBlob package requires the NLTK package and uses Google’s translation service for language detection.

Note: This solution requires internet access, because TextBlob detects the language by calling Google Translate’s API.

```python
sample_data = ['Azaindole derivatives and their use as antithrombotic agents ',
               'Azaindol derivate und ihre Verwendung als antithrombotische Wirkstoffe ',
               "Dérivés de l'azaindole et leur utilisation comme agents antithrombotiques ",
               '']

from textblob import TextBlob

for lang in sample_data:
    if not lang.strip():
        # detect_language() needs at least a few characters, so skip blanks
        continue
    b = TextBlob(lang)
    print(b.detect_language())
```

And, the output will be like below:

```
en
de
fr
```

In the next article, I will try to cover three more libraries.

Conclusion
In this article, I have tried to explain different Python libraries for detecting languages.
There are many more packages for language detection, such as spacy-langdetect, Pycld2, polyglot, Chardet, guess_language, fasttext, pycld3, and Googletrans.
Language detection is one of the key activities in the NLP process.
Hopefully, this article will help you.

Thanks for reading…