Facebook has launched a new AI language model called M2M-100 that can translate between 100 languages, with 4,450 possible language combinations. The project is released as open source on GitHub. M2M-100 is a deep learning model for multilingual machine translation that goes beyond English-centric approaches.
Facebook’s new polyglot machine translation model is being open-sourced to the research community, and the source code is available on GitHub. As of today, Facebook has released a 12B-parameter model trained on many-to-many data for 100 languages. Researchers can download this model from the official project site.
M2M-100 was developed using a range of automated data collection and machine learning techniques, and Facebook is now open-sourcing it for research by the community around the world. Researchers will be able to use the model in their own work, which should in turn help improve it.
M2M-100 is trained on 100 languages using many-to-many data, so that it can translate between any pair of them. The model covers a reported 4,450 possible language combinations and can translate 1,100 of those pairs directly. It is trained on more than 7.5 billion sentence pairs.
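As a quick sanity check on the scale of many-to-many translation, the short sketch below counts the language pairs that 100 languages can form. Plain combinatorics gives 4,950 unordered pairs and 9,900 translation directions, so the 4,450 figure quoted above should be read as the reported number rather than something this arithmetic reproduces. The language count and the 1,100 direct pairs come from the article; the variable names are only for illustration.

```python
from math import comb

NUM_LANGUAGES = 100   # from the article
DIRECT_PAIRS = 1_100  # pairs translated directly, from the article

# Unordered pairs: Chinese/French counts once.
unordered_pairs = comb(NUM_LANGUAGES, 2)

# Directions: Chinese->French and French->Chinese count separately.
directions = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

# Share of unordered pairs that are covered directly.
share_direct = DIRECT_PAIRS / unordered_pairs

print(f"unordered pairs: {unordered_pairs}")
print(f"directions:      {directions}")
print(f"direct share:    {share_direct:.1%}")
```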
This model differs from previous models, which used English as an intermediate language. With M2M-100, text can be translated into the target language directly, without passing through English. This is a significant step for the machine translation field. For example, to translate Chinese into French, earlier models would first translate the Chinese into English and then translate the English into French. This two-step process can introduce errors, because mistakes made in the first step carry over into the second. M2M-100 is a very complex model, and it is now available to researchers around the world.
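The difference between the two routes can be sketched in a few lines of Python. This is a toy illustration, not Facebook's code: the function names are hypothetical, the 0.9 per-step accuracy is a made-up number, and errors are assumed to be independent between steps, which real translation systems do not guarantee.

```python
def translation_path(src, tgt, direct=True, pivot="en"):
    """Return the sequence of languages a sentence passes through."""
    if direct or pivot in (src, tgt):
        return [src, tgt]
    return [src, pivot, tgt]


def chained_accuracy(per_step_accuracy, steps):
    """Chance that every step is correct, assuming independent errors."""
    return per_step_accuracy ** steps


direct_path = translation_path("zh", "fr")               # direct route
pivot_path = translation_path("zh", "fr", direct=False)  # route through English

for name, path in (("direct", direct_path), ("pivot", pivot_path)):
    steps = len(path) - 1
    accuracy = chained_accuracy(0.9, steps)  # 0.9 is an illustrative value
    print(f"{name}: {' -> '.join(path)} ({steps} step(s), accuracy {accuracy:.2f})")
```

With these assumed numbers the pivot route is right less often than the direct one simply because it has two chances to go wrong instead of one.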
The researchers used automated data curation techniques to collect the large amount of data needed for training. Web crawlers scraped billions of sentences from the web, and FastText was then used to identify the language of each one. No Facebook user data was used in training; the model relied only on publicly available data from the Internet. The researchers then used a program called LASER 2.0, developed previously by Facebook’s AI research lab, which uses unsupervised learning to match sentences across languages.
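The matching step can be pictured with a small Python sketch. It assumes each sentence has already been turned into a vector by some multilingual encoder and pairs sentences whose vectors point in nearly the same direction. This is a simplification using plain cosine similarity with a threshold, not LASER 2.0's actual procedure; the three-number vectors are made up by hand, and `mine_pairs` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(source, target, threshold=0.9):
    """Pair each source sentence with its closest target sentence.

    source and target map sentence text to an embedding vector.
    Pairs scoring below the threshold are discarded.
    """
    pairs = []
    for s_text, s_vec in source.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in target.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_text is not None and best_score >= threshold:
            pairs.append((s_text, best_text))
    return pairs


# Made-up embeddings; a real pipeline would get these from an encoder.
chinese = {
    "你好": [0.9, 0.1, 0.0],
    "谢谢": [0.1, 0.9, 0.1],
    "再见": [0.0, 0.1, 0.9],
}
french = {
    "bonjour": [0.88, 0.12, 0.02],
    "merci": [0.12, 0.91, 0.08],
    "fromage": [0.5, 0.5, 0.5],
}

mined = mine_pairs(chinese, french)
print(mined)
```

In this toy data "再见" has no good French counterpart, so it falls below the threshold and is dropped rather than being paired with an unrelated sentence.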
The current release of M2M-100 is for research only, and Facebook has no plans yet to use it in its products. According to Angela Fan, the project’s lead researcher, M2M-100 is meant for research purposes. The longer-term goal of the project is to improve and expand Facebook’s existing translation capabilities.
The source code of the project can be accessed at https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
The open-source release of the M2M-100 model and code should further fuel progress in machine translation.