A team of Romanian researchers has created a large language model (LLM) dedicated to the Romanian language, which can be used to develop A.I. tools and platforms. The model is “open source”, so it can be accessed and used by anyone who wants to build tools based on artificial intelligence. With the publication of this LLM, the initiators of the project are also launching the OpenLLM-Ro community, which aims to bring together all those who want to contribute in various forms to the development of A.I. technologies for the Romanian language. Both projects are initiated and carried out by POLITEHNICA Bucharest, the University of Bucharest and the Institute of Logic and Data Science, with the support of BRD Groupe Société Générale.
Although this technology has been widely accessible for only a few years, many of us have already interacted frequently with conversational robots such as ChatGPT (produced by OpenAI), Copilot (developed by Microsoft) and Gemini (developed by Google). However, for the Romanian language the results are sometimes imprecise, because the underlying models have not been exposed to many Romanian data sources during training. At the same time, these types of tools often cannot be used in companies, where direct access may be restricted for reasons of security and confidentiality. One solution in these situations is to deploy a local model hosted in the company’s own infrastructure. However, the public models that can be used locally are generally trained mainly on English text, with only a small number of documents in less widely spoken languages.
The Romanian model launched today is an adaptation of a public LLM developed mainly for English, which was additionally exposed to several million documents in Romanian in order to better understand the meaning of the words. This is essential for the performance of such models in situations where both the user’s request or question and the answer must be in Romanian. Since the second half of 2023, a team of researchers from POLITEHNICA Bucharest, the University of Bucharest and the Institute of Logic and Data Science has worked on the development and training of this LLM. The academic partners contributed researchers who worked pro bono and, in addition, POLITEHNICA Bucharest provided the computing power needed to train the model. The main partner of the project is BRD Groupe Société Générale, which supports innovation and the technologies of the future in Romania in all their forms.
“In order for the economic and/or institutional environment in Romania to be able to use this promising new technology, we need specialised models that have been exposed to a large number of conversations and documents in Romanian. The reason is simple: so that they can provide us with the information we need. At BRD we are constantly working on solutions that improve our work processes, using the latest technologies that can bring added value, first and foremost, to our customers. But we also understand that our needs are shared with many other institutional actors, and we are committed to supporting innovation in artificial intelligence from an early stage. By engaging in this very lively landscape, we can help the newest technologies to have a positive impact on Romanian society at almost the same pace as developments in the field at international level”, said Horia Velicu, Head of Innovation Lab at BRD Groupe Société Générale.
“Some examples of how the Romanian model can be used are: searching for information in an organisation’s knowledge base of guides and working procedures, or conversational robots that guide the clients of companies or institutions through the steps needed to use a product or service. In both cases, employees and/or customers save time in accessing information and, in many situations, the quality of that information also improves,” said Alin Stefanescu, Director of the Computer Science Department at the University of Bucharest and vice-president of the Institute of Logic and Data Science.
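To make the first use case more concrete, the following is a minimal sketch in Python of a knowledge-base search that prepares a Romanian prompt for a language model. It is only an illustration under stated assumptions, not the method used by the project: the three knowledge-base entries, the function names and the prompt layout are all hypothetical, the retrieval is a simple word-count cosine similarity, and the step that sends the prompt to a locally hosted model is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; a real one would hold an
# organisation's guides and working procedures.
KNOWLEDGE_BASE = {
    "card-blocat": "Pentru a debloca un card, deschideți aplicația și alegeți opțiunea Deblocare card.",
    "cont-nou": "Pentru a deschide un cont nou aveți nevoie de un act de identitate valabil.",
    "program": "Programul de lucru al agenției este de luni până vineri, între orele 9 și 17.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, kb, top_k=1):
    """Return the ids of the top_k entries most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        kb.items(),
        key=lambda item: cosine(query, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


def build_prompt(question, kb):
    """Combine the best-matching entry and the question into one prompt.

    The prompt layout is an assumption made for this sketch; a real
    deployment would follow the chat format of the chosen model.
    """
    doc_id = retrieve(question, kb)[0]
    return f"Context: {kb[doc_id]}\nÎntrebare: {question}\nRăspuns:"


if __name__ == "__main__":
    print(build_prompt("Cum pot debloca un card?", KNOWLEDGE_BASE))
```

In a real deployment the resulting prompt would be passed to a model running inside the organisation’s infrastructure, and the word-count retrieval would typically be replaced by a stronger search method.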
The effort to specialise a language model is frequently coordinated by the academic community associated with that language, with recent examples coming from countries such as France, Germany, Spain, Finland and Bulgaria. However, the resources required are considerable, in terms of both technical infrastructure (e.g. dedicated hardware such as high-performance graphics cards) and experienced researchers and programmers. There is therefore a need for broad, medium- and long-term support from many key actors in society: economic, academic and, last but not least, governmental, through programmes dedicated to the development of artificial intelligence technologies.
That’s why the developers of this model are launching the OpenLLM-Ro community at the same time. It aims to encourage interaction between the various actors or facilitators who wish to contribute to the development of this technology for the Romanian language and to launch specialised models for particular fields. Starting this dialogue in an open source environment is expected to accelerate the creation of better-performing models, deployed in Romanian companies or institutions, which in turn should increase overall productivity across society.
“We hope that the launch of this model will be only the beginning of a long-term effort that will result in better LLMs for the Romanian language. We have already found a method that we want to apply to other recently launched models (Llama-3 and Mistral), which generally perform better than the one we started from (Llama-2). However, in order to have good models for the Romanian language we need two types of resources: large, curated, good-quality data collections, as well as hardware resources (in particular, GPUs for model training). We hope that both private and public entities will understand the importance of developing large and multimodal (text and image) language models for the Romanian language. We look forward to everyone joining us in the OpenLLM-Ro initiative and the research projects that will support it”, said Traian Rebedea, lecturer at POLITEHNICA Bucharest and senior researcher at NVIDIA, one of the technical coordinators of the OpenLLM-Ro initiative.
The technical report can be found here: https://arxiv.org/abs/2405.07703
The model can be downloaded from the Hugging Face platform: https://huggingface.co/OpenLLM-Ro
The code associated with the model can be downloaded from GitHub: https://github.com/OpenLLM-Ro
More details about the project: https://ilds.ro/llm-for-romanian
BRD for Education, Technology & Innovation
BRD supports the preparation of future generations of technology specialists and entrepreneurs. The projects that BRD supports focus on education in STEM disciplines: First Tech Challenge Romania, the robotics laboratories at POLITEHNICA University of Bucharest, the Innovation Labs program, the partnership with the Applied Data Science Centre of the University of Bucharest, the Innovators for Children program and the How to Web Conference. These initiatives bring together pupils, students, teachers, young entrepreneurs and experts, and have reached more than 50,000 beneficiaries so far.
Source: European Digital Skills & Jobs Platform