Who are we?

I started working on this project on the 21st of February, 2022. I watched Vladimir Putin's speech, talked to my mom in Odesa and cried a lot, understanding that the war we had been discussing for months was about to start. And then I opened my laptop and started scraping data. I knew I needed to do something. I knew that being a Ukrainian NLP scientist working at a European university positioned me almost uniquely to research propaganda from the perspective of computational linguistics, with a full understanding of its contexts and tactics, and that by publishing I could also raise public awareness of the war. After training some models, which then sat gathering dust on my laptop's desktop for a while, I understood that I needed to get them out there and let everyone use them; even if they don't work perfectly, maybe they will teach someone a thing or two about Russian propaganda and how it works.

When I found out about Nika's project, I realized that I couldn't pass it up. Being in Germany, I understood that I had to help in any way I could, and this project was and is a great opportunity to take part in the information war. It gave me the opportunity to learn a lot and to improve my existing knowledge. Even if this project is not perfect, we put our hearts and souls into it, and I personally hope it will help other people understand what propaganda is and how it works.

What is this project for?

Many European citizens become targets of Kremlin propaganda campaigns that aim to minimise public support for Ukraine, foster a climate of mistrust and disunity, and shape elections (Meister, 2022). To address this challenge, we developed “Check News in 1 Click”, the first NLP-empowered pro-Kremlin propaganda detection application available in 7 languages, which provides lay users with feedback on their news and explains the manipulative linguistic features and keywords it detects. We conducted a user study, analysed user entries and model behaviour paired with questionnaire answers, and investigated the advantages and disadvantages of the proposed interpretative solution.

Methods

We implement binary classification on input vectors consisting of 41 handcrafted linguistic features and 116 keywords (normalized by the length of the text in tokens), using the following models: decision tree, linear regression, support vector machine (SVM) and neural networks, with stratified 5-fold cross-validation. For comparison with learned features, we extract embeddings using a multilingual BERT model and train a linear model on them. We perform three sets of experiments contrasting the handcrafted and learned features (sketches of both setups follow the list below):

  • Experiment 1. Training models on Russian, Ukrainian, Romanian and English newspaper articles, and evaluating them on the test sets of these languages and on French newspaper articles. We add the French newspapers to benchmark the multilingualism of our models. We choose French because it is in the same language family as Romanian.
  • Experiment 2. Training models on Russian, Ukrainian, Romanian, English and French newspaper articles, and validating them on the test set. Additionally, we apply these models to the Russian and Ukrainian Telegram data. The goal here is to investigate whether a model trained only on newspapers performs well out-of-the-box on the Telegram posts, which are 10 to 20 times shorter.
  • Experiment 3. Training models on the combined newspaper and Telegram data and applying them to the test set. Here we verify whether adding the Telegram data to the training set improves generalization, even though the two data distributions differ.
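
As a rough illustration of the handcrafted-feature setup, here is a minimal sketch of the classification with stratified 5-fold cross-validation, assuming scikit-learn. The data is a random placeholder, a logistic-regression classifier stands in for the linear model, and the hyperparameters and accuracy metric are illustrative rather than the exact configuration we used.

    # A minimal sketch of the feature-based classification described above,
    # assuming scikit-learn. The data here is a random placeholder: in the
    # real pipeline each article is represented by 41 handcrafted linguistic
    # features plus 116 keyword features normalized by text length in tokens.
    # A logistic-regression classifier stands in for the linear model.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(42)
    X = rng.random((1000, 41 + 116))   # placeholder feature matrix
    y = rng.integers(0, 2, size=1000)  # 1 = pro-Kremlin propaganda, 0 = other

    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "linear model": LogisticRegression(max_iter=1000),
        "SVM": SVC(kernel="linear"),
        "neural network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                        random_state=0),
    }

    # Stratified 5-fold cross-validation over the feature vectors.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for name, clf in models.items():
        pipe = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(pipe, X, y, cv=cv)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

The same cross-validation protocol is reused for Experiments 1 to 3; only the composition of the training and test sets (newspaper languages, Telegram data) changes.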
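
The learned-feature baseline can be sketched in a similarly hypothetical way: extract multilingual BERT embeddings for each article and feed them to the linear model above. The checkpoint name, mean pooling and the Hugging Face transformers library are assumptions for illustration, not necessarily our exact setup.

    # Hypothetical sketch of the learned-feature baseline: mean-pooled
    # multilingual BERT embeddings, assuming the Hugging Face transformers
    # library; the checkpoint and pooling strategy are illustrative choices.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(texts):
        """Return one mean-pooled mBERT vector per input text."""
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state   # (batch, tokens, 768)
        mask = enc["attention_mask"].unsqueeze(-1)   # mask out padding tokens
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Placeholder texts; the resulting vectors replace the handcrafted
    # features in the same cross-validation loop as above.
    X_bert = embed(["example article text", "another example article"])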