Financial Misinformation Detection

Introduction

In the financial sector, the accuracy of information is crucial to sound decisions, market operations, risk management, compliance, and trust. However, the proliferation of digital media has accelerated the spread of financial misinformation. Such misinformation, including deceptive investment propositions and biased news articles, can manipulate market prices and influence economic sentiment, posing substantial risks. The advent of LLMs in finance has introduced transformative potential for analysis [1], prediction [2], and decision-making [3]. In this challenge, participants are expected to engineer LLMs capable of not only identifying fraudulent financial content but also generating clear, concise explanations that elucidate the reasoning behind the classification by leveraging claims and contextual information. This requirement for explanation generation is crucial: it adds a layer of transparency and trust to the AI's decision-making process, enhancing the model's utility for investors, regulatory bodies, and the broader financial community.

The goal of this challenge is to create a specialized LLM that excels in pinpointing financial misinformation and articulating its findings. By incorporating explanations, the model not only identifies misinformation but also educates users about the nature of the misinformation, leveraging a wide array of financial domain features such as income, finance, economics, and budget. This approach aims to fortify the model's effectiveness and contribute to a more transparent, accountable, and stable financial environment. The ability to detect and provide explanations for fake financial news is vital for safeguarding investors and diminishing adverse effects on financial markets, thereby promoting a well-informed and resilient financial ecosystem.

Task

This task tests the ability of LLMs to verify financial misinformation while generating plausible explanations. Participants need to develop or adapt LLMs to classify financial claims ('True'/'False'/'Not Enough Information') and explain their decisions based on the related information, following the designed prompt template for the query.

The following template is an example of constructing instruction-tuning data to support the training and evaluation of LLMs [4]. Participants may also adjust the template to make full use of all available information.

Task: [task prompt]. Claim: [claim]. Context: [context]. Prediction: [output1]. Explanation: [output2]


[task prompt] denotes the instruction for the task (e.g., "Please determine whether the claim is True, False, or Not Enough Information based on contextual information, and provide an appropriate explanation."). [claim] and [context] are the claim text and the contextual content from the raw data, respectively. [output1] and [output2] are the outputs of the LLM.
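The template above can be filled programmatically. The sketch below is illustrative only: the function name and the example claim/context are hypothetical, and only the field layout comes from the task template.

```python
# Hypothetical helper that fills the shared-task prompt template.
# The claim and context below are made-up illustrations, not dataset rows.
def build_prompt(task_prompt, claim, context):
    """Assemble the query an LLM receives; Prediction/Explanation follow."""
    return (f"Task: {task_prompt} Claim: {claim}. "
            f"Context: {context}. Prediction:")

task_prompt = ("Please determine whether the claim is True, False, or Not "
               "Enough Information based on contextual information, and "
               "provide an appropriate explanation.")
prompt = build_prompt(task_prompt,
                      "Company X doubled its revenue in 2023",
                      "Company X reported a 12% revenue increase in 2023.")
```

During training, the gold label and explanation would be appended after "Prediction:" as the target sequence.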

Dataset

The task leverages the FIN-FACT [5] dataset, a comprehensive collection of financial claims categorized into areas such as Income, Finance, Economy, Budget, Taxes, and Debt. The claim label categorizes claims as 'True', 'False', or 'NEI (Not Enough Information)'. The dataset contains the following information.

NOTE: Participants must predict labels and generate explanations (evidence) simultaneously with a single model, using the other information in the data. The blind test set will not include the label and evidence columns. Closed-source LLMs (e.g., ChatGPT, GPT-4) may also be used; however, participants must provide the hyperparameters (especially the seed) and the scripts required to ensure reproducibility. Evidence consists of sentences that are supported by some form of proof and can be part of the justification. Participants may not copy the justification directly, nor prompt the LLM in a similar manner, to generate the explanations (evidence).
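For the reproducibility requirement above, fixing every random seed in the pipeline is the key step. A minimal sketch, assuming a Python pipeline (framework-specific seeding such as NumPy or PyTorch would be added analogously):

```python
# Minimal reproducibility sketch: seed the randomness sources so organizers
# can rerun the submitted scripts and obtain identical outputs.
import os
import random

def set_seed(seed: int) -> None:
    random.seed(seed)                       # Python's built-in RNG
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If the pipeline uses NumPy or PyTorch, also seed them, e.g.:
    #   numpy.random.seed(seed); torch.manual_seed(seed)

set_seed(42)
first_draw = random.random()
set_seed(42)
assert random.random() == first_draw  # identical runs under the same seed
```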


Data Examples:

[Figure: FND data example]

Evaluation

The task uses Accuracy, Precision, Recall, and Micro-F1 for misinformation detection evaluation, and ROUGE (1, 2, and L) [6] and BERTScore [7] for explanation evaluation. ROUGE and BERTScore are commonly used to evaluate the quality of automatically generated text and its similarity to reference text.

We use the average of the F1 and ROUGE-1 scores as the final ranking metric.
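To make the ranking metric concrete, here is an illustrative pure-Python re-implementation. It is a sketch only: the official scorer may differ in details such as tokenization, stemming, or F1 averaging, so treat it as an approximation of the metric described above.

```python
# Sketch of the ranking metric: micro-F1 over the three labels averaged with
# ROUGE-1 F1 (unigram overlap). Whitespace tokenization is an assumption.
from collections import Counter

def micro_f1(gold, pred):
    # With exactly one predicted label per example, micro-F1 reduces to
    # accuracy (every false positive is also a false negative).
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def rouge1_f1(reference, candidate):
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def final_score(gold_labels, pred_labels, refs, hyps):
    rouge = sum(rouge1_f1(r, h) for r, h in zip(refs, hyps)) / len(refs)
    return (micro_f1(gold_labels, pred_labels) + rouge) / 2
```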

Model Cheating Detection

To measure the risk of test-set data leakage into model training (model cheating), participants need to upload their final model to Hugging Face, along with the scripts necessary for model cheating detection.

[1] When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial Domain. (https://aclanthology.org/2022.emnlp-main.148)

[2] BloombergGPT: A Large Language Model for Finance. (https://arxiv.org/abs/2303.17564)

[3] PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. (https://arxiv.org/abs/2306.05443)

[4] FMDLlama: Financial Misinformation Detection based on Large Language Models. (https://www.arxiv.org/abs/2409.16452)

[5] Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation. (https://arxiv.org/abs/2309.08793)

[6] ROUGE: A Package for Automatic Evaluation of Summaries. (https://aclanthology.org/W04-1013)

[7] BERTScore: Evaluating Text Generation with BERT. (https://openreview.net/pdf?id=SkeHuCVFDr)

Registration

Registration is closed.

Please choose a unique team name and ensure that all team members provide their full names, emails, institutions, and the team name. Every team member should register using the same team name. We encourage you to register with your institutional email.

Schedule

Practice data: Huggingface link

Baseline: Github link

Training data: Huggingface link

Test set: Huggingface link

Submission template: link (NOTE: label values — 0: False, 1: True, 2: NEI)
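The label encoding above (0: False, 1: True, 2: NEI) can be applied when writing out predictions. The sketch below is hypothetical: the column names and CSV layout are assumptions for illustration, and only the integer encoding comes from the task page — participants should follow the official submission template.

```python
# Hypothetical submission writer; only the 0/1/2 label encoding is given by
# the task page. Columns ("id", "label", "explanation") are assumptions.
import csv
import io

LABEL2ID = {"False": 0, "True": 1, "NEI": 2}

def write_submission(rows, fh):
    """rows: iterable of (claim_id, label_string, explanation)."""
    writer = csv.writer(fh)
    writer.writerow(["id", "label", "explanation"])
    for claim_id, label, explanation in rows:
        writer.writerow([claim_id, LABEL2ID[label], explanation])

buf = io.StringIO()
write_submission([(0, "True", "Supported by the 2023 filing."),
                  (1, "NEI", "No verifiable source found.")], buf)
```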

Submission link

Link: https://softconf.com/coling2025/FinNLP25/
Financial Misinformation Detection track 

Submission


Shared Task Organizers

Contact