Chinese Language Model Evaluation With CLEVA
Published on Sat Sep 09 2023 by Dustin Van Tate Testa
[Image: Yum Cha | Phil Whitehouse on Flickr]
Chinese large language models (LLMs) have been gaining attention in natural language processing, but evaluating their capabilities has been difficult due to the lack of a comprehensive benchmark and standardized evaluation procedures. To address these issues, a team of researchers has developed CLEVA, the Chinese Language Models EVAluation Platform.
CLEVA is a user-friendly platform that allows researchers and developers to evaluate Chinese LLMs in a holistic manner. It employs a standardized workflow and regularly updates a competitive leaderboard. To ensure unbiased evaluation, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each evaluation round.
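The paper does not spell out the sampling mechanism, but the idea of guaranteeing a unique subset per round can be sketched as seeded sampling that excludes previously used items. The function name `sample_round_subset` and its parameters are hypothetical, not CLEVA's actual API:

```python
import hashlib
import random

def sample_round_subset(item_ids, round_id, k, used):
    """Hypothetical sketch: draw k items for one evaluation round,
    deterministically seeded by the round ID and excluding items
    already consumed by earlier rounds."""
    pool = [i for i in item_ids if i not in used]
    # Derive a reproducible seed from the round identifier.
    seed = int(hashlib.sha256(str(round_id).encode()).hexdigest(), 16) % (2**32)
    subset = random.Random(seed).sample(pool, k)
    used.update(subset)
    return subset

used = set()
round1 = sample_round_subset(range(100), "round-1", 10, used)
round2 = sample_round_subset(range(100), "round-2", 10, used)
assert not set(round1) & set(round2)  # rounds never share test items
```

Because the seed is derived from the round ID, a round can be re-run reproducibly, while the shared `used` set keeps later rounds disjoint from earlier ones.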
The platform offers a comprehensive Chinese benchmark, incorporating various metrics to evaluate LLMs' performance across different dimensions. It includes tasks that assess specific LLM skills as well as performance in real-world applications. With over 9 million queries from 84 datasets and 9 metrics, CLEVA boasts the most extensive Chinese evaluation dataset to date.
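To make the multi-metric setup concrete, here is a minimal sketch of averaging per-example metrics over a dataset. The metric and function names are illustrative assumptions, not CLEVA's actual implementation of its 9 metrics:

```python
def exact_match(pred, gold):
    """Illustrative metric: 1.0 if the stripped strings match, else 0.0."""
    return float(pred.strip() == gold.strip())

def evaluate(preds, golds, metrics):
    """Average each named metric over a dataset's examples."""
    return {name: sum(fn(p, g) for p, g in zip(preds, golds)) / len(golds)
            for name, fn in metrics.items()}

scores = evaluate(["北京", "上海 "], ["北京", "上海"],
                  {"exact_match": exact_match})
# scores["exact_match"] == 1.0
```

Reporting several such metrics per task, rather than a single score, is what lets a benchmark compare models along different dimensions.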
One of the key features of CLEVA is its standardized prompt-based evaluation methodology. It provides a set of prompts for each task, ensuring comparable evaluation results and encouraging further analysis of LLMs' sensitivity to different prompts.
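A prompt-based evaluation of this kind can be sketched as rendering each question under a fixed set of templates, so every model sees identical inputs and sensitivity across templates can be measured. The templates below are invented examples, not CLEVA's actual prompts:

```python
# Hypothetical standard templates for a question-answering task.
PROMPT_TEMPLATES = [
    "问题：{question}\n答案：",
    "请回答下面的问题。\n{question}\n",
]

def render_prompts(question):
    """Render one question under every standard template, so results
    are comparable across models and across prompt variants."""
    return [t.format(question=question) for t in PROMPT_TEMPLATES]

for prompt in render_prompts("中国的首都是哪里？"):
    print(prompt)
```

Scoring a model on every rendered variant, rather than a single ad-hoc prompt, is what makes prompt-sensitivity analysis possible.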
To address the issue of train-test contamination, CLEVA adopts a proactive approach by collecting extensive new data and frequently organizing new evaluation rounds. This strategy mitigates the risk of contamination and improves the trustworthiness of the evaluation results.
The efficacy of CLEVA has been validated through large-scale experiments involving 23 influential Chinese LLMs. A user-friendly interface allows users to explore the evaluation results of different models and conduct their own evaluations with minimal coding required.
The researchers behind CLEVA acknowledge the platform's limitations, such as the reliance on inference wall-clock time as a metric and the difficulty of evaluating privacy. However, they are committed to updating and improving the platform to reflect the latest advancements in evaluation methods.
In terms of ethics, responsible data collection and usage are prioritized. Manual data collection methods are widely adopted, and participants are properly compensated. When evaluating LLMs on potentially harmful content, caution is exercised, and the datasets are intended solely for LLM evaluation.
Overall, CLEVA provides a much-needed platform for the comprehensive evaluation of Chinese LLMs, addressing the challenges posed by the absence of a benchmark, unstandardized evaluation procedures, and the risk of contamination. It facilitates the development and improvement of Chinese LLMs and enhances their applicability in various tasks and applications.