Xunzi Series of Large Language Models: A New Tool for Ancient Text Processing

Ancient texts are among the most valuable records of human culture, and digital technology brings both new opportunities and new challenges for working with them. How to use modern tools to explore, organize, and study these texts has become a focus for many scholars and engineers. The Xunzi series of large language models offers a new approach to this problem.

I. Introduction to the Xunzi Series of Models

The open-source Xunzi series has two main components: the base model XunziALLM and the conversational model XunziChat. XunziALLM, the centerpiece of this release, is a fully open large language model for the ancient-text domain. To help users without an artificial intelligence background understand and use it, the development team also built the conversational model XunziChat from a portion of the data. The Xunzi models can be called in the same way as open-source models such as Qwen, Baichuan2, and ChatGLM3.

Multiple versions of the ancient-text large language model have been released, each based on a different open-source model.

II. Highlights of the Xunzi Series of Models

(I) Intelligent Indexing of Ancient Texts

The Xunzi model performs intelligent indexing of ancient texts. It assigns thematic index terms to the content of a text, in effect giving the text a smart index, so that researchers can quickly grasp its core themes. For instance, given a lengthy historical work, the model can index the major historical events and the biographies of important figures, which makes it much faster for researchers to locate information.

(II) Key Information Extraction from Ancient Texts

The Xunzi model can automatically extract key information from ancient texts. It identifies and extracts elements such as persons, events, and locations, saving researchers considerable time in screening and organizing information. When studying an ancient literary work, for example, a researcher no longer has to comb through character relationships and settings by hand; the model presents this information directly, so the researcher can move on to in-depth analysis sooner.
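
The extracted elements usually need to be turned into structured records before further analysis. The article does not specify the model's output format, so the sketch below assumes the reply contains triplets written as parenthesized, comma-separated text such as (Dou Ying, PP/Attitude Tendency/Negative, Cuo). The names Triplet and parse_triplets are hypothetical helpers, not part of the Xunzi release.

```python
import re
from typing import List, NamedTuple


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


# Assumed format: "(head, relation, tail)", with ASCII or full-width
# brackets and commas. The real model output may differ.
_TRIPLET_RE = re.compile(
    r"[(（]([^()（）,，]+)[,，]([^()（）,，]+)[,，]([^()（）,，]+)[)）]"
)


def parse_triplets(model_output: str) -> List[Triplet]:
    """Collect every (head, relation, tail) group found in the reply text."""
    return [
        Triplet(head.strip(), relation.strip(), tail.strip())
        for head, relation, tail in _TRIPLET_RE.findall(model_output)
    ]


sample_reply = "(Dou Ying, PP/Attitude Tendency/Negative, Cuo)"
for triplet in parse_triplets(sample_reply):
    print(triplet.head, "|", triplet.relation, "|", triplet.tail)
```

Text in the reply that does not match the assumed pattern is simply ignored, so the parser returns an empty list rather than failing when the model answers in another format.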

(III) Poetry Generation

For poetry enthusiasts, the Xunzi model’s poetry generation capability is a welcome addition. Given a theme or keywords, it generates classical-style poems that follow grammatical and metrical rules. This gives writers a source of inspiration and helps learners understand the techniques and styles of classical poetry. For example, given the theme “Autumn Night Longing for Home”, the model can produce evocative, metrically regular poems that may prompt further writing.
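
A generated poem can be given a rough formal check before it is shown to a reader. The sketch below is only a minimal illustration: it tests whether a poem has the outward shape of a quatrain (four lines of five or seven characters each) and does not check tonal patterns or rhyme. The helper names are hypothetical.

```python
import re
from typing import List


def split_lines(poem: str) -> List[str]:
    """Split a poem into lines at punctuation marks and whitespace."""
    parts = re.split(r"[，。！？；、,.!?;\s]+", poem)
    return [part for part in parts if part]


def is_regular_quatrain(poem: str) -> bool:
    """Return True for four lines that all have 5 or all have 7 characters."""
    lines = split_lines(poem)
    if len(lines) != 4:
        return False
    length = len(lines[0])
    return length in (5, 7) and all(len(line) == length for line in lines)


# A well-known five-character quatrain used purely as test input.
print(is_regular_quatrain("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
```

A check like this only filters out obviously malformed output; judging meter and rhyme properly would require tone and rhyme tables that are outside the scope of this sketch.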

(IV) High-Quality Translation of Ancient Texts

Obscure ancient texts have long been difficult even for researchers. The Xunzi model’s translation function acts as a bridge across this language barrier. Whether the source is a philosophical classic or a historical work, the translation helps non-specialist readers grasp its core ideas and content more accurately, which supports the wider dissemination of ancient texts.

(V) Reading Comprehension

The Xunzi model can analyze and interpret ancient text content, providing automatic reading comprehension. It works like an intelligent commentator that explains the complex sentence structures, archaic vocabulary, and allusions found in ancient texts, helping readers reach a fuller and deeper understanding of what a text says and implies.

(VI) Lexical Analysis

The lexical analysis function is particularly relevant to linguistic research. It automatically performs word segmentation and part-of-speech tagging on ancient text content, giving linguists an efficient research tool. With reliable lexical analysis, scholars can more easily study the vocabulary, grammatical evolution, and stylistic characteristics of Classical Chinese.
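
Segmented and tagged output becomes useful for corpus work once it is converted into (word, tag) pairs. The sketch below assumes the model returns whitespace-separated tokens in the common word/tag notation; both that format and the tag labels in the example are assumptions, since the article does not describe the actual output.

```python
from collections import Counter
from typing import List, Tuple


def parse_tagged(output: str) -> List[Tuple[str, str]]:
    """Split 'word/tag word/tag ...' into (word, tag) pairs.

    Tokens without a '/' are kept with an empty tag so that nothing the
    model produced is silently dropped.
    """
    pairs = []
    for token in output.split():
        # rpartition splits on the last '/', so a word containing '/' survives.
        word, sep, tag = token.rpartition("/")
        pairs.append((word, tag) if sep else (token, ""))
    return pairs


# Hypothetical tagged reply; the tag set is illustrative only.
pairs = parse_tagged("学/v 而/c 时/n 习/v 之/r")
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```

Counting tags this way is a simple starting point for the vocabulary and grammar studies mentioned above.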

(VII) Automatic Punctuation

Ancient texts often lack modern punctuation, which makes them hard to read and understand. The Xunzi large language model’s automatic punctuation function segments sentences and adds punctuation marks, making the text clearer and easier to read. This helps both professional researchers and amateur enthusiasts read ancient texts more fluently and understand them more accurately.
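
Because a generative model may occasionally alter or drop characters while adding punctuation, it can be worth verifying that the punctuated output differs from the source only in punctuation. The following sketch is one possible check, not part of the Xunzi release; the function names are hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove punctuation and whitespace, keeping every other character."""
    return "".join(
        ch
        for ch in text
        # Unicode general categories starting with "P" are punctuation.
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def punctuation_only_changed(original: str, punctuated: str) -> bool:
    """True if both strings contain the same characters in the same order
    once punctuation and whitespace are ignored."""
    return strip_punctuation(original) == strip_punctuation(punctuated)


source = "學而時習之不亦說乎"
print(punctuation_only_changed(source, "學而時習之，不亦說乎？"))
```

If the check fails, the safest response is usually to keep the original text and flag the passage for manual review.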

III. How to Call the Xunzi Series of Models

Take the Xunzi-Qwen1.5-7B_chat model as an example. Its conversational interface can be called with the third-party Python library openai. Sample code for calling the model:

from openai import OpenAI
from tqdm import tqdm

# Placeholder API key (the sample passes an arbitrary string).
openai_api_key = "ANY THING"
openai_api_base = "http://xunziallm.njau.edu.cn:21180/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Task instruction with the relation scheme, then a newline, then the passage.
prompt = (
    'Based on the provided text, extract relationship triplets that match the description according to the relationship scheme: (Person, PO/Official Position, Official Position), (Person, PP/Attitude Tendency/Negative, Person), (Person, PL/Other, Location), (Person, PL/Live, Location), (Personal Pronoun, Attitude Tendency/Negative, Person)\n'
    'Upon submission, the emperor ordered princes, marquises, and members of the imperial clan to gather and discuss. No one dared to object, except for Dou Ying, who contested the matter, thereby creating a rift with Cuo.'
)

# The loop runs once; widen the range to repeat the request.
for i in tqdm(range(0, 1)):
    chat_response = client.chat.completions.create(
        model="/home/gpu0/xunzi_web/Xunzi-Qwen1.5-7B_chat",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    # Print the text of the first (and only) returned choice.
    print(chat_response.choices[0].message.content)
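
The same calling pattern can be reused for the other tasks described in Section II by changing only the instruction text. The sketch below shows one way to organize this. The instruction wordings in TASK_PROMPTS are hypothetical illustrations, not the prompts the Xunzi models were trained on, and build_messages and ask are helper names introduced here; ask expects an OpenAI-compatible client object like the one created in the sample code.

```python
from typing import Dict, List

# Hypothetical instruction templates, written for illustration only.
TASK_PROMPTS: Dict[str, str] = {
    "punctuate": "Add modern punctuation to the following ancient text:",
    "translate": "Translate the following ancient text into modern Chinese:",
    "segment": "Segment the following ancient text into words and tag "
               "each word with its part of speech:",
}


def build_messages(task: str, text: str) -> List[Dict[str, str]]:
    """Build the chat message list for one task and one passage."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{TASK_PROMPTS[task]}\n{text}"},
    ]


def ask(client, model: str, task: str, text: str) -> str:
    """Send one request through an OpenAI-compatible client and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(task, text),
    )
    return response.choices[0].message.content


# Inspect the request that would be sent, without contacting the server.
for message in build_messages("punctuate", "學而時習之不亦說乎"):
    print(message["role"], ":", message["content"])
```

Keeping message construction separate from the network call makes the prompts easy to inspect and adjust before any request is sent.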

IV. Model Optimization and Disclaimer

The Xunzi series of large language models performs well at processing Chinese ancient texts, analyzing their complexity, and bringing out the rich content of traditional Chinese culture. The development team is nonetheless well aware that the models still need considerable improvement and optimization. The team welcomes user feedback and intends to keep improving the models and to release better-performing versions in the future.

Note, however, that the very large number of parameters in large language models introduces considerable randomness. Although every effort has been made to ensure the compliance of the training data, the complexity of the data and the models may still lead to unavoidable issues. The development team therefore assumes no liability for problems arising from the use of this open-source model, including but not limited to data security issues, public opinion risks, and any risks or problems caused by the model being misled, misused, disseminated, or improperly used.

Additionally, in accordance with the “Provisional Measures for the Administration of Generative Artificial Intelligence Services” jointly issued by the Cyberspace Administration of China and six other departments, please comply with relevant laws and regulations when training or using this model and other generative models, and help build a harmonious, healthy, and sustainable generative AI community.

Should you have any questions regarding the use of the model, please feel free to contact the developers at zhaozhixiao@stu.njau.edu.cn.

V. Acknowledgments

The successful launch of the Xunzi series of large language models would not have been possible without the strong support of numerous collaborating organizations and researchers. Gratitude is extended to the following entities and individuals:

Department of Information, School of Economics and Management, Nanjing University of Science and Technology

  • Associate Professor Shen Si

School of Chinese Language and Literature, Nanjing Normal University

  • Professor Li Bin

National Library of China

  • Deputy Researcher Ma Xueliang

These collaborating organizations and researchers have provided invaluable academic resources, professional knowledge, and technical guidance during the development of the model, making significant contributions to its continuous improvement and development.

The Xunzi series of large language models opens a new chapter in ancient text processing. With its broad range of capabilities, it is expected to contribute to ancient text research, cultural heritage preservation, and the development of related disciplines. As the technology advances and the models are refined, we believe the Xunzi series will play an increasingly important role in the digitization of ancient texts, enabling more people to appreciate these texts and contributing to the inheritance and promotion of traditional Chinese culture.