In this paper, we propose KnowCoder, a Large Language Model (LLM) for Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation method that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be modeled in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning of LLMs, KnowCoder employs a two-phase learning framework that enhances the schema understanding ability via code pretraining and the schema following ability via instruction tuning.
[2024-03-11]: We released the initial version of KnowCoder!
We released KnowCoder, a powerful Large Language Model for Universal Information Extraction that injects tens of thousands of types of structured knowledge through code.
KnowCoder has been evaluated on 33 widely used information extraction benchmarks:
- After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability, achieving a 49.8% relative F1 improvement over LLaMA2 on NER under the few-shot setting.
- After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas, achieving relative improvements of up to 12.5% and 21.9% under the zero-shot and low-resource settings, respectively.
- Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, yielding relative improvements of up to 7.5% under the supervised setting.
The code-style schema representation method comprises three basic classes, namely "Entity", "Relation", and "Event". Based on these three basic classes, we represent all the concepts in the schemas by corresponding classes, so that the instances of each concept can be represented by objects of the corresponding class. Each schema consists of a class name, class inheritance, class comments, type hints, and class methods. The detailed explanation of each component can be found in our paper.
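To make this concrete, below is a minimal sketch of what such code-style schemas can look like; the concept names (`Person`, `Organization`, `WorkFor`) are hypothetical examples rather than classes taken from our schema library. Class comments carry the concept descriptions, and the typed `__init__` arguments express constraints among types:

```python
class Entity:
    """The base class for all entity types."""
    def __init__(self, name: str):
        self.name = name

class Person(Entity):
    """A human being. The class comment holds the concept description."""

class Organization(Entity):
    """A group of people organized for a collective purpose."""

class Relation:
    """The base class for all relation types."""

class WorkFor(Relation):
    """A person works for an organization. The type hints constrain
    which entity types the relation may connect."""
    def __init__(self, head_entity: Person, tail_entity: Organization):
        self.head_entity = head_entity
        self.tail_entity = tail_entity
```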
We construct the code-style schema library under this schema representation method based on Wikidata (we use the Wikidata dump up to 2022-07-04). We select the concepts included in the existing IE datasets created from Wikidata, i.e., KELM, UniversalNER, InstructIE, and LSEE, and derive the constraints among concepts according to their co-occurrences. To construct the taxonomies, we extract the "subclass of" relations among these concepts from Wikidata. To obtain the description of a concept, we use its definition from Wikidata directly, or generate a description with GPT-4 if its Wikidata definition is missing. Finally, the constructed schema library encompasses 29,177 entity types, 876 relation types, and 519 event types, i.e., over 30,000 types in total. The detailed statistics of the schema library are shown in the table below. Here, "#Type" denotes the total number of types, "#Type w/ desc." indicates the count of types with descriptions, and "#Type w/o desc." signifies the count of types without descriptions.
The schema library data can be found in 🤗Schema-Library.
Task | #Type | #Type w/ desc. | #Type w/o desc. |
---|---|---|---|
NER | 29,177 | 19,856 | 9,321 |
RE | 876 | 840 | 36 |
EE | 519 | 515 | 4 |
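As a rough illustration of the taxonomy-construction step described above, the sketch below collects "subclass of" (Wikidata property P279) edges among a set of selected concepts from a Wikidata JSON dump. The function and dump handling here are illustrative assumptions, not the exact code we used:

```python
import json

SUBCLASS_OF = "P279"  # Wikidata property id for "subclass of"

def extract_taxonomy(dump_path: str, concept_ids: set) -> list:
    """Collect (child, parent) pairs among the selected concepts from a
    Wikidata JSON dump (one entity per line, wrapped in a JSON array)."""
    edges = []
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the enclosing "[" / "]" lines
            entity = json.loads(line)
            if entity.get("id") not in concept_ids:
                continue
            for stmt in entity.get("claims", {}).get(SUBCLASS_OF, []):
                try:
                    parent = stmt["mainsnak"]["datavalue"]["value"]["id"]
                except KeyError:
                    continue  # statement without a concrete value
                if parent in concept_ids:
                    edges.append((entity["id"], parent))
    return edges
```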
The datasets consist of three parts: schema understanding data, schema following data, and specific domain IE data.
The datasets are released in 🤗Huggingface-KnowCoder.
The schema understanding data includes schema definition codes and schema instance codes.
The schema understanding data can be found in 🤗Schema-Understanding.
The schema definition codes are built based on the schema library, with statistical results shown in Schema Library Construction.
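For intuition, a schema instance code demonstrates how a concept is instantiated for a concrete sentence, with the annotations expressed as objects of the schema classes. A made-up example, reusing the hypothetical `Person` and `Organization` classes sketched earlier (not drawn from the released data):

```python
# The annotated knowledge in the sentence is expressed as class objects.
sentence = "Bill Gates founded Microsoft in Albuquerque."
results = [
    Person(name="Bill Gates"),
    Organization(name="Microsoft"),
]
```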
The schema following data can be found in 🤗Schema-Following.
The schema following data is constructed based on UniversalNER, InstructIE, and LSEE. The statistics of the schema following data are presented in Schema Instance Codes.
Example cases of the schema following data are shown here.
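For a rough sense of the layout, a schema following instance can be thought of as a prompt that presents schema definitions plus a task description and an input sentence, and a completion that instantiates the extracted knowledge. The exact format of the released data may differ from this sketch:

```python
# A hypothetical instruction-tuning pair for schema following.
prompt = '''
class Entity:
    """The base class for all entity types."""

class Person(Entity):
    """A human being."""

"""
This is an object-oriented programming task: some Entity classes are
defined above. Please instantiate all corresponding Entity objects in
the following sentence.
"""
sentence = "Bill Gates founded Microsoft."
'''
completion = 'results = [Person(name="Bill Gates")]'
```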
Note: Because some datasets are subject to copyright and require licenses, we cannot directly release this part of the data at the moment. If you have licenses for the restricted datasets, you can contact us via email to obtain the data.
Additionally, for specific domain Information Extraction (IE), we conduct experiments using 33 datasets: 23 for the NER task, 8 for the RE task, and 2 for the ED and EAE tasks. Specifically, under the supervised setting, we employ 18 datasets for the NER task: ACE04, ACE05, AnatEM, Broad Twitter, bc2gm, bc5cdr, CoNLL03, DIANN, FabNER, FindVehicle, GENIA, MIT-Movie, MIT-Restaurant, MultiNERD, ncbi-disease, OntoNotes 5, WikiANN, and WNUT17. For the RE task, we use 8 datasets under the supervised setting: ACE05, ADE corpus, CoNLL04, GIDS, kbp37, NYT, SciERC, and semeval RE. For the ED and EAE tasks, ACE05 and CASIE are employed.
Under the zero-shot setting, we use 7 datasets for the NER task: the 5 CrossNER subsets (AI, literature, music, politics, science), MIT-Movie, and MIT-Restaurant. For the RE task, we adopt GIDS under the zero-shot setting. For the ED and EAE tasks, CASIE is adopted under the zero-shot setting.
The detailed statistics of each dataset are shown as follows. Here, "#Type" indicates the number of types, while "#Train", "#Dev", and "#Test" denote the number of sentences in the training, development, and test sets, respectively. Below is an overview of the specific domain IE datasets by task and size. Note that the statistics for each dataset in the figure cover the combined total of the train, dev, and test sets.
After Schema Understanding, we obtain KnowCoder (SU. only).
To verify the generalization ability of KnowCoder (SU. only), we conduct few-shot experiments on 7 NER datasets.
Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
---|---|---|---|---|---|---|---|---|
LLaMA2-7B | 31.0 | 19.6 | 30.8 | 24.1 | 28.0 | 38.7 | 44.1 | 30.9 |
LLaMA2-13B | 32.6 | 25.2 | 37.5 | 36.5 | 37.0 | 60.3 | 51.7 | 40.1 |
KnowCoder-7B (SU. only) | 37.2 | 36.4 | 41.8 | 42.6 | 53.8 | 60.6 | 51.6 | 46.3↑49.8% |
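Here, ↑49.8% is the relative F1 gain over the same-size LLaMA2-7B baseline: (46.3 − 30.9) / 30.9 ≈ 49.8%. The ↑ marks in the following tables are computed the same way, against the strongest directly comparable baseline in each table.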
After Schema Understanding and Schema Following on LLaMA2, we obtain KnowCoder.
To verify the generalization ability of KnowCoder, we conduct zero-shot experiments on 9 datasets across NER, RE, and ED tasks.
Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
---|---|---|---|---|---|---|---|---|
*w/ refinement* | | | | | | | | |
InstructUIE-11B | - | - | 48.4 | 48.8 | 54.4 | 49.9 | 49.4 | - |
GoLLIE-7B | 63.0 | 43.4 | 59.1 | 62.7 | 67.8 | 57.2 | 55.5 | 58.4 |
GoLLIE-13B | 62.5 | 49.8 | 56.7 | 59.7 | 65.5 | 54.4 | 56.2 | 57.8 |
UniNER-7B refined | 59.4 | 31.2 | 62.6 | 64.0 | 66.6 | 66.3 | 69.8 | 60.0 |
*w/o refinement* | | | | | | | | |
Vicuna-7B | 6.0 | 5.3 | 12.8 | 16.1 | 17.0 | 20.5 | 13.0 | 13.0 |
Vicuna-13B | 0.9 | 0.4 | 22.7 | 22.7 | 26.6 | 27.2 | 22.0 | 17.5 |
ChatGPT | 5.3 | 32.8 | 52.4 | 39.8 | 66.6 | 68.5 | 67.0 | 47.5 |
UniNER-7B | 42.4 | 31.7 | 53.5 | 59.4 | 65.0 | 60.8 | 61.1 | 53.4 |
KnowCoder-7B | 50.0 | 48.2 | 60.3 | 61.1 | 70.0 | 72.2 | 59.1 | 60.1↑12.5% |
Dataset | SoTA | KnowCoder |
---|---|---|
GIDS (RE) | 9.9 | 25.5 |
CASIE (ED) | 59.3 | 56.3 |
Average | 34.6 | 41.9↑21.1% |
To further investigate the generalization ability of KnowCoder in low-resource scenarios, we conduct experiments that refine KnowCoder with three different partitions of the original training sets (1%, 5%, and 10% ratios) across four tasks.
Ratio | Model | NER | RE | ED | EAE | Ave. |
---|---|---|---|---|---|---|
1% | UIE-base | 82.8 | 30.8 | 41.5 | 12.8 | 42.0 |
1% | LLaMA2-7B | 72.3 | 32.1 | 35.3 | 33.3 | 43.3 |
1% | KnowCoder-7B | 79.2 | 43.3 | 50.3 | 38.5 | 52.8↑21.9% |
5% | UIE-base | 88.3 | 51.7 | 55.7 | 30.4 | 56.5 |
5% | LLaMA2-7B | 89.3 | 35.7 | 52.6 | 46.3 | 56.0 |
5% | KnowCoder-7B | 90.6 | 51.1 | 59.0 | 48.3 | 62.3↑10.3% |
10% | UIE-base | 89.6 | 59.2 | 60.3 | 36.3 | 61.4 |
10% | LLaMA2-7B | 91.2 | 48.6 | 60.7 | 52.3 | 63.2 |
10% | KnowCoder-7B | 92.2 | 53.6 | 62.2 | 55.1 | 65.8↑4.1% |
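The low-resource partitions can be reproduced in spirit with a seeded random sample of each training set; this is a sketch under an assumed uniform-sampling procedure, not necessarily the exact one used in our experiments:

```python
import random

def low_resource_subset(train_examples: list, ratio: float, seed: int = 42) -> list:
    """Return a seeded random sample containing `ratio` of the training set."""
    rng = random.Random(seed)
    k = max(1, int(len(train_examples) * ratio))
    return rng.sample(train_examples, k)

# e.g., build the 1%, 5%, and 10% partitions from a toy training set
train_examples = [f"sentence_{i}" for i in range(1000)]
partitions = {r: low_resource_subset(train_examples, r) for r in (0.01, 0.05, 0.10)}
```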
Based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder.
To further investigate the IE ability of KnowCoder, we conduct supervised experiments on four IE tasks, including NER, RE, ED, and EAE. Under the supervised evaluation, KnowCoder is further refined with 28 IE datasets.
Dataset | SoTA | KnowCoder-7B |
---|---|---|
ACE04 | 87.6 | 86.2 |
ACE05 | 89.6 | 86.1 |
AnatEM | 88.9 | 86.4 |
Broad Twitter | 79.8 | 78.3 |
CoNLL03 | 94.8 | 95.1 |
DIANN | 84.1 | 94.7 |
FabNER | 82.3 | 82.9 |
FindVehicle | 98.4 | 99.4 |
GENIA | 80.3 | 76.7 |
Movie | 90.2 | 90.6 |
Rest. | 82.6 | 81.3 |
MultiNERD | 93.9 | 96.1 |
OntoNotes 5 | 84.6 | 88.2 |
WikiANN | 85.4 | 87.0 |
WNUT17 | 54.3 | 66.4 |
bc2gm | 80.5 | 82.0 |
bc5cdr | 91.5 | 89.3 |
ncbi | 85.0 | 83.8 |
Average | 85.2 | 86.1↑1.1% |
Dataset | SoTA Model | Results | KnowCoder-7B |
---|---|---|---|
ACE05 | GoLLIE | 70.1 | 64.5 |
semeval RE | InstructUIE | 65.8 | 66.3 |
CoNLL04 | USM | 78.8 | 73.3 |
NYT | InstructUIE | 91.0 | 93.7 |
ADE corpus | InstructUIE | 82.8 | 84.3 |
kbp37 | InstructUIE | 30.6 | 73.2 |
GIDS | InstructUIE | 76.9 | 78.0 |
SciERC | USM | 37.4 | 40.0 |
Average | - | 66.7 | 71.7↑7.5% |
Model | ACE05 (ED) | ACE05 (EAE) |
---|---|---|
UIE | 73.4 | 69.3 |
USM | 69.3 | 63.3 |
Code4UIE | 37.4 | 57.0 |
InstructUIE-11B | 43.2 | 56.8 |
GoLLIE-7B | 72.2 | 66.0 |
KnowCoder-7B | 74.2 | 70.3 |
@article{li2024knowcoder,
title={KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction},
author={Li, Zixuan and Zeng, Yutao and Zuo, Yuxin and Ren, Weicheng and Liu, Wenxuan and Su, Miao and Guo, Yucan and Liu, Yantao and Li, Xiang and Hu, Zhilei and others},
journal={arXiv preprint arXiv:2403.07969},
year={2024}
}