KnowCoder

Coding Structured Knowledge into LLMs for Universal Information Extraction

Zixuan Li†*1, Yutao Zeng*, Yuxin Zuo*1, Weicheng Ren*1,
Wenxuan Liu1, Miao Su1, Yucan Guo1, Yantao Liu1, Xiang Li1, Zhilei Hu1, Long Bai1, Wei Li1, Yidan Liu1, Pan Yang,
Xiaolong Jin†1, Jiafeng Guo†1, Xueqi Cheng1

1CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences

* Co-first Authors
† Corresponding Authors

In this paper, we propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation method that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structural knowledge accurately. To achieve these, KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be modeled in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is the largest one for UIE, to the best of our knowledge. To ease the learning of LLMs, KnowCoder contains a two-phase learning framework that enhances the schema understanding ability via code pretraining and the schema following ability via instruction tuning.

🎉 News

[2024-03-11]: We released the initial version of KnowCoder!

Overview

We release KnowCoder, a powerful Large Language Model for Universal Information Extraction that injects over 30,000 types of structured knowledge into the model through code.

We evaluate KnowCoder on 33 widely used information extraction benchmarks:

- After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability, achieving a relative F1 improvement of 49.8% over LLaMA2 on NER under the few-shot setting.

- After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas, achieving relative improvements of up to 12.5% and 21.9% under the zero-shot and low-resource settings, respectively.

- Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, which achieves significant relative improvements of up to 7.5% under the supervised setting.


KnowCoder Schema

Code-style Schema Representation Method

The code-style schema representation method comprises three basic classes, namely, "Entity", "Relation", and "Event". Based on these three basic classes, we represent all the concepts in the schemas by corresponding classes, so that the instances of each concept can be represented as objects of the corresponding class. A schema representation consists of a class name, class inheritance, class comments, type hints, and class methods. A detailed explanation of each component can be found in our paper.
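To make this concrete, below is a simplified, illustrative sketch of what such class definitions look like under this method. The concept names (`Person`, `Organization`, `WorkFor`) and the exact docstring wording are our own illustration rather than verbatim samples from the schema library; see the paper for the precise format.

```python
class Entity:
    """The base class for all entity types."""
    def __init__(self, name: str):
        self.name = name


class Relation:
    """The base class for all relation types."""
    def __init__(self, head_entity: Entity, tail_entity: Entity):
        self.head_entity = head_entity
        self.tail_entity = tail_entity


class Person(Entity):
    """Description: a human being."""


class Organization(Entity):
    """Description: a social entity such as a company or an institution."""


class WorkFor(Relation):
    """Description: the head person works for the tail organization."""
    def __init__(self, head_entity: Person, tail_entity: Organization):
        # The type hints constrain which entity types may fill each
        # argument slot, encoding cross-concept constraints in the schema.
        super().__init__(head_entity, tail_entity)
```

Here, class inheritance encodes the taxonomy, class comments (docstrings) carry the concept descriptions, and type hints express the constraints among concepts.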


Schema Library Construction

We construct the code-style schema library under this schema representation method based on Wikidata (note that we use the Wikidata dump up to 2022-07-04). We select the concepts included in the existing IE datasets created from Wikidata, i.e., KELM, UniversalNER, InstructIE, and LSEE, and derive the constraints among concepts according to their co-occurrences. To construct the taxonomies, we extract the "subclass of" relations among these concepts from Wikidata (a minimal sketch of this extraction step is given after the table below). To obtain the description of a concept, we use its definition from Wikidata directly, or generate a description with GPT-4 when the Wikidata definition is missing. Finally, the constructed schema library encompasses 29,177 entity types, 876 relation types, and 519 event types. The detailed statistics of the schema library are shown in the table below. Here, "#Type" denotes the total number of types, "#Type w/ desc." indicates the count of types with descriptions, and "#Type w/o desc." signifies the count of types without descriptions.

The schema library data can be found in 🤗Schema-Library.

| Task | #Type | #Type w/ desc. | #Type w/o desc. |
|------|-------|----------------|-----------------|
| NER  | 29,177 | 19,856 | 9,321 |
| RE   | 876 | 840 | 36 |
| EE   | 519 | 515 | 4 |
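As a sketch of the taxonomy-construction step referenced above, the following shows one way the "subclass of" (P279) edges among the selected concepts could be collected from a Wikidata JSON dump. The function name and filtering logic are illustrative assumptions, not the actual construction pipeline.

```python
import gzip
import json


def collect_subclass_edges(dump_path: str, concept_ids: set[str]) -> list[tuple[str, str]]:
    """Collect (child, parent) "subclass of" (P279) edges among selected
    concepts from a Wikidata JSON dump (one entity per line)."""
    edges = []
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the enclosing "[" and "]" lines of the dump
            entity = json.loads(line)
            child = entity.get("id")
            if child not in concept_ids:
                continue
            for claim in entity.get("claims", {}).get("P279", []):
                try:
                    parent = claim["mainsnak"]["datavalue"]["value"]["id"]
                except KeyError:
                    continue  # skip claims without a concrete item value
                if parent in concept_ids:
                    edges.append((child, parent))
    return edges
```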

KnowCoder Datasets

The datasets consist of three parts: schema understanding data, schema following data, and specific domain IE data.

The datasets are released in 🤗Huggingface-KnowCoder.

Schema Understanding Data

The schema understanding data includes schema definition codes and schema instance codes.

The schema understanding data can be found in 🤗Schema-Understanding.

Schema Definition Codes

The schema definition codes are built based on the schema library, with statistical results shown in Schema Library Construction.

Schema Instance Codes

The schema instance codes are constructed based on KELM. The statistical results are as follows.

[Figure: statistics of the schema instance codes and the schema following data]

Examples of schema instance codes in the schema understanding data are shown in this file.
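For intuition, a schema instance code pairs a sentence with objects instantiated from the corresponding schema classes. The example below reuses the illustrative classes from the sketch above; the sentence and extraction are hypothetical, not a verbatim sample from the released data.

```python
sentence = "Steve Jobs worked for Apple."

# Objects instantiated from the schema classes defined earlier,
# representing the knowledge expressed in the sentence.
steve_jobs = Person(name="Steve Jobs")
apple = Organization(name="Apple")
work_for = WorkFor(head_entity=steve_jobs, tail_entity=apple)
```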

Schema Following Data

The schema following data can be found in 🤗Schema-Following.

The schema following data is constructed based on UniversalNER, InstructIE, and LSEE. The statistics of the schema following data are presented in Schema Instance Codes.

Examples of schema following data are shown here.
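Conceptually, each schema following sample pairs an instruction, which contains the candidate schema definitions and the input sentence, with the extraction code the model is expected to generate. The prompt wording below is a hypothetical reconstruction for illustration; the exact template in the released data may differ.

```python
# A hypothetical instruction/response pair for schema following.
instruction = '''
class Person(Entity):
    """Description: a human being."""

class Organization(Entity):
    """Description: a social entity such as a company or an institution."""

"""
This is an object-oriented programming task: extract entities of the
types defined above from the sentence below.
"""
sentence = "Steve Jobs worked for Apple."
'''

expected_response = '''
results = [
    Person(name="Steve Jobs"),
    Organization(name="Apple"),
]
'''
```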

Specific Domain IE Data

Note: Because some datasets have copyright requirements and require licenses, we cannot directly release this part of the data at the moment. If you hold a license for the restricted datasets, you can contact us via email to obtain the data.

Additionally, for specific domain Information Extraction (IE), we conduct experiments utilizing 33 datasets, comprising 23 datasets for the NER task, 8 datasets for the RE task, and 2 datasets for the ED and EAE tasks. Specifically, under the supervised setting, we employ 18 datasets for the NER task, including ACE04, ACE05, AnatEM, Broad Twitter, bc2gm, bc5cdr, CoNLL03, DIANN, FabNER, FindVehicle, GENIA, MIT-Movie, MIT-Restaurant, MultiNERD, ncbi-disease, Ontonotes5, WikiANN, and WNUT17. For the RE task, we utilize 8 datasets under the supervised setting, including ACE05, ADE corpus, CoNLL04, GIDS, kbp37, NYT, SciERC, and semeval RE. For the ED and EAE tasks, ACE05 and CASIE are employed.

Under the zero-shot setting, we take 7 datasets for the NER task, comprising 5 CrossNER subsets (AI, literature, music, politics, science), MIT-Movie, and MIT-Restaurant. For the RE task, we adopt GIDS under the zero-shot setting. For the ED and EAE tasks, CASIE is adopted under the zero-shot setting.

The detailed statistics of each dataset are shown below. Here, "#Type" indicates the number of types, while "#Train", "#Dev", and "#Test" denote the number of sentences in the training, development, and testing sets, respectively. The figure below gives an overview of the specific domain IE datasets by task and size; note that the statistics for each dataset in the figure cover the train, dev, and test sets combined.

[Figure: overview of the specific domain IE datasets by task and size]

Results

Results on NER under the few-shot setting

After the Schema Understanding phase, we obtain KnowCoder (SU. only).

To verify the generalization ability of KnowCoder (SU. only), we conduct few-shot experiments on 7 NER datasets.

| Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
|-------|-------|-------|----|---------|-------|----------|---------|---------|
| LLaMA2-7B | 31.0 | 19.6 | 30.8 | 24.1 | 28.0 | 38.7 | 44.1 | 30.9 |
| LLaMA2-13B | 32.6 | 25.2 | 37.5 | 36.5 | 37.0 | 60.3 | 51.7 | 40.1 |
| KnowCoder-7B (SU. only) | 37.2 | 36.4 | 41.8 | 42.6 | 53.8 | 60.6 | 51.6 | 46.3 (↑49.8%) |

Results under zero-shot setting

After the Schema Understanding and Schema Following phases on LLaMA2, we obtain KnowCoder.

To verify the generalization ability of KnowCoder, we conduct zero-shot experiments on 9 datasets across NER, RE, and ED tasks.

Results on NER

| Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
|-------|-------|-------|----|---------|-------|----------|---------|---------|
| *w/ refinement* | | | | | | | | |
| InstructUIE-11B | - | - | 48.4 | 48.8 | 54.4 | 49.9 | 49.4 | - |
| GoLLIE-7B | 63.0 | 43.4 | 59.1 | 62.7 | 67.8 | 57.2 | 55.5 | 58.4 |
| GoLLIE-13B | 62.5 | 49.8 | 56.7 | 59.7 | 65.5 | 54.4 | 56.2 | 57.8 |
| UniNER-7B (refined) | 59.4 | 31.2 | 62.6 | 64.0 | 66.6 | 66.3 | 69.8 | 60.0 |
| *w/o refinement* | | | | | | | | |
| Vicuna-7B | 6.0 | 5.3 | 12.8 | 16.1 | 17.0 | 20.5 | 13.0 | 13.0 |
| Vicuna-13B | 0.9 | 0.4 | 22.7 | 22.7 | 26.6 | 27.2 | 22.0 | 17.5 |
| ChatGPT | 5.3 | 32.8 | 52.4 | 39.8 | 66.6 | 68.5 | 67.0 | 47.5 |
| UniNER-7B | 42.4 | 31.7 | 53.5 | 59.4 | 65.0 | 60.8 | 61.1 | 53.4 |
| KnowCoder-7B | 50.0 | 48.2 | 60.3 | 61.1 | 70.0 | 72.2 | 59.1 | 60.1 (↑12.5%) |

Results on RE and ED

| Dataset | SoTA | KnowCoder |
|---------|------|-----------|
| GIDS (RE) | 9.9 | 25.5 |
| CASIE (ED) | 59.3 | 56.3 |
| Average | 34.6 | 41.9 (↑21.1%) |

Results under low-resource setting

To further investigate the generalization ability of KnowCoder on IE tasks in low-resource scenarios, we conduct experiments that refine KnowCoder with three different partitions of the original training sets (1%, 5%, and 10% ratios) across four tasks.

| Ratio | Model | NER | RE | ED | EAE | Avg. |
|-------|-------|-----|----|----|-----|------|
| 1% | UIE-base | 82.8 | 30.8 | 41.5 | 12.8 | 42.0 |
| 1% | LLaMA2-7B | 72.3 | 32.1 | 35.3 | 33.3 | 43.3 |
| 1% | KnowCoder-7B | 79.2 | 43.3 | 50.3 | 38.5 | 52.8 (↑21.9%) |
| 5% | UIE-base | 88.3 | 51.7 | 55.7 | 30.4 | 56.5 |
| 5% | LLaMA2-7B | 89.3 | 35.7 | 52.6 | 46.3 | 56.0 |
| 5% | KnowCoder-7B | 90.6 | 51.1 | 59.0 | 48.3 | 62.3 (↑10.3%) |
| 10% | UIE-base | 89.6 | 59.2 | 60.3 | 36.3 | 61.4 |
| 10% | LLaMA2-7B | 91.2 | 48.6 | 60.7 | 52.3 | 63.2 |
| 10% | KnowCoder-7B | 92.2 | 53.6 | 62.2 | 55.1 | 65.8 (↑4.1%) |

Results under supervised setting

Based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder.

To further investigate the IE ability of KnowCoder, we conduct supervised experiments on four IE tasks, including NER, RE, ED, and EAE. Under the supervised evaluation, KnowCoder is further refined with 28 IE datasets.

Results on NER

| Dataset | SoTA | KnowCoder-7B |
|---------|------|--------------|
| ACE04 | 87.6 | 86.2 |
| ACE05 | 89.6 | 86.1 |
| AnatEM | 88.9 | 86.4 |
| Broad Twitter | 79.8 | 78.3 |
| CoNLL03 | 94.8 | 95.1 |
| DIANN | 84.1 | 94.7 |
| FabNER | 82.3 | 82.9 |
| FindVehicle | 98.4 | 99.4 |
| GENIA | 80.3 | 76.7 |
| Movie | 90.2 | 90.6 |
| Rest. | 82.6 | 81.3 |
| MultiNERD | 93.9 | 96.1 |
| OntoNotes 5 | 84.6 | 88.2 |
| WikiANN | 85.4 | 87.0 |
| WNUT17 | 54.3 | 66.4 |
| bc2gm | 80.5 | 82.0 |
| bc5cdr | 91.5 | 89.3 |
| ncbi | 85.0 | 83.8 |
| Average | 85.2 | 86.1 (↑1.1%) |

Results on RE

| Dataset | SoTA Model | SoTA Result | KnowCoder-7B |
|---------|------------|-------------|--------------|
| ACE05 | GoLLIE | 70.1 | 64.5 |
| semeval RE | InstructUIE | 65.8 | 66.3 |
| CoNLL04 | USM | 78.8 | 73.3 |
| NYT | InstructUIE | 91.0 | 93.7 |
| ADE corpus | InstructUIE | 82.8 | 84.3 |
| kbp37 | InstructUIE | 30.6 | 73.2 |
| GIDS | InstructUIE | 76.9 | 78.0 |
| SciERC | USM | 37.4 | 40.0 |
| Average | - | 66.7 | 71.7 (↑7.5%) |

Results on ED, EAE

| Model | ACE05 (ED) | ACE05 (EAE) |
|-------|------------|-------------|
| UIE | 73.4 | 69.3 |
| USM | 69.3 | 63.3 |
| Code4UIE | 37.4 | 57.0 |
| InstructUIE-11B | 43.2 | 56.8 |
| GoLLIE-7B | 72.2 | 66.0 |
| KnowCoder-7B | 74.2 | 70.3 |

Citation

@article{li2024knowcoder,
  title={KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction},
  author={Li, Zixuan and Zeng, Yutao and Zuo, Yuxin and Ren, Weicheng and Liu, Wenxuan and Su, Miao and Guo, Yucan and Liu, Yantao and Li, Xiang and Hu, Zhilei and others},
  journal={arXiv preprint arXiv:2403.07969},
  year={2024}
}