In this paper, we propose KnowCoder, a Large Language Model (LLM) for Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation method that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be modeled in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning of LLMs, KnowCoder employs a two-phase learning framework that enhances the schema understanding ability via code pretraining and the schema following ability via instruction tuning.
[2024-03-11]: We released the initial version of KnowCoder!
We released KnowCoder, a powerful Large Language Model for Universal Information Extraction that injects tens of thousands of types of structured knowledge through code.
KnowCoder has been evaluated on 33 widely used information extraction benchmarks:
- After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability, achieving a 49.8% relative F1 improvement over LLaMA2 on NER under the few-shot setting.
- After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas, achieving relative improvements of up to 12.5% and 21.9% under the zero-shot and low-resource settings, respectively.
- Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, yielding relative improvements of up to 7.5% under the supervised setting.
The code-style schema representation method comprises three basic classes, namely "Entity", "Relation", and "Event". Based on these three basic classes, we represent all the concepts in the schemas by corresponding classes, so that the instances of each concept can be represented by objects of the corresponding class. Each schema consists of a class name, class inheritance, class comments, type hints, and class methods. The detailed explanation of each component can be found in our paper.
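To make this concrete, below is a minimal sketch of what such code-style schemas can look like; the concept names (`Person`, `Organization`, `WorkFor`) are hypothetical examples rather than classes taken from our schema library. Class comments carry the concept descriptions, and the typed `__init__` arguments express constraints among types:

```python
class Entity:
    """The base class for all entity types."""
    def __init__(self, name: str):
        self.name = name

class Person(Entity):
    """A human being. The class comment holds the concept description."""

class Organization(Entity):
    """A group of people organized for a collective purpose."""

class Relation:
    """The base class for all relation types."""

class WorkFor(Relation):
    """A person works for an organization. The type hints constrain
    which entity types the relation may connect."""
    def __init__(self, head_entity: Person, tail_entity: Organization):
        self.head_entity = head_entity
        self.tail_entity = tail_entity
```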
We construct the code-style schema library under this schema representation method based on Wikidata (we use the Wikidata dump up to 2022-07-04). We select the concepts included in the existing IE datasets created from Wikidata, i.e., KELM, UniversalNER, InstructIE, and LSEE, and derive the constraints among concepts according to their co-occurrences. To construct the taxonomies, we extract the "subclass of" relations among these concepts from Wikidata. To obtain the description of a concept, we use its definition from Wikidata directly, or generate a description with GPT-4 if its Wikidata definition is missing. Finally, the constructed schema library encompasses 29,177 entity types, 876 relation types, and 519 event types, i.e., over 30,000 types in total. The detailed statistics of the schema library are shown in the table below. Here, "#Type" denotes the total number of types, "#Type w/ desc." indicates the count of types with descriptions, and "#Type w/o desc." signifies the count of types without descriptions.
The schema library data can be found in 🤗Schema-Library.
Task | #Type | #Type w/ desc. | #Type w/o desc. |
---|---|---|---|
NER | 29,177 | 19,856 | 9,321 |
RE | 876 | 840 | 36 |
EE | 519 | 515 | 4 |
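As a rough illustration of the taxonomy-construction step described above, the sketch below collects "subclass of" (Wikidata property P279) edges among a set of selected concepts from a Wikidata JSON dump. The function and dump handling here are illustrative assumptions, not the exact code we used:

```python
import json

SUBCLASS_OF = "P279"  # Wikidata property id for "subclass of"

def extract_taxonomy(dump_path: str, concept_ids: set) -> list:
    """Collect (child, parent) pairs among the selected concepts from a
    Wikidata JSON dump (one entity per line, wrapped in a JSON array)."""
    edges = []
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the enclosing "[" / "]" lines
            entity = json.loads(line)
            if entity.get("id") not in concept_ids:
                continue
            for stmt in entity.get("claims", {}).get(SUBCLASS_OF, []):
                try:
                    parent = stmt["mainsnak"]["datavalue"]["value"]["id"]
                except KeyError:
                    continue  # statement without a concrete value
                if parent in concept_ids:
                    edges.append((entity["id"], parent))
    return edges
```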
The datasets consist of three parts: schema understanding data, schema following data, and specific domain IE data.
The datasets are released in 🤗Huggingface-KnowCoder.
The schema understanding data includes schema definition codes and schema instance codes.
The schema understanding data can be found in 🤗Schema-Understanding.
The schema definition codes are built based on the schema library, with statistical results shown in Schema Library Construction.
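For intuition, a schema instance code demonstrates how a concept is instantiated for a concrete sentence, with the annotations expressed as objects of the schema classes. A made-up example, reusing the hypothetical `Person` and `Organization` classes sketched earlier (not drawn from the released data):

```python
# The annotated knowledge in the sentence is expressed as class objects.
sentence = "Bill Gates founded Microsoft in Albuquerque."
results = [
    Person(name="Bill Gates"),
    Organization(name="Microsoft"),
]
```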
The schema following data can be found in 🤗Schema-Following.
The schema following data is constructed based on UniversalNER, InstructIE, and LSEE. The statistics of the schema following data are presented in Schema Instance Codes.
Example cases of the schema following data are shown here.
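For a rough sense of the layout, a schema following instance can be thought of as a prompt that presents schema definitions plus a task description and an input sentence, and a completion that instantiates the extracted knowledge. The exact format of the released data may differ from this sketch:

```python
# A hypothetical instruction-tuning pair for schema following.
prompt = '''
class Entity:
    """The base class for all entity types."""

class Person(Entity):
    """A human being."""

"""
This is an object-oriented programming task: some Entity classes are
defined above. Please instantiate all corresponding Entity objects in
the following sentence.
"""
sentence = "Bill Gates founded Microsoft."
'''
completion = 'results = [Person(name="Bill Gates")]'
```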
Note: Because some datasets are subject to copyright and require licenses, we cannot directly release this part of the data at the moment. If you have licenses for the restricted datasets, you can contact us via email to obtain the data.
Additionally, for specific domain Information Extraction (IE), we conduct experiments using 33 datasets: 23 for the NER task, 8 for the RE task, and 2 for the ED and EAE tasks. Specifically, under the supervised setting, we employ 18 datasets for the NER task: ACE04, ACE05, AnatEM, Broad Twitter, bc2gm, bc5cdr, CoNLL03, DIANN, FabNER, FindVehicle, GENIA, MIT-Movie, MIT-Restaurant, MultiNERD, ncbi-disease, OntoNotes 5, WikiANN, and WNUT17. For the RE task, we use 8 datasets under the supervised setting: ACE05, ADE corpus, CoNLL04, GIDS, kbp37, NYT, SciERC, and semeval RE. For the ED and EAE tasks, ACE05 and CASIE are employed.
Under the zero-shot setting, we use 7 datasets for the NER task: the 5 CrossNER subsets (AI, literature, music, politics, science), MIT-Movie, and MIT-Restaurant. For the RE task, we adopt GIDS under the zero-shot setting. For the ED and EAE tasks, CASIE is adopted under the zero-shot setting.
The detailed statistics of each dataset are shown as follows. Here, "#Type" indicates the number of types, while "#Train", "#Dev", and "#Test" denote the number of sentences in the training, development, and test sets, respectively. Below is an overview of the specific domain IE datasets by task and size. Note that the statistics for each dataset in the figure cover the combined total of the train, dev, and test sets.
After Schema Understanding, we obtain KnowCoder (SU. only).
To verify the generalization ability of KnowCoder (SU. only), we conduct few-shot experiments on 7 NER datasets.
Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
---|---|---|---|---|---|---|---|---|
LLaMA2-7B | 31.0 | 19.6 | 30.8 | 24.1 | 28.0 | 38.7 | 44.1 | 30.9 |
LLaMA2-13B | 32.6 | 25.2 | 37.5 | 36.5 | 37.0 | 60.3 | 51.7 | 40.1 |
KnowCoder-7B (SU. only) | 37.2 | 36.4 | 41.8 | 42.6 | 53.8 | 60.6 | 51.6 | 46.3↑49.8% |
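Here, ↑49.8% is the relative F1 gain over the same-size LLaMA2-7B baseline: (46.3 − 30.9) / 30.9 ≈ 49.8%. The ↑ marks in the following tables are computed the same way, against the strongest directly comparable baseline in each table.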
After Schema Understanding and Schema Following on LLaMA2, we obtain KnowCoder.
To verify the generalization ability of KnowCoder, we conduct zero-shot experiments on 9 datasets across NER, RE, and ED tasks.
Model | Movie | Rest. | AI | Litera. | Music | Politics | Science | Average |
---|---|---|---|---|---|---|---|---|
*w/ refinement* | | | | | | | | |
InstructUIE-11B | - | - | 48.4 | 48.8 | 54.4 | 49.9 | 49.4 | - |
GoLLIE-7B | 63.0 | 43.4 | 59.1 | 62.7 | 67.8 | 57.2 | 55.5 | 58.4 |
GoLLIE-13B | 62.5 | 49.8 | 56.7 | 59.7 | 65.5 | 54.4 | 56.2 | 57.8 |
UniNER-7B refined | 59.4 | 31.2 | 62.6 | 64.0 | 66.6 | 66.3 | 69.8 | 60.0 |
*w/o refinement* | | | | | | | | |
Vicuna-7B | 6.0 | 5.3 | 12.8 | 16.1 | 17.0 | 20.5 | 13.0 | 13.0 |
Vicuna-13B | 0.9 | 0.4 | 22.7 | 22.7 | 26.6 | 27.2 | 22.0 | 17.5 |
ChatGPT | 5.3 | 32.8 | 52.4 | 39.8 | 66.6 | 68.5 | 67.0 | 47.5 |
UniNER-7B | 42.4 | 31.7 | 53.5 | 59.4 | 65.0 | 60.8 | 61.1 | 53.4 |
KnowCoder-7B | 50.0 | 48.2 | 60.3 | 61.1 | 70.0 | 72.2 | 59.1 | 60.1↑12.5% |
Dataset | SoTA | KnowCoder |
---|---|---|
GIDS (RE) | 9.9 | 25.5 |
CASIE (ED) | 59.3 | 56.3 |
Average | 34.6 | 41.9↑21.1% |
To further investigate the generalization ability of KnowCoder in low-resource scenarios, we conduct experiments that refine KnowCoder with three different partitions of the original training sets (1%, 5%, and 10% ratios) across four tasks.
Ratio | Model | NER | RE | ED | EAE | Ave. |
---|---|---|---|---|---|---|
1% | UIE-base | 82.8 | 30.8 | 41.5 | 12.8 | 42.0 |
1% | LLaMA2-7B | 72.3 | 32.1 | 35.3 | 33.3 | 43.3 |
1% | KnowCoder-7B | 79.2 | 43.3 | 50.3 | 38.5 | 52.8↑21.9% |
5% | UIE-base | 88.3 | 51.7 | 55.7 | 30.4 | 56.5 |
5% | LLaMA2-7B | 89.3 | 35.7 | 52.6 | 46.3 | 56.0 |
5% | KnowCoder-7B | 90.6 | 51.1 | 59.0 | 48.3 | 62.3↑10.3% |
10% | UIE-base | 89.6 | 59.2 | 60.3 | 36.3 | 61.4 |
10% | LLaMA2-7B | 91.2 | 48.6 | 60.7 | 52.3 | 63.2 |
10% | KnowCoder-7B | 92.2 | 53.6 | 62.2 | 55.1 | 65.8↑4.1% |
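The low-resource partitions can be reproduced in spirit with a seeded random sample of each training set; this is a sketch under an assumed uniform-sampling procedure, not necessarily the exact one used in our experiments:

```python
import random

def low_resource_subset(train_examples: list, ratio: float, seed: int = 42) -> list:
    """Return a seeded random sample containing `ratio` of the training set."""
    rng = random.Random(seed)
    k = max(1, int(len(train_examples) * ratio))
    return rng.sample(train_examples, k)

# e.g., build the 1%, 5%, and 10% partitions from a toy training set
train_examples = [f"sentence_{i}" for i in range(1000)]
partitions = {r: low_resource_subset(train_examples, r) for r in (0.01, 0.05, 0.10)}
```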
Based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder.
To further investigate the IE ability of KnowCoder, we conduct supervised experiments on four IE tasks, including NER, RE, ED, and EAE. Under the supervised evaluation, KnowCoder is further refined with 28 IE datasets.
Dataset | SoTA | KnowCoder-7B |
---|---|---|
ACE04 | 87.6 | 86.2 |
ACE05 | 89.6 | 86.1 |
AnatEM | 88.9 | 86.4 |
Broad Twitter | 79.8 | 78.3 |
CoNLL03 | 94.8 | 95.1 |
DIANN | 84.1 | 94.7 |
FabNER | 82.3 | 82.9 |
FindVehicle | 98.4 | 99.4 |
GENIA | 80.3 | 76.7 |
Movie | 90.2 | 90.6 |
Rest. | 82.6 | 81.3 |
MultiNERD | 93.9 | 96.1 |
OntoNotes 5 | 84.6 | 88.2 |
WikiANN | 85.4 | 87.0 |
WNUT17 | 54.3 | 66.4 |
bc2gm | 80.5 | 82.0 |
bc5cdr | 91.5 | 89.3 |
ncbi | 85.0 | 83.8 |
Average | 85.2 | 86.1↑1.1% |
Dataset | SoTA Model | Results | KnowCoder-7B |
---|---|---|---|
ACE05 | GoLLIE | 70.1 | 64.5 |
semeval RE | InstructUIE | 65.8 | 66.3 |
CoNLL04 | USM | 78.8 | 73.3 |
NYT | InstructUIE | 91.0 | 93.7 |
ADE corpus | InstructUIE | 82.8 | 84.3 |
kbp37 | InstructUIE | 30.6 | 73.2 |
GIDS | InstructUIE | 76.9 | 78.0 |
SciERC | USM | 37.4 | 40.0 |
Average | - | 66.7 | 71.7↑7.5% |
Model | ACE05 (ED) | ACE05 (EAE) |
---|---|---|
UIE | 73.4 | 69.3 |
USM | 69.3 | 63.3 |
Code4UIE | 37.4 | 57.0 |
InstructUIE-11B | 43.2 | 56.8 |
GoLLIE-7B | 72.2 | 66.0 |
KnowCoder-7B | 74.2 | 70.3 |
@article{li2024knowcoder,
title={KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction},
author={Li, Zixuan and Zeng, Yutao and Zuo, Yuxin and Ren, Weicheng and Liu, Wenxuan and Su, Miao and Guo, Yucan and Liu, Yantao and Li, Xiang and Hu, Zhilei and others},
journal={arXiv preprint arXiv:2403.07969},
year={2024}
}