Abstract
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate the corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen agentic reasoning ability.
In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning over KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via multi-stage reinforcement learning with an easy-to-hard curriculum.
To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard.
Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities. Our code can be found at https://github.com/ICT-GoKnow/KnowCoder-A1.
Highlights
Key contributions of KnowCoder-A1 at a glance.
Outcome-only Supervision
Learn KBQA agents directly from success signals, without relying on process supervision or gold trajectories, enabling autonomous exploration.
Curriculum RL
Multi-stage curriculum with an \(F_{0.5} \rightarrow F_{1}\) reward schedule, mitigating reward sparsity and discouraging reward hacking.
Data-efficient SOTA
Up to 11.1% relative improvement on the GrailQA zero-shot subset using only about 1/12 of the training data of previous SOTA methods.
🤔 Existing Limitations
Knowledge Base Question Answering (KBQA) aims to provide precise answers by leveraging structured KBs, yet still struggles with complex real-world questions. Recent "agentic" approaches show promise, but most rely on process supervision for fine-tuning, where models are trained to strictly follow ideal trajectories decomposed from "gold" logical forms (e.g., SPARQL).
This strong reliance on process supervision leads to several key limitations:
- Limited Robustness. Training trajectories are idealized and error-free. As a result, agents become fragile when facing realistic noise (e.g., tool failures, empty results) and struggle to recover from mistakes during interaction with the KB.
- Limited Flexibility. Many questions admit multiple valid reasoning paths. Learning only from a single set of "gold" trajectories restricts behavioral diversity and can prevent the agent from discovering alternative, potentially more efficient reasoning strategies. As illustrated in the introductory figure, when the model sees an example with a "high school" type constraint, it insists on finding a predicate meaning "notable type" and applying the type constraint. Yet more efficient solutions exist, such as using common sense to recognize entities of the desired type from the candidate set, or calling SearchTypes to directly constrain entity types.
These issues highlight an open challenge: how can KBQA agents learn robust, flexible agentic reasoning without being tightly constrained by hand-crafted, process-supervised trajectories?
💡 KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision
We propose KnowCoder-A1, a framework that enables LLMs to perform robust agentic reasoning over knowledge bases by combining an SFT-based cold start with outcome-only, multi-stage curriculum reinforcement learning.
The framework consists of three stages:
- Agentic Initialization (Sec. 4.1). We first instantiate KnowCoder-A1 as a ReAct-style KBQA agent that interleaves natural-language reasoning with KB tool calls. The KB and executor are formalized as an environment \(\mathcal{M} = (\mathcal{G}, \texttt{Exec})\), and the agent is equipped with a small set of tools such as SearchTypes, SearchGraphPatterns, and ExecuteSPARQL to retrieve types, local graph patterns, and answers (a minimal interface sketch follows this list).
- Cold-start SFT Stage (Sec. 4.2). We then perform supervised fine-tuning on a compact set of high-quality trajectories collected from strong LLMs via outcome-based rejection sampling, enforcing exact match and evidence grounding (see the rejection-sampling sketch after this list). Multi-turn SFT on these trajectories endows the model with basic long-horizon reasoning and tool-use skills, yielding a stable cold-start policy that already follows the agent format.
- Curriculum RL with Outcome-only Supervision (Sec. 4.3). Starting from the cold-start policy, we optimize the agent using Group Relative Policy Optimization (GRPO) with outcome-only rewards. A curriculum reward combines a lightweight format reward with an answer reward based on \(F_{\beta}\), gradually tightening the objective from precision-leaning \(F_{0.5}\) to balanced \(F_{1}\) (see the reward sketch below). This multi-stage RL phase encourages autonomous exploration, mitigates reward sparsity, and improves both robustness and answer completeness.
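To make the initialization concrete, here is a minimal Python sketch of the environment \(\mathcal{M} = (\mathcal{G}, \texttt{Exec})\). The endpoint URL, the result limit, and the reduction of the search tools to plain SPARQL retrieval (without semantic ranking) are our simplifying assumptions, not the paper's implementation:

    # Minimal environment sketch: Exec runs SPARQL over the KB graph G.
    # Assumption: a local Freebase mirror served at ENDPOINT.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://localhost:3001/sparql"  # hypothetical local endpoint
    PREFIXES = (
        "PREFIX ns: <http://rdf.freebase.com/ns/>\n"
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
    )

    def execute_sparql(sparql: str, limit: int = 50) -> list[tuple]:
        """ExecuteSPARQL tool: return up to `limit` rows of bound values."""
        client = SPARQLWrapper(ENDPOINT)
        client.setQuery(PREFIXES + sparql)
        client.setReturnFormat(JSON)
        rows = client.query().convert()["results"]["bindings"]
        return [tuple(cell["value"] for cell in row.values()) for row in rows[:limit]]

    # SearchTypes / SearchGraphPatterns would wrap execute_sparql with schema
    # retrieval plus semantic ranking against the `query`/`semantic` argument;
    # we register only the executor here.
    TOOLS = {"ExecuteSPARQL": execute_sparql}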
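The cold-start data collection reduces to a simple filter over sampled rollouts. The per-question budget `k`, the trajectory fields, and the grounding check below are hypothetical placeholders for the paper's actual pipeline:

    # Outcome-based rejection sampling sketch for cold-start SFT data.
    # `sample_trajectory` is a hypothetical rollout function over a strong LLM;
    # each trajectory records its tool observations and a final answer set.
    def collect_cold_start(questions, sample_trajectory, k: int = 8) -> list:
        kept = []
        for q in questions:
            for _ in range(k):  # at most k rollouts per question
                traj = sample_trajectory(q["text"])
                answer, gold = set(traj["answer"]), set(q["gold"])
                observed = {v for result in traj["tool_results"] for v in result}
                # Keep only trajectories whose answer exactly matches the gold
                # set (exact match) and appears in tool outputs (grounding).
                if answer == gold and answer <= observed:
                    kept.append(traj)
                    break  # one high-quality trajectory per question
        return kept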
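Below is a minimal sketch of the outcome-only curriculum reward and the group-relative advantages used by GRPO. The format-bonus weight (0.1) and the two-stage \(\beta\) switch are illustrative assumptions; only the \(F_{0.5} \rightarrow F_{1}\) progression comes from the paper:

    # Curriculum outcome reward: format bonus + F_beta over answer sets,
    # with beta tightening from 0.5 (precision-leaning) to 1.0 (balanced).
    import statistics

    def f_beta(pred: set, gold: set, beta: float) -> float:
        """F_beta = (1 + b^2) * P * R / (b^2 * P + R)."""
        tp = len(pred & gold)
        if not pred or not gold or tp == 0:
            return 0.0
        p, r = tp / len(pred), tp / len(gold)
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r)

    def outcome_reward(pred, gold, well_formatted: bool, stage: int) -> float:
        beta = 0.5 if stage == 0 else 1.0     # easy-to-hard reward schedule
        fmt = 0.1 if well_formatted else 0.0  # assumed lightweight format bonus
        return fmt + f_beta(set(pred), set(gold), beta)

    def group_advantages(rewards: list[float]) -> list[float]:
        """GRPO: normalize rewards within one group of rollouts per question."""
        mu, sd = statistics.mean(rewards), statistics.pstdev(rewards)
        return [(r - mu) / sd if sd else 0.0 for r in rewards]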
Evaluation Results
We evaluate KnowCoder-A1 on three mainstream KBQA benchmarks – WebQSP, CWQ, and GrailQA – using F1 as the main metric, supplemented by RHits@1 and EM. We compare against both prompting-based methods (e.g., IO, CoT, Interactive-KBQA) and fine-tuning-based methods (e.g., RnG-KBQA, TIARA, SFT-traj., KBQA-o1, MCTS-KBQA).
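For reference, the answer-level metrics can be computed over predicted and gold answer sets as sketched below. This is our reading of EM and RHits@1 (F1 is the \(\beta = 1\) case of `f_beta` above); the benchmarks' official scripts may differ in detail:

    def exact_match(pred: set, gold: set) -> bool:
        """EM: the predicted answer set equals the gold set exactly."""
        return pred == gold

    def rhits_at_1(pred: list, gold: set) -> float:
        """RHits@1: expected hit rate when one prediction is drawn uniformly."""
        return sum(a in gold for a in pred) / len(pred) if pred else 0.0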
Key Performances.
- State-of-the-art in low-resource KBQA. Across all three datasets, KnowCoder-A1 consistently outperforms directly comparable low-resource agentic baselines, while approaching the performance of fully supervised systems that rely on much larger annotated datasets.
- Data-efficient yet strong on GrailQA. On GrailQA, KnowCoder-A1 reaches an F1 score of 80.5, a 3.3% relative improvement over the previous state-of-the-art KBQA-o1, while using only about 1/12 of its training data (~3.5k vs. 43.8k samples). On the zero-shot subset, it achieves up to 11.1% relative gain, showing strong generalization to unseen question types.
- Efficient supervision and inference. KnowCoder-A1 relies purely on outcome-only supervision (no process-annotated trajectories), and uses a comparable or smaller number of outcome-supervised samples than MCTS-based methods. At inference time, it performs a single linear reasoning trajectory instead of expensive multi-sampling, reducing latency by roughly 3.2-6×.
Further Analysis
- Training dynamics under outcome-only supervision: reward rises while response length, turns, and invalid tool calls all decrease.
- Evolution of robustness and flexibility: over training, more rollouts succeed and the agent issues more diverse SPARQL queries per question.
- Impact of process reward: adding penalties on intermediate behaviors, even intuitively negative ones, degrades final performance.
Case Study
We illustrate how KnowCoder-A1 behaves with two qualitative examples, showing robust recovery from errors and flexible reasoning trajectories.
Example 1: Robust Reasoning
Question: who plays london tipton in suite life on deck?

**Think:** I need to find the actor who plays London Tipton in "The Suite Life on Deck". I will start by searching for predicates related to "actor" or "performance" for the entity "London Tipton (m.07g8r3)".

**Tool:**

    SearchGraphPatterns(
        sparql = "SELECT ?e WHERE { VALUES ?e {ns:m.07g8r3} }",
        semantic = "actor/performer"
    )

**Result:** [..., (?e, film.film_character.portrayed_in_films -> film.performance.actor, "Brenda Song"), ...]

**Think:** (Fail Try 1) The previous query returned no results. This might be due to the structure of the predicates or the way the entities are linked. I will try a different approach by searching for predicates related to "actor" or "performance" ...

**Tool:**

    ExecuteSPARQL(
        sparql = "SELECT ?actor WHERE {
            VALUES ?e1 {ns:m.07g8r3} .
            VALUES ?e2 {ns:m.03mj4jm} .
            ?e1 ns:film.film_character.portrayed_in_films ?cvt .
            ?cvt ns:film.performance.actor ?actor .
            ?cvt ns:film.performance.film ?e2 .
        }"
    )

**Result:** []

**Think:** The previous queries have not yielded the correct result. I need to refine my approach. The question asks for... I will try a different strategy... I will try to find the actor who played London Tipton in any film and then filter for "The Suite Life on Deck".

**Tool:**

    ExecuteSPARQL(
        sparql = "SELECT ?actor WHERE {
            VALUES ?e {ns:m.07g8r3} .
            ?e ns:film.film_character.portrayed_in_films ?cvt .
            ?cvt ns:film.performance.actor ?actor .
        }"
    )

**Result:** ["Brenda Song"]

**Answer:** Brenda Song
Example 2: Flexible Reasoning
Question: Where did Charles Drew attend college that has the latest founding date?

**Think:** I need to find the colleges that Charles R. Drew attended and then find the college with the latest founding date.

**Result:** (Omitted for space...)

**Tool:**

    ExecuteSPARQL(
        sparql = "SELECT DISTINCT ?college WHERE {
            VALUES ?e {ns:m.018t67} .
            ?e ns:people.person.education ?edu .
            ?edu ns:education.education.institution ?college .
            ?college ns:organization.organization.date_founded ?founded_date .
        }
        ORDER BY DESC(xsd:date(?founded_date))
        LIMIT 1"
    )

**Result:** ["Dunbar High School"]

**Think:** The result is "Dunbar High School", but the question asks for the college, not high school. I need to re-evaluate the search. ... I will search for the type "College/University".

**Tool:**

    SearchTypes(query = "College/University")

**Result:** ["education.university", ...]

**Think:** The "SearchTypes" results include "education.university" ... I will find the colleges attended by Charles R. Drew and their founding dates, then order by founding date in descending order and limit to 1 to get the college with the latest founding date.

**Tool:**

    ExecuteSPARQL(
        sparql = "SELECT DISTINCT ?college WHERE {
            VALUES ?e {ns:m.018t67} .
            ?e ns:people.person.education ?edu .
            ?edu ns:education.education.institution ?college .
            ?college ns:type.object.type ns:education.university .
            ?college ns:organization.organization.date_founded ?founded_date .
        }
        ORDER BY DESC(xsd:date(?founded_date))
        LIMIT 1"
    )

**Result:** ["McGill University Faculty of Medicine"]

**Answer:** McGill University Faculty of Medicine
BibTeX
@article{chen2025knowcodera1,
title={KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA},
author={Chen, Zhuo and Wang, Fei and Li, Zixuan and Zhang, Zhao and Ding, Weiwei and Yang, Chuanguang and Xu, Yongjun and Jin, Xiaolong and Guo, Jiafeng},
journal={arXiv preprint arXiv:2510.25101},
year={2025}
}