GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

Yuyang Bai¹, Zhuofeng Li¹, Ping Nie², Yu Wang⁴, Jianwen Xie³, Yu Zhang¹

¹Texas A&M University, ²University of Waterloo, ³Lambda, ⁴University of Oregon

🤗

Figure 1: Overview of the GraphDancer framework.

Introduction

The internal parameters of LLMs are often insufficient to capture external knowledge that is both rapidly evolving and extremely long-tailed. Consequently, establishing a framework that enables LLMs to interact with external knowledge sources is necessary to improve their factuality and reliability. Retrieval-augmented generation (RAG) has become a mainstream solution toward this goal. However, when external knowledge is not organized as plain text corpora but instead interconnected in graph structures, two major challenges arise.

Challenge 1: Heterogeneous Relations. Graphs may contain heterogeneous node and edge types (e.g., [Gene] is upregulated in [Anatomy] vs. [Gene] is downregulated in [Anatomy]), making it inadequate to retrieve information solely through semantic similarity search. Instead, effective information access requires precisely defined function calls (e.g., finding genes connected to a given anatomy entity via "is upregulated in").

Challenge 2: Multi-hop Reasoning. Reasoning over graphs usually requires capturing complex, multi-hop connections between nodes. This makes it impractical to force the model to acquire all necessary information in a single round. Instead, an adaptive, multi-round process is needed, in which information seeking and reasoning steps are interleaved (e.g., subsequent function calls should be based on the gene nodes obtained in the current round).

To elicit multi-round graph interaction capabilities in LLMs, existing studies have proposed various prompting strategies. However, compared with actual model training, prompting-based approaches are less effective at helping models internalize the sophisticated graph interaction skills. On the other hand, training models via supervised fine-tuning (SFT) to access external knowledge struggles to generalize to new graphs and domains, such as those with novel node/edge types, function calls, or tasks.

We propose GraphDancer, a two-stage post-training framework that addresses these limitations through reinforcement learning. The first stage (Curriculum-PPO) teaches the model how to interact with the graph under executable rule-based rewards. The second stage (Curriculum-DPO) refines the PPO policy on self-generated preference pairs, making interactions more grounded and efficient. Both stages are organized by a graph-aware curriculum that progressively increases task difficulty based on the structural complexity of information-seeking trajectories.

Contributions

🔧 Two-Stage Post-Training Framework: We propose GraphDancer, a two-stage post-training framework that combines Curriculum-PPO (online RL with executable rule-based rewards) and Curriculum-DPO (offline preference learning over self-sampled trajectories). The two stages complement each other: PPO teaches navigation behavior, while DPO improves grounding and schema compliance.

📊 Graph-Aware Curriculum: We introduce a graph-specific scheduler that decomposes trajectories into Singleton lookup rounds (S-rounds) and Neighborhood expansion rounds (E-rounds), and biases batch composition from Easy to Hard via a time-varying mixture. The same curriculum organizes both training stages.

🚀 Strong Cross-Domain Generalization: With a 3B backbone, GraphDancer outperforms graph-agent baselines built on substantially larger or stronger models (up to Qwen3-14B, Mixtral-8x7B-Instruct, and GPT-4o-mini), evaluated on four unseen GRBench domains plus out-of-distribution question types. Improvements are statistically significant under a two-tailed Z-test across five seeds.

Method

Problem Setup

Let \(\mathcal{G}=(\mathcal{V},\mathcal{E})\) be a text-attributed graph with heterogeneous node/edge types, such as an academic network with papers, authors, and venues. Each node \(v \in \mathcal{V}\) is associated with a set of textual fields (e.g., title, name, abstract) and typed relations to other nodes. Given a question \(x\) and the graph environment \(\mathcal{G}\), the model needs to produce a natural-language answer \(y\) by interacting with \(\mathcal{G}\) through a set of executable graph functions.

We model the process of answering a question as an episodic Markov decision process (MDP). At round \(t\), the agent observes a state \(s_t\) comprising (1) the question \(x\), (2) the history of prior reasoning and actions, and (3) the graph observations returned by previous function calls. The agent then selects an action \(a_t\) (i.e., a graph function call), receives an observation \(o_t\) (i.e., results returned by the function call), and updates its state. An episode terminates when the agent produces a final answer or reaches a maximum of \(T\) rounds.

LLM Reasoning with Graph Interaction

We use a simple interaction format to expose tool use to the LLM. Specifically, the model generates a sequence interleaving three block types:

Reasoning block: <think> ... </think> containing non-executable reasoning text.
Action block: <graph> ... </graph> containing one or more function calls.
Observation block: <information> ... </information> appended by the environment after executing the action block.

The final output is produced in an <answer> ... </answer> block.

Graph Function Calls

To enable precise access over heterogeneous graphs, the agent invokes a small set of deterministic functions:

RetrieveNode(Text): returns a ranked list of node IDs relevant to a textual query (semantic entry point).

NodeFeature(NodeID, FeatureName): returns the requested textual field for a node.

NeighborCheck(NodeID, NeighborType): returns neighbor node IDs under a specified typed relation.

NodeDegree(NodeID, NeighborType): returns the count of neighbors under a specified typed relation.

Structural Difficulty of Graph Interaction

Graph interaction naturally decomposes into a sequence of information-seeking rounds. Based on the number of nodes surfaced in each round, we categorize rounds into two types:

Singleton lookup round (S-round): Introduces exactly one node (e.g., resolving an entity mention to a unique node via RetrieveNode).
Neighborhood expansion round (E-round): Brings in multiple nodes (e.g., retrieving all co-authors of a researcher).

Based on this decomposition, we define three difficulty levels:

Easy: Questions require a single round of information seeking.
Medium: Questions require multiple rounds but at most one E-round.
Hard: Questions require at least two E-rounds, reflecting repeated neighborhood expansions.

Two-Stage Post-Training

GraphDancer post-trains the policy in two stages. Stage 1 uses online reinforcement learning to teach the model how to interact with the graph. Stage 2 uses offline preference learning to make interactions more grounded and efficient. The same graph-aware curriculum organizes both stages.

Stage 1: Curriculum-PPO

We optimize the policy with Proximal Policy Optimization (PPO) under a rule-based reward over executable graph calls. The reward combines three terms: (i) Exact Match against the gold answer as the primary correctness signal, (ii) a structural-validity penalty that fires when a correct answer is produced under an invalid trace, and (iii) a small Answer-Presence bonus on well-formed non-empty attempts to discourage degenerate outputs. No learned reward model is used.

At each training step, we sample a difficulty level from the graph-aware curriculum (see below) and uniformly select an instance within that level. This stage teaches navigation behavior: when to call which function, how to chain S- and E-rounds, and when to terminate.

Stage 2: Curriculum-DPO

Stage 1 produces a policy that interacts with the graph but still mis-calls tools and over-explores. Stage 2 refines the PPO checkpoint via Direct Preference Optimization (DPO) on self-generated preference pairs. For each training question, we sample M = 8 trajectories from the PPO policy and rank them by six trajectory-level keys with fixed tie-break priority:

(EM, EH, VF, −Loop, −Invalid, −Rounds)

EM and EH (Evidence Hit) are correctness signals; VF (Valid Format) is schema adherence; the last three penalize loops, invalid tool calls, and unnecessary rounds. Rank-1 becomes the chosen response, rank-M the rejected one. DPO then nudges the policy toward chosen trajectories without a separate reward model.

Graph-Aware Curriculum

Both stages share a biased-mixture scheduler that combines a Gaussian curriculum schedule \(g(t,k)\) with a fixed level-bias prior \(q(k)\) over {Easy, Medium, Hard}, mixed by a time-varying weight \(\eta(t)\) (rising from 0.2 to 0.8 over training). Early on, the model sees mostly Easy questions (e.g., "Who are the authors of the ResNet paper?"). As training progresses, multi-hop and multi-round questions become more frequent (e.g., "Which venue did Kaiming He and Ross Girshick collaborate most?"), while controlled exposure to easier levels is preserved throughout.

Experiments

Experimental Setup

Dataset

We use GRBench to assess the ability of LLMs to interact with graphs. We train on the Academic domain only and test cross-domain generalization on four unseen domains: E-commerce, Literature, Healthcare, and Legal.

Questions that were originally labeled Hard in GRBench (i.e., those that cannot be answered by simply looking up the graph, though the graph may still provide useful context) are treated as out-of-distribution (OOD) samples. They are excluded from training and used solely for evaluation.

Table 1: Dataset statistics. Red: training data. Blue: testing data.

Baselines

We compare GraphDancer with three groups of baselines:

Single-round RAG: TextRAG (graph as plain text + retriever) and GraphRAG (text + local subgraph retrieval), both with GPT-3.5-turbo.
Prompting-based graph agents: Graph-CoT and Graph-Counselor, both evaluated across four backbones (GPT-3.5-turbo, GPT-4o-mini, Qwen2.5-3B-Instruct, Qwen3-14B).
RL-based graph agents: Graph-MCTS and Graph-o1, both with Mixtral-8x7B-Instruct.

GraphDancer uses Qwen2.5-3B-Instruct as its backbone. We additionally report three ablations of GraphDancer itself: NoPPO (skip Stage 1), NoDPO (skip Stage 2), and NoCurriculum (uniform sampling in both stages).

Evaluation Metrics

We report two complementary metrics:

Rouge-L: Measures the longest common subsequence between the generated answer and the reference.
GPT4Score: An LLM-as-a-judge score computed with the same prompt and rubric as in prior work.

Results

Main Results

Table 2: Main results on GRbench. We highlight the Average performance (purple columns) and the performance Gap compared to our method (green column).

A 3B backbone significantly outperforms larger baselines. GraphDancer achieves 44.47 average Rouge-L and 46.44 average GPT4Score across the four unseen domains, outperforming every baseline including Graph-CoT and Graph-Counselor on Qwen3-14B and GPT-4o-mini, as well as Graph-o1 on Mixtral-8x7B-Instruct. The improvements are statistically significant under a two-tailed Z-test across five seeds (p < 0.01 against every baseline on both metrics).

Domain-level wins are positive across the board, but not uniform. GraphDancer achieves the highest GPT4Score on E-commerce, Literature, and Legal, with the largest absolute gains concentrated on Legal across all difficulty levels and on Literature Medium. The one domain where larger-backbone baselines still lead is Healthcare: GraphDancer scores 32.89 Rouge-L, behind both Graph-CoT (39.88) and Graph-Counselor (46.52) on GPT-4o-mini. A difficulty-wise breakdown localizes the gap to Healthcare Hard, which remains near zero for every method.

Ablation Analysis

Table 3: Ablation analysis (Rouge-L) on the four unseen GRBench domains.

We evaluate three ablations to isolate each design choice:

NoPPO — skip Stage 1 and train DPO directly from Qwen2.5-3B-Instruct. Average Rouge-L drops by 7.90, with Healthcare collapsing from 32.89 to 23.43 (−9.46). PPO is the load-bearing component for navigation behavior.
NoDPO — skip Stage 2 and output the Curriculum-PPO checkpoint directly. Average Rouge-L drops by 3.85, concentrated on Legal (+7.57: 38.25 → 45.82) and Literature (+5.43: 43.82 → 49.25). DPO drives schema adherence and evidence grounding.
NoCurriculum — replace the biased-mixture sampler with uniform sampling in both stages. Average Rouge-L drops by 3.05, largest on Legal (+7.32). The curriculum mainly helps the agent learn when to terminate.

Together, the three components contribute complementary signals: PPO produces navigation, DPO produces format compliance and grounding, and the graph-aware curriculum amplifies both.

Error Analysis

Figure 3: Outcome breakdown pooled over 890 evaluation episodes across the four unseen domains.

We decompose each episode into four mutually exclusive outcomes: Correct, Invalid Format (invalid tool calls), Loop / Timeout (exceeding the interaction budget), and Premature Stop (terminating with an incorrect answer). All compared methods use Qwen2.5-3B-Instruct as the LLM backbone.

PPO drives the Loop / Timeout reduction. GraphDancer reduces Loop / Timeout to 7.9% of all episodes, compared with 21.2% for Graph-CoT and 17.4% for NoPPO. Skipping the PPO stage (NoPPO) brings the failure rate back close to Graph-CoT and drops Correct below Graph-CoT's level.

DPO drives schema adherence. Removing DPO (NoDPO) nearly doubles the Invalid Format rate (3.3% → 5.7%) while leaving Loop / Timeout close to the full system. The curriculum sits between the two: NoCurriculum compresses Loop / Timeout only to 13.5% and trails the full system by 1.9 points on Correct.

In other words, PPO produces the navigation behavior that suppresses Loop / Timeout, DPO produces the format compliance that suppresses Invalid Format, and the graph-aware curriculum amplifies both. Residual errors in the full system concentrate on Premature Stop (49.9%), where the agent fails decisively rather than wandering.

Evidence-Grounded Accuracy

Rouge-L and GPT4Score reward accurate answers but cannot tell whether the model retrieved supporting evidence or guessed. We complement them with an evidence-grounded view: each episode is decomposed by exact-match correctness (EM) and evidence hit (EH; whether the model's tool observations contain the gold answer). The conditional P(EM | EH) measures evidence-to-answer conversion.

Method (`Qwen2.5-3B`)	EH ↑	P(EM \| EH) ↑	EM ↑
Graph-CoT	40.5	69.8	32.7
Graph-Counselor	39.0	73.2	28.5
GraphDancer	43.8	77.4	38.9

Table 4: Evidence-grounded accuracy (%) pooled over 890 questions across the four unseen GRBench domains.

Under the same backbone, GraphDancer achieves the highest EH (43.8%) and the highest evidence-conversion rate (77.4%), indicating that the full pipeline both retrieves answer-supporting evidence more often and converts retrieved evidence into a correct answer more reliably.

Conclusion

We present GraphDancer, a two-stage post-training framework for graph reasoning with LLMs. Stage 1 (Curriculum-PPO) teaches executable graph interaction under rule-based rewards. Stage 2 (Curriculum-DPO) refines the PPO policy with self-generated preference pairs ranked by six trajectory-level keys with fixed tie-break priority. Both stages are organized by a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories.

Training only on the Academic domain, GraphDancer generalizes to four unseen domains (E-commerce, Literature, Healthcare, Legal) and out-of-distribution question types with a 3B backbone. The ablation analysis shows that the three components contribute complementary signals: PPO drives navigation, DPO drives schema adherence and evidence grounding, and the graph-aware curriculum amplifies both.

The results suggest that transferable graph exploration requires training signals matched to graph interaction structure, beyond stronger prompting or larger backbones.

Appendix: Additional Analysis

We provide additional behavioral analysis and detailed diagnostics stratified by difficulty levels. Use the arrows to navigate through the supplementary figures and tables.

Difficulty-wise Rouge-L on the Four Unseen GRBench Domains

All three methods use Qwen2.5-3B-Instruct as the LLM backbone. "–" marks splits with no test samples (only Healthcare OOD).

Behavioral Analysis of Execution Traces

VF: Valid Format. VC: Valid Call. EH: Evidence Hit. Graph-Counselor uses a Plan/Thought/Action/Observation protocol that does not match the <think>/<graph> format, so VF/VC are not directly comparable.

Question-Type Breakdown on Literature

Categories derived by regex-matching the canonical question templates from Jin et al. (2024). GraphDancer wins every category on both EM and Rouge-L.

BibTeX

@misc{bai2026graphdancertrainingllmsexplore,
  title={GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training},
  author={Bai, Yuyang and Li, Zhuofeng and Nie, Ping and Wang, Yu and Xie, Jianwen and Zhang, Yu},
  year={2026},
  eprint={2602.02518},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02518}
}