GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

1Texas A&M University, 2University of Waterloo, 3Lambda

Figure 1: Overview of the GraphDancer framework.

Introduction

The internal parameters of LLMs are often insufficient to capture external knowledge that is both rapidly evolving and extremely long-tailed. Consequently, establishing a framework that enables LLMs to interact with external knowledge sources is necessary to improve their factuality and reliability. Retrieval-augmented generation (RAG) has become a mainstream solution toward this goal. However, when external knowledge is not organized as plain text corpora but instead interconnected in graph structures, two major challenges arise.

Challenge 1: Heterogeneous Relations. Graphs may contain heterogeneous node and edge types (e.g., [Gene] is upregulated in [Anatomy] vs. [Gene] is downregulated in [Anatomy]), making it inadequate to retrieve information solely through semantic similarity search. Instead, effective information access requires precisely defined function calls (e.g., finding genes connected to a given anatomy entity via "is upregulated in").

Challenge 2: Multi-hop Reasoning. Reasoning over graphs usually requires capturing complex, multi-hop connections between nodes. This makes it impractical to force the model to acquire all necessary information in a single round. Instead, an adaptive, multi-round process is needed, in which information seeking and reasoning steps are interleaved (e.g., subsequent function calls should be based on the gene nodes obtained in the current round).

To elicit multi-round graph interaction capabilities in LLMs, existing studies have proposed various prompting strategies. However, compared with actual model training, prompting-based approaches are less effective at helping models internalize sophisticated graph interaction skills. On the other hand, models trained via supervised fine-tuning (SFT) to access external knowledge struggle to generalize to new graphs and domains, such as those with novel node/edge types, function calls, or tasks.

Contributions

🔧 GraphDancer Framework: We propose GraphDancer, an RL framework that teaches LLMs to interact with and reason over graphs in an adaptive, multi-round fashion, enabling the internalization of sophisticated graph exploration skills.

📊 Graph-Aware Curriculum: We introduce a novel graph-aware curriculum that gradually increases question difficulty from one-hop to multi-hop and multi-round reasoning, allowing moderate-sized LLMs to progressively acquire complex graph reasoning abilities.

🚀 Strong Generalization: We demonstrate the effectiveness and generalizability of GraphDancer via evaluation on multiple unseen domains and OOD question types, where it outperforms competitive baselines with larger LLM backbones.

Method

Problem Setup

Let \(\mathcal{G}=(\mathcal{V},\mathcal{E})\) be a text-attributed graph with heterogeneous node/edge types, such as an academic network with papers, authors, and venues. Each node \(v \in \mathcal{V}\) is associated with a set of textual fields (e.g., title, name, abstract) and typed relations to other nodes. Given a question \(x\) and the graph environment \(\mathcal{G}\), the model needs to produce a natural-language answer \(y\) by interacting with \(\mathcal{G}\) through a set of executable graph functions.

We model the process of answering a question as an episodic Markov decision process (MDP). At round \(t\), the agent observes a state \(s_t\) comprising (1) the question \(x\), (2) the history of prior reasoning and actions, and (3) the graph observations returned by previous function calls. The agent then selects an action \(a_t\) (i.e., a graph function call), receives an observation \(o_t\) (i.e., results returned by the function call), and updates its state. An episode terminates when the agent produces a final answer or reaches a maximum of \(T\) rounds.
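The episodic loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `toy_policy` and `toy_env` stand in for the trained LLM and the graph environment, and the tag names follow the interaction format described below.

```python
import re

MAX_ROUNDS = 5  # stands in for the budget T; the value here is illustrative

def run_episode(policy, env, question, max_rounds=MAX_ROUNDS):
    """Interleave reasoning/action rounds until an <answer> block or budget exhaustion."""
    history = [question]
    for _ in range(max_rounds):
        output = policy("\n".join(history))                 # a_t sampled from state s_t
        history.append(output)
        answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
        if answer:                                          # terminal action: final answer
            return answer.group(1).strip()
        call = re.search(r"<graph>(.*?)</graph>", output, re.S)
        if call:                                            # execute the function call
            obs = env(call.group(1).strip())                # observation o_t
            history.append(f"<information>{obs}</information>")
    return None                                            # interaction budget exceeded

# Toy policy/environment to exercise the loop (not the trained model).
def toy_policy(state):
    if "<information>" not in state:
        return "<think>Find the node first.</think><graph>RetrieveNode('ResNet')</graph>"
    return "<think>Done.</think><answer>Kaiming He</answer>"

def toy_env(call):
    return "node_42"

print(run_episode(toy_policy, toy_env, "Who wrote ResNet?"))  # -> Kaiming He
```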

LLM Reasoning with Graph Interaction

We use a simple interaction format to expose tool use to the LLM. Specifically, the model generates a sequence interleaving three block types:

  • Reasoning block: <think> ... </think> containing non-executable reasoning text.
  • Action block: <graph> ... </graph> containing one or more function calls.
  • Observation block: <information> ... </information> appended by the environment after executing the action block.

The final output is produced in an <answer> ... </answer> block.

Graph Function Calls

To enable precise access over heterogeneous graphs, the agent invokes a small set of deterministic functions:

  • RetrieveNode(Text): returns a ranked list of node IDs relevant to a textual query (semantic entry point).
  • NodeFeature(NodeID, FeatureName): returns the requested textual field for a node.
  • NeighborCheck(NodeID, NeighborType): returns neighbor node IDs under a specified typed relation.
  • NodeDegree(NodeID, NeighborType): returns the count of neighbors under a specified typed relation.
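The four functions can be mocked over a tiny in-memory academic graph to make their semantics concrete. This is a sketch under assumed data structures (a node-ID-to-record dict); the actual environment and retriever are more involved, and the keyword match below merely stands in for semantic retrieval.

```python
# Hypothetical in-memory graph: node ID -> {"features": {...}, "neighbors": {type: [ids]}}
GRAPH = {
    "p1": {"features": {"title": "Deep Residual Learning"},
           "neighbors": {"author": ["a1", "a2"], "venue": ["v1"]}},
    "a1": {"features": {"name": "Kaiming He"}, "neighbors": {"paper": ["p1"]}},
    "a2": {"features": {"name": "Ross Girshick"}, "neighbors": {"paper": ["p1"]}},
    "v1": {"features": {"name": "CVPR"}, "neighbors": {"paper": ["p1"]}},
}

def retrieve_node(text):
    """Substring match stands in for the semantic retriever (the entry point)."""
    return [nid for nid, n in GRAPH.items()
            if any(text.lower() in str(v).lower() for v in n["features"].values())]

def node_feature(node_id, feature_name):
    """Return one textual field of a node."""
    return GRAPH[node_id]["features"].get(feature_name)

def neighbor_check(node_id, neighbor_type):
    """Return neighbor IDs under a typed relation."""
    return GRAPH[node_id]["neighbors"].get(neighbor_type, [])

def node_degree(node_id, neighbor_type):
    """Count neighbors under a typed relation."""
    return len(neighbor_check(node_id, neighbor_type))

# Two-hop access: resolve the paper node, then read one author's name.
pid = retrieve_node("Residual")[0]
print(node_feature(neighbor_check(pid, "author")[0], "name"))  # -> Kaiming He
```

Note how typed relations ("author" vs. "venue") make access precise in a way that similarity search over flattened text could not.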

Structural Difficulty of Graph Interaction

Graph interaction naturally decomposes into a sequence of information-seeking rounds. Based on the number of nodes surfaced in each round, we categorize rounds into two types:

  • Singleton lookup round (S-round): Introduces exactly one node (e.g., resolving an entity mention to a unique node via RetrieveNode).
  • Neighborhood expansion round (E-round): Brings in multiple nodes (e.g., retrieving all co-authors of a researcher).

Based on this decomposition, we define three difficulty levels:

  • Easy: Questions require a single round of information seeking.
  • Medium: Questions require multiple rounds but at most one E-round.
  • Hard: Questions require at least two E-rounds, reflecting repeated neighborhood expansions.

Graph-Aware Curriculum RL

RL struggles on hard reasoning tasks because moderate-sized LLMs often perform poorly in the zero-shot setting, and task-specific rewards are sparse, making direct learning unstable and inefficient. Curriculum learning addresses this by training from easier to harder tasks and decomposing complex skill acquisition into manageable steps.

We adopt Gaussian curriculum sampling with a time-varying convex mixture scheduler. At the beginning of RL training, the vast majority of questions require only one-hop information seeking or reasoning over the graph (e.g., "Who are the authors of the ResNet paper?"). As training progresses, these are gradually replaced by questions that rely on multi-hop connections and multi-round information seeking (e.g., "At which venue did Kaiming He and Ross Girshick collaborate most often?").

This biased-mixture design preserves the smooth easy-to-hard progression while maintaining controlled exposure to each level, which helps the model learn stable and generalizable graph tool-use behaviors under sparse rule-based rewards.
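One way to realize such a scheduler is sketched below. This is our reading of the design, not the paper's exact formulation: a Gaussian over difficulty levels whose mean drifts from Easy to Hard with training progress, convexly mixed with a uniform floor so every level retains nonzero sampling probability throughout. The hyperparameters (`sigma`, `floor`) are illustrative.

```python
import math
import random

LEVELS = ["Easy", "Medium", "Hard"]

def curriculum_weights(progress, sigma=0.6, floor=0.2):
    """Sampling distribution over difficulty levels at training fraction `progress` in [0, 1].

    A Gaussian bump whose mean moves from level 0 (Easy) to level 2 (Hard)
    is convexly mixed with a uniform floor for controlled exposure.
    """
    mean = progress * (len(LEVELS) - 1)
    gauss = [math.exp(-((i - mean) ** 2) / (2 * sigma ** 2)) for i in range(len(LEVELS))]
    z = sum(gauss)
    uniform = 1.0 / len(LEVELS)
    return [(1 - floor) * g / z + floor * uniform for g in gauss]

def sample_level(progress, rng=random):
    """Draw a difficulty level for the next training question."""
    return rng.choices(LEVELS, weights=curriculum_weights(progress))[0]

print(curriculum_weights(0.0))  # mass concentrated on Easy early in training
print(curriculum_weights(1.0))  # mass concentrated on Hard late in training
```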

Experiments

Experimental Setup

Dataset

We use GRBench to assess the ability of LLMs to interact with graphs. We train on the Academic domain only and test cross-domain generalization on four unseen domains: E-commerce, Literature, Healthcare, and Legal.

Questions that were originally labeled Hard in GRBench (i.e., those that cannot be answered by simply looking up the graph, though the graph may still provide useful context) are treated as out-of-distribution (OOD) samples. They are excluded from training and used solely for evaluation.

Table 1: Dataset statistics. Red: training data. Blue: testing data.

Baselines

We evaluate our framework against both prompting approaches and the vanilla RL baseline:

  • TextRAG: Treats the graph as a plain text corpus and uses a retriever to fetch relevant textual entries.
  • GraphRAG: Extends TextRAG by additionally retrieving the associated local subgraph.
  • Think-on-Graph 2.0 (ToG-2): A knowledge-guided RAG method that interleaves reasoning with iterative retrieval.
  • Graph-CoT: Enables LLMs to actively traverse external graphs through iterative function calls.
  • Vanilla RL: Uses the same reward function and hyperparameters as GraphDancer but without the proposed graph-aware curriculum.

Evaluation Metrics

We report two complementary metrics:

  • Rouge-L: Measures the longest common subsequence between the generated answer and the reference.
  • GPT4Score: An LLM-as-a-judge score computed with the same prompt and rubric as in prior work.
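For concreteness, Rouge-L can be computed from the longest common subsequence (LCS) of the two token sequences. A minimal self-contained version, using the F-measure form with the common default `beta=1.2` (actual evaluation likely uses a standard ROUGE implementation):

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score between a generated answer and the reference."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(round(rouge_l("the authors are He and Girshick", "He and Girshick"), 3))  # -> 0.709
```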

Results

Main Results

Table 2: Main results on GRBench. We highlight the Average performance (purple columns) and the performance Gap compared to our method (green column).

GraphDancer can make a small backbone competitive for graph interaction. Despite its smaller size (3B), GraphDancer matches or outperforms larger prompting-based baselines on average, including Graph-CoT with Qwen3-14B (39.85 Rouge-L / 42.01 GPT4Score) and GPT-4o-mini. This suggests that directly optimizing the multi-round reasoning→action→observation loop can provide benefits beyond in-context prompting, particularly when tool use and long-horizon credit assignment are critical.

Domain transfer is strong but not uniform. Compared with Vanilla RL, GraphDancer improves Rouge-L on E-commerce (+3.21) and Legal (+4.41), and also yields gains on Healthcare. However, performance drops on Literature, indicating that cross-domain transfer may depend on domain-specific graph structures and answer distributions.

Impact of Graph-Aware Curriculum

Table 3: Difficulty-wise Rouge-L (%) across domains.

On the Academic training domain, the largest gap between GraphDancer and Vanilla RL appears on Medium samples (55.04 vs. 22.20), which require multi-round interaction but limited branching. This pattern aligns with our motivation: scheduling from S-round-dominated samples to more involved traces can increase the frequency of successful multi-round trajectories early in training, mitigating sparse-reward instability.

On unseen domains, GraphDancer exhibits clear gains on Hard samples in E-commerce (16.23 vs. 12.41) and especially Legal (15.06 vs. 4.00), suggesting improved robustness when repeated neighborhood expansions are required.

Case Study

Figure 2: Case Study showing cross-domain generalization. Interaction traces on a Medium-difficulty query from E-commerce domain.

The case study analyzes a Medium-difficulty instance from the E-commerce domain, which is unseen during training. The question asks for the price of an item bought together with a specific loudspeaker, requiring a two-hop reasoning path: Anchor Item → Neighbor Item → Price.

Vanilla RL lacks a robust expansion strategy: it uses NodeDegree to verify the existence of a neighbor but fails to retrieve the corresponding node ID. It then attempts to hallucinate a node identifier based on pseudo-array indexing logic, which results in a function-call error.

In contrast, GraphDancer correctly decomposes the query. It uses NeighborCheck to retrieve the exact ID of the associated item, and then performs a second hop with NodeFeature on this new node to obtain the correct price. This case highlights that GraphDancer has learned the syntax of exploration (e.g., "get ID first, then query attributes"), enabling it to navigate unseen schemas effectively.

Error Analysis

Figure 3: Outcome breakdown over all evaluation episodes.

We decompose each episode into four mutually exclusive outcomes: Correct, Invalid Format (invalid tool calls), Loop / Timeout (exceeding the interaction budget), and Premature Stop (terminating with an incorrect answer).

Overall, GraphDancer improves the success rate (39.0%) compared with the base model (36.6%) and Vanilla RL (37.2%). More importantly, GraphDancer substantially reduces navigational paralysis: Loop / Timeout drops to 11.8% of all episodes, nearly halving compared with 21.2% (Qwen2.5-3B-Instruct) and 19.8% (Vanilla RL).

This indicates that the curriculum teaches the agent to navigate purposefully and recognize when to terminate exploration. Finally, GraphDancer mitigates the instability of standard RL fine-tuning: while Vanilla RL exhibits a higher rate of Invalid Format (8.1%), GraphDancer reduces it to 5.7%, better preserving adherence to the function-calling schema during exploration.

Conclusion

We present GraphDancer, an RL framework designed to equip LLMs with the capability to autonomously explore and reason over structured graph environments. By incorporating a novel graph-aware curriculum, we enable moderate-sized models to progressively internalize complex multi-hop reasoning skills, moving beyond the limitations of static retrieval or heavy prompting strategies.

Extensive experiments show that GraphDancer generalizes effectively across diverse unseen domains, including E-commerce, Literature, Healthcare, and Legal, after training on a single Academic domain. Notably, GraphDancer demonstrates robust adaptability to out-of-distribution question types, where graph interactions serve as contextual grounding rather than direct lookups.

These results suggest that through adaptive interaction and curriculum-based training, LLMs can acquire truly generalizable graph reasoning abilities, paving the way for more reliable and factually grounded agents in knowledge-intensive applications.

Appendix: Additional Analysis

We provide additional behavioral analysis and detailed diagnostics stratified by difficulty levels in the supplementary figures and tables.

BibTeX

@inproceedings{bai2025graphdancer,
  title={GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning},
  author={Bai, Yuyang and Li, Zhuofeng and Nie, Ping and Xie, Jianwen and Zhang, Yu},
  booktitle={Preprint},
  year={2025}
}