Which Nested Data Format Do LLMs Understand Best? JSON vs. YAML vs. XML vs. Markdown

What’s the best format for passing nested data to an LLM?

Should you use JSON because it’s so popular?

YAML for its human readability?

XML with its explicit closing tags?

Or maybe Markdown?

We put it to the test. Can you guess what the results showed?

(We recently investigated the related, but different, question of which formats of tabular data LLMs understand best.)

Why This Question Matters

Many AI systems need to process nested data - for example, API responses, configuration files or product catalogues.

System Accuracy

Your format choice directly impacts accuracy. Under our stress testing, we found a case where one format resulted in 54% more correct answers than another format.

Token Costs

Some formats use dramatically more tokens than others. XML required 80% more tokens than Markdown for the same data, representing nearly twice the inference cost.

Key Findings

We evaluated three language models - GPT-5 Nano, Llama 3.2 3B Instruct, and Gemini 2.5 Flash Lite - on their ability to answer questions about nested data presented in four formats: JSON, YAML, XML, and Markdown.

Each model was tested on 1,000 questions per format, with data volumes calibrated to stress performance into the 40-60% accuracy range.*

Highlights

  • YAML performed best for both GPT-5 Nano and Gemini 2.5 Flash Lite
  • Markdown was the most token-efficient format across all models, using 34-38% fewer tokens than JSON and around 10% fewer than YAML
  • JSON performed poorly for GPT-5 Nano and Gemini 2.5 Flash Lite
  • XML performed worst for both GPT-5 Nano and Gemini 2.5 Flash Lite
  • Llama 3.2 3B Instruct showed little format sensitivity, with all formats performing similarly
* Note: This does not mean you would see 40-60% accuracy in practice. Accuracy is close to 100% with much smaller amounts of data. We chose large data sizes intentionally to stress the models and make any differences between formats easier to measure.


Experimental Design

Test Data

We used synthetic Terraform-like configuration data with 6-7 levels of nesting to test the models’ ability to navigate complex hierarchical structures. The data contained AWS resource definitions with nested tags, configurations, and references.

Here are small samples of each format:

JSON:

{
  "resource": {
    "aws_subnet": {
      "api-12": {
        "vpc_id": "${aws_instance.main-12.id}",
        "availability_zone": "us-east-1c",
        "tags": {
          "Environment": "development",
          "Project": "api-service",
          "CostCenter": "CC-1106"
        }
      }
    }
  }
}

YAML:

resource:
  aws_subnet:
    api-12:
      vpc_id: ${aws_instance.main-12.id}
      availability_zone: us-east-1c
      tags:
        Environment: development
        Project: api-service
        CostCenter: CC-1106

XML:

<data>
  <resource>
    <aws_subnet>
      <api-12>
        <vpc_id>${aws_instance.main-12.id}</vpc_id>
        <availability_zone>us-east-1c</availability_zone>
        <tags>
          <Environment>development</Environment>
          <Project>api-service</Project>
          <CostCenter>CC-1106</CostCenter>
        </tags>
      </api-12>
    </aws_subnet>
  </resource>
</data>

Markdown:

# resource

## aws_subnet

### api-12

vpc_id: ${aws_instance.main-12.id}
availability_zone: us-east-1c

#### tags

Environment: development
Project: api-service
CostCenter: CC-1106

The Markdown format deserves special attention as we could have implemented it in slightly different ways.

For this test we opted to use heading levels (#, ##, ###, etc.) to represent nesting depth, with leaf values represented as “key: value” pairs.
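
For illustration, here is a minimal Python sketch of one way to produce this rendering (illustrative only - not the exact conversion code used in the experiment):

import re

def to_markdown(node, depth=1):
    # Heading depth encodes nesting level; leaf values become "key: value" lines.
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines += ["", f"{'#' * depth} {key}", ""]
            lines += to_markdown(value, depth + 1)
        else:
            lines.append(f"{key}: {value}")
    return lines

sample = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "vpc_id": "${aws_instance.main-12.id}",
                "availability_zone": "us-east-1c",
                "tags": {
                    "Environment": "development",
                    "Project": "api-service",
                    "CostCenter": "CC-1106",
                },
            }
        }
    }
}

# Collapse runs of blank lines so the output matches the sample shown above.
print(re.sub(r"\n{3,}", "\n\n", "\n".join(to_markdown(sample))).strip())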

Methodology

Data

We calibrated the amount of data per model to achieve roughly 40%-60% accuracy to support discrimination between formats.

Questions

  • 1,000 questions per format, testing the ability to retrieve specific nested values.

  • Questions followed patterns such as “What is the value of resource.aws_subnet.api-12.tags.Environment?” or “How many items are in resource.aws_security_group.main-45.ingress?”
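
As an illustration of how these dotted-path questions map onto the data, here is a small Python sketch (the resolve_path helper is ours, purely for illustration, not part of the experiment code):

def resolve_path(data, dotted_path):
    # Walk the nested dict one segment at a time, e.g.
    # "resource.aws_subnet.api-12.tags.Environment" -> "development".
    node = data
    for segment in dotted_path.split("."):
        node = node[segment]
    return node

config = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "availability_zone": "us-east-1c",
                "tags": {"Environment": "development", "Project": "api-service"},
            }
        }
    }
}

print(resolve_path(config, "resource.aws_subnet.api-12.tags.Environment"))  # development
# Count-style questions ("How many items are in ...?") reduce to len() of the resolved node.
print(len(resolve_path(config, "resource.aws_subnet.api-12.tags")))         # 2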

Prompting and Evaluation

Each question was posed using this prompt template:

Here is some {FORMAT} data:

{data_string}


Question: {question}

Please provide a concise answer based only on the data provided above.

The LLM’s response was checked for the expected answer using simple substring matching.
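
Putting the template and the check together, here is a minimal Python sketch of the scoring step (the commented-out model call is a placeholder for whichever provider API is in use, not a real client function):

PROMPT_TEMPLATE = """Here is some {FORMAT} data:

{data_string}


Question: {question}

Please provide a concise answer based only on the data provided above."""

yaml_document = """resource:
  aws_subnet:
    api-12:
      tags:
        Environment: development"""

prompt = PROMPT_TEMPLATE.format(
    FORMAT="YAML",
    data_string=yaml_document,
    question="What is the value of resource.aws_subnet.api-12.tags.Environment?",
)

# response = call_model(prompt)   # provider-specific API call, omitted here

def is_correct(response, expected_answer):
    # Simple substring match, per the evaluation described above.
    return str(expected_answer) in response

print(is_correct("The value is development.", "development"))  # True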

Results

GPT-5 Nano

Format     Accuracy   95% CI            Tokens    Data Size
YAML       62.1%      [59.1%, 65.1%]    42,477    142.6 KB
Markdown   54.3%      [51.2%, 57.4%]    38,357    114.6 KB
JSON       50.3%      [47.2%, 53.4%]    57,933    201.6 KB
XML        44.4%      [41.3%, 47.5%]    68,804    241.1 KB

GPT-5 Nano showed the strongest format preference, with YAML outperforming XML by 17.7 percentage points. Accuracy was significantly better with YAML than with all other formats.

The model struggled with XML, the most verbose format.

Llama 3.2 3B Instruct

Format     Accuracy   95% CI            Tokens    Data Size
JSON       52.7%      [49.6%, 55.8%]    35,808    124.6 KB
XML        50.7%      [47.6%, 53.8%]    42,453    149.2 KB
YAML       49.1%      [46.0%, 52.2%]    26,263    87.7 KB
Markdown   48.0%      [44.9%, 51.1%]    23,692    70.4 KB

Llama 3.2 3B Instruct showed remarkable format agnosticism: all four formats landed within roughly five percentage points of each other, with overlapping confidence intervals.

Gemini 2.5 Flash Lite

Format     Accuracy   95% CI            Tokens     Data Size
YAML       51.9%      [48.8%, 55.0%]    156,296    439.5 KB
Markdown   48.2%      [45.1%, 51.3%]    137,708    352.2 KB
JSON       43.1%      [40.1%, 46.2%]    220,892    623.8 KB
XML        33.8%      [30.9%, 36.8%]    261,184    745.7 KB

Gemini 2.5 Flash Lite showed the same pattern as GPT-5 Nano: YAML performed best, XML performed worst.

The XML result was particularly poor, trailing YAML by 18.1 percentage points.

Analysis

YAML: Good Default

YAML’s strong performance for GPT-5 Nano and Gemini 2.5 Flash Lite is noteworthy. Several factors may contribute:

  1. Visual hierarchy through indentation makes structure immediately apparent
  2. Minimal syntax overhead (no closing tags)
  3. Common in training data (configuration files, CI/CD, Kubernetes)

XML: Avoid

XML’s verbose structure required the most tokens and produced the poorest accuracy with every model except Llama 3.2 3B Instruct, where the difference between XML and the other formats was not significant.

Possible explanations for this weak showing:

  1. Visual noise: The repetitive <tag></tag> pattern may interfere with content recognition
  2. Token inefficiency: More tokens = more opportunities for attention to diffuse
  3. Training data distribution: XML may be less represented in modern training corpora

Markdown: Most Token-Efficient

Markdown achieved the best token efficiency across all models:

  • GPT-5 Nano: 34% fewer than JSON, 10% fewer than YAML
  • Llama 3.2 3B Instruct: 34% fewer than JSON, 10% fewer than YAML
  • Gemini 2.5 Flash Lite: 38% fewer than JSON, 12% fewer than YAML

Accuracy with Markdown was generally good, making it an interesting option if you’re looking to optimize cost or latency whilst retaining accuracy.
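
If you want a rough estimate of the token cost of your own data in each format, a sketch along these lines works. We use tiktoken’s o200k_base encoding here purely as a stand-in - each provider tokenizes differently, and the XML and Markdown renderers would be custom, like the to_markdown helper sketched earlier:

import json

import tiktoken   # pip install tiktoken
import yaml       # pip install pyyaml

data = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "vpc_id": "${aws_instance.main-12.id}",
                "availability_zone": "us-east-1c",
                "tags": {"Environment": "development",
                         "Project": "api-service",
                         "CostCenter": "CC-1106"},
            }
        }
    }
}

# o200k_base is only a proxy; the models in this post each use their own tokenizer.
enc = tiktoken.get_encoding("o200k_base")

renderings = {
    "JSON": json.dumps(data, indent=2),
    "YAML": yaml.dump(data, sort_keys=False),
    # "XML" / "Markdown" would use custom renderers such as to_markdown above.
}

for fmt, text in renderings.items():
    print(f"{fmt}: {len(enc.encode(text))} tokens")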

JSON: Poor Except With Llama

Accuracy with JSON was poor with GPT-5 Nano and Gemini 2.5 Flash Lite, suggesting it should be avoided with those models.

It was the best performer for Llama 3.2 3B Instruct but the difference was not significant.

These results suggest you’d be best defaulting to YAML or Markdown rather than JSON for nested data, unless you’re using a Llama model.

Model Preferences

  • GPT-5 Nano: Strong format sensitivity, clear preferences, best with YAML
  • Llama 3.2 3B Instruct: Format agnostic, consistent performance across formats
  • Gemini 2.5 Flash Lite: Similar format preferences to GPT-5 Nano

Implications

  • Format choice for nested data can matter significantly - test different formats if nested data makes up a large share of what you feed your LLM
  • YAML is a good default if you don’t know which format your specific model prefers
  • Consider Markdown if cost is key - for dense data structures you’ll use about 10% fewer tokens than with YAML
  • Avoid XML for large-scale nested data in LLM contexts

Limitations & Further Research

  • Model Capability: We chose models that we knew we’d be able to stress with reasonable resources. It would be interesting to test more powerful models.
  • Breadth of Providers: We tested models from three different providers. It would be good to test models from other providers, such as Claude, Qwen, and DeepSeek.
  • Data Domain: We tested dummy data from a single domain (Terraform-like configurations). It would be good to run similar tests with data from other domains.
  • Question Type: We tested the LLM by asking fairly simple questions about the data. It’s possible that relative performance between formats might be different when asking more complicated or otherwise different types of question.
  • Amount of Data: We tested the relative performance of different formats with large amounts of nested data. It’s possible that relative performance with smaller amounts of data doesn’t follow the same pattern.

Conclusion

Data format significantly affects language model performance on nested data structures, with effects varying by model. YAML emerged as the strongest format for two out of the three models we tested, while XML consistently underperformed.

If your system feeds significant amounts of nested data into an LLM, then test different formats if you can. Expect models from different providers to respond differently to changes in format.

In the absence of model-specific data, YAML is a good default if accuracy is your priority. Markdown may be preferred if you care more about cost.
