Which Nested Data Format Do LLMs Understand Best? JSON vs. YAML vs. XML vs. Markdown

Should you use JSON because it’s so popular?
YAML for its human readability?
XML with its explicit closing tags?
Or maybe Markdown?
We put it to the test. Can you guess what the results showed?
(We recently investigated the related, but different, question of which formats of tabular data LLMs understand best.)
Why This Question Matters
Many AI systems need to process nested data - for example, API responses, configuration files or product catalogues.
System Accuracy
Your format choice directly impacts accuracy. Under our stress testing, we found a case where one format resulted in 54% more correct answers than another format.
Token Costs
Some formats use dramatically more tokens than others. XML required 80% more tokens than Markdown for the same data, representing nearly twice the inference cost.
Key Findings
We evaluated three language models - GPT-5 Nano, Llama 3.2 3B Instruct, and Gemini 2.5 Flash Lite - on their ability to answer questions about nested data presented in four formats: JSON, YAML, XML, and Markdown.
Each model was tested on 1,000 questions with data volumes calibrated to stress performance into the 40-60% accuracy range.*
Highlights
- YAML performed best for both GPT-5 Nano and Gemini 2.5 Flash Lite
- Markdown was the most token-efficient format across all models, using 34-38% fewer tokens than JSON and around 10% fewer than YAML
- JSON performed poorly for GPT-5 Nano and Gemini 2.5 Flash Lite
- XML performed worst for both GPT-5 Nano and Gemini 2.5 Flash Lite
- Llama 3.2 3B Instruct showed little format sensitivity, with all formats performing similarly
Experimental Design
Test Data
We used synthetic Terraform-like configuration data with 6-7 levels of nesting to test the models’ ability to navigate complex hierarchical structures. The data contained AWS resource definitions with nested tags, configurations, and references.
Here are small samples of each format:
JSON:
{
  "resource": {
    "aws_subnet": {
      "api-12": {
        "vpc_id": "${aws_instance.main-12.id}",
        "availability_zone": "us-east-1c",
        "tags": {
          "Environment": "development",
          "Project": "api-service",
          "CostCenter": "CC-1106"
        }
      }
    }
  }
}
YAML:
resource:
  aws_subnet:
    api-12:
      vpc_id: ${aws_instance.main-12.id}
      availability_zone: us-east-1c
      tags:
        Environment: development
        Project: api-service
        CostCenter: CC-1106
XML:
<data>
  <resource>
    <aws_subnet>
      <api-12>
        <vpc_id>${aws_instance.main-12.id}</vpc_id>
        <availability_zone>us-east-1c</availability_zone>
        <tags>
          <Environment>development</Environment>
          <Project>api-service</Project>
          <CostCenter>CC-1106</CostCenter>
        </tags>
      </api-12>
    </aws_subnet>
  </resource>
</data>
Markdown:
# resource
## aws_subnet
### api-12
vpc_id: ${aws_instance.main-12.id}
availability_zone: us-east-1c
#### tags
Environment: development
Project: api-service
CostCenter: CC-1106
The Markdown format deserves special attention as we could have implemented it in slightly different ways.
For this test we opted to use heading levels (#, ##, ###, etc.) to represent nesting depth, with leaf values represented as “key: value” pairs.
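As an illustration, here is a minimal Python sketch of that mapping. The function name and sample data are ours, not the code used in the experiments:

# Hypothetical sketch: render a nested dict as Markdown, using heading depth for
# nesting and "key: value" lines for leaves. Illustrative only, not the harness code.
def to_markdown(node: dict, depth: int = 1) -> str:
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{'#' * depth} {key}")      # heading level encodes nesting depth
            lines.append(to_markdown(value, depth + 1))
        else:
            lines.append(f"{key}: {value}")           # leaf value as "key: value"
    return "\n".join(lines)

sample = {"resource": {"aws_subnet": {"api-12": {"vpc_id": "${aws_instance.main-12.id}",
                                                 "tags": {"Environment": "development"}}}}}
print(to_markdown(sample))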
Methodology
Data
We calibrated the amount of data per model to bring accuracy into a roughly 40-60% range, making differences between formats easier to detect.
Questions
- 1,000 questions per format, testing the ability to retrieve specific nested values.
- Questions followed patterns such as “What is the value of resource.aws_subnet.api-12.tags.Environment?” or “How many items are in resource.aws_security_group.main-45.ingress?” (see the sketch below).
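As a rough illustration of what a correct answer requires, here is a minimal Python sketch of a dotted-path lookup. The function is hypothetical and ours, not part of the test harness:

# Hypothetical sketch: answer a dotted-path question against the parsed nested data.
def lookup(data: dict, dotted_path: str):
    node = data
    for part in dotted_path.split("."):
        node = node[part]              # descend one level per path segment
    return node

# value question:  lookup(parsed, "resource.aws_subnet.api-12.tags.Environment")
# count question:  len(lookup(parsed, "resource.aws_security_group.main-45.ingress"))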
Prompting and Evaluation
Each question was posed using this prompt template:
Here is some {FORMAT} data:
{data_string}
Question: {question}
Please provide a concise answer based only on the data provided above.
The LLM’s response was checked for the expected answer using simple substring matching.
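For clarity, here is a hedged Python sketch of that loop. call_model() is a placeholder for whichever provider client you use, and the case-insensitive comparison is our assumption about how the substring matching might be done:

# Hypothetical evaluation loop: build the prompt, query the model, score by
# substring matching. call_model() stands in for a real provider API client.
PROMPT_TEMPLATE = (
    "Here is some {fmt} data:\n\n{data}\n\n"
    "Question: {question}\n"
    "Please provide a concise answer based only on the data provided above."
)

def call_model(prompt: str) -> str:
    """Placeholder: replace with a call to your model provider's API."""
    raise NotImplementedError

def evaluate(fmt: str, data_string: str, questions: list[tuple[str, str]]) -> float:
    correct = 0
    for question, expected in questions:
        prompt = PROMPT_TEMPLATE.format(fmt=fmt, data=data_string, question=question)
        answer = call_model(prompt)             # the LLM call
        if expected.lower() in answer.lower():  # substring check (case handling is an assumption)
            correct += 1
    return correct / len(questions)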
Results
GPT-5 Nano
| Format | Accuracy | 95% CI | Tokens | Data Size |
|---|---|---|---|---|
| YAML | 62.1% | [59.1%, 65.1%] | 42,477 | 142.6 KB |
| Markdown | 54.3% | [51.2%, 57.4%] | 38,357 | 114.6 KB |
| JSON | 50.3% | [47.2%, 53.4%] | 57,933 | 201.6 KB |
| XML | 44.4% | [41.3%, 47.5%] | 68,804 | 241.1 KB |
GPT-5 Nano showed the strongest format preference, with YAML outperforming XML by 17.7 percentage points. Accuracy was significantly better with YAML than with all other formats.
The model struggled with XML, the most verbose format.
Llama 3.2 3B Instruct
| Format | Accuracy | 95% CI | Tokens | Data Size |
|---|---|---|---|---|
| JSON | 52.7% | [49.6%, 55.8%] | 35,808 | 124.6 KB |
| XML | 50.7% | [47.6%, 53.8%] | 42,453 | 149.2 KB |
| YAML | 49.1% | [46.0%, 52.2%] | 26,263 | 87.7 KB |
| Markdown | 48.0% | [44.9%, 51.1%] | 23,692 | 70.4 KB |
Llama 3.2 3B Instruct showed remarkable format agnosticism.
Gemini 2.5 Flash Lite
| Format | Accuracy | 95% CI | Tokens | Data Size |
|---|---|---|---|---|
| YAML | 51.9% | [48.8%, 55.0%] | 156,296 | 439.5 KB |
| Markdown | 48.2% | [45.1%, 51.3%] | 137,708 | 352.2 KB |
| JSON | 43.1% | [40.1%, 46.2%] | 220,892 | 623.8 KB |
| XML | 33.8% | [30.9%, 36.8%] | 261,184 | 745.7 KB |
Gemini 2.5 Flash Lite showed the same pattern as GPT-5 Nano: YAML performed best, XML performed worst.
The XML result was particularly poor.
Analysis
YAML: Good Default
YAML’s strong performance for GPT-5 Nano and Gemini 2.5 Flash Lite is noteworthy. Several factors may contribute:
- Visual hierarchy through indentation makes structure immediately apparent
- Minimal syntax overhead (no closing tags)
- Common in training data (configuration files, CI/CD, Kubernetes)
XML: Avoid
XML’s verbose structure required the most tokens and produced the lowest accuracy for every model except Llama 3.2 3B Instruct, where no format differed significantly from the others.
Possible explanations for this weak showing:
- Visual noise: The repetitive <tag></tag> pattern may interfere with content recognition
- Token inefficiency: More tokens = more opportunities for attention to diffuse
- Training data distribution: XML may be less represented in modern training corpora
Markdown: Most Token-Efficient
Markdown achieved the best token efficiency across all models:
- GPT-5 Nano: 34% fewer than JSON, 10% fewer than YAML
- Llama 3.2 3B Instruct: 34% fewer than JSON, 10% fewer than YAML
- Gemini 2.5 Flash Lite: 38% fewer than JSON, 12% fewer than YAML
Accuracy with Markdown was generally good, making it an interesting option if you’re looking to optimize cost or latency whilst retaining accuracy.
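If you want to verify the token cost for your own data, here is a small sketch using the tiktoken library. The encoding name is an assumption; use the tokenizer that matches your target model:

# Hypothetical token-count comparison. o200k_base is an assumed encoding; swap in
# the tokenizer that matches the model you actually call.
import json
import tiktoken

nested_data = {"resource": {"aws_subnet": {"api-12": {"vpc_id": "${aws_instance.main-12.id}",
                                                      "tags": {"Environment": "development"}}}}}

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print("JSON tokens:", count_tokens(json.dumps(nested_data, indent=2)))
# Render the same structure as YAML, XML and Markdown and compare the counts.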
JSON: Poor Except With Llama
Accuracy with JSON was poor with GPT-5 Nano and Gemini 2.5 Flash Lite, suggesting it should be avoided with those models.
It was the best performer for Llama 3.2 3B Instruct but the difference was not significant.
These results suggest defaulting to YAML or Markdown rather than JSON for nested data, unless you’re using a Llama model.
Model Preferences
- GPT-5 Nano: Strong format sensitivity, clear preferences, best with YAML
- Llama 3.2 3B Instruct: Format agnostic, consistent performance across formats
- Gemini 2.5 Flash Lite: Similar format preferences to GPT-5 Nano
Implications
- The format of nested data can significantly affect accuracy - test different formats if nested data makes up a large share of what you feed to the model
- YAML is a good default if you don’t know which format your specific model prefers
- Consider Markdown if cost is key - for dense data structures you’ll use about 10% fewer tokens than with YAML
- Avoid XML for large-scale nested data in LLM contexts
Limitations & Further Research
- Model Capability: We chose models that we knew we’d be able to stress with reasonable resources. It would be interesting to test more powerful models.
- Breadth of Providers: We tested models from three different providers. It would be good to test models from other providers, such as Claude, Qwen and DeepSeek.
- Data Domain: We tested dummy data from a single domain (Terraform-like configurations). It would be good to run similar tests with data from other domains.
- Question Type: We tested the LLM by asking fairly simple questions about the data. It’s possible that relative performance between formats might be different when asking more complicated or otherwise different types of question.
- Amount of Data: We tested the relative performance of different formats with large amounts of nested data. It’s possible that relative performance with smaller amounts of data doesn’t follow the same pattern.
Conclusion
Data format significantly affects language model performance on nested data structures, with effects varying by model. YAML emerged as the strongest format for two out of the three models we tested, while XML consistently underperformed.
If your system feeds significant amounts of nested data into an LLM, then test different formats if you can. Expect models from different providers to respond differently to changes in format.
In the absence of model-specific data, YAML is a good default if accuracy is your priority. Markdown may be preferred if you care more about cost.