Which Nested Data Format Do LLMs Understand Best? JSON vs. YAML vs. XML vs. Markdown

What’s the best format for passing nested data to an LLM?

Should you use JSON because it’s so popular?

YAML for its human readability?

XML with its explicit closing tags?

Or maybe Markdown?

We put it to the test. Can you guess what the results showed?

(We recently investigated the related, but different, question of which formats of tabular data LLMs understand best.)

Why This Question Matters

Many AI systems need to process nested data - for example, API responses, configuration files or product catalogues.

System Accuracy

Your format choice directly impacts accuracy. Under our stress testing, we found a case where one format resulted in 54% more correct answers than another format.

Token Costs

Some formats use dramatically more tokens than others. XML required 80% more tokens than Markdown for the same data, representing nearly twice the inference cost.

Key Findings

We evaluated three language models - GPT-5 Nano, Llama 3.2 3B Instruct, and Gemini 2.5 Flash Lite - on their ability to answer questions about nested data presented in four formats: JSON, YAML, XML, and Markdown.

Each model was tested on 1,000 questions per format, with data volumes calibrated to stress performance into the 40-60% accuracy range.*

Highlights

  • YAML performed best for both GPT-5 Nano and Gemini 2.5 Flash Lite
  • Markdown was the most token-efficient format across all models, using 34-38% fewer tokens than JSON and around 10% fewer than YAML
  • JSON performed poorly for GPT-5 Nano and Gemini 2.5 Flash Lite
  • XML performed worst for both GPT-5 Nano and Gemini 2.5 Flash Lite
  • Llama 3.2 3B Instruct showed little format sensitivity, with all formats performing similarly
* Note: This does not mean you would see 40-60% accuracy in practice. Accuracy is close to 100% with much smaller amounts of data. We chose large data sizes intentionally to stress the models and make any differences between formats easier to measure.


Experimental Design

Test Data

We used synthetic Terraform-like configuration data with 6-7 levels of nesting to test the models’ ability to navigate complex hierarchical structures. The data contained AWS resource definitions with nested tags, configurations, and references.

Here are small samples of each format:

JSON:

{
  "resource": {
    "aws_subnet": {
      "api-12": {
        "vpc_id": "${aws_instance.main-12.id}",
        "availability_zone": "us-east-1c",
        "tags": {
          "Environment": "development",
          "Project": "api-service",
          "CostCenter": "CC-1106"
        }
      }
    }
  }
}

YAML:

resource:
  aws_subnet:
    api-12:
      vpc_id: ${aws_instance.main-12.id}
      availability_zone: us-east-1c
      tags:
        Environment: development
        Project: api-service
        CostCenter: CC-1106

XML:

<data>
  <resource>
    <aws_subnet>
      <api-12>
        <vpc_id>${aws_instance.main-12.id}</vpc_id>
        <availability_zone>us-east-1c</availability_zone>
        <tags>
          <Environment>development</Environment>
          <Project>api-service</Project>
          <CostCenter>CC-1106</CostCenter>
        </tags>
      </api-12>
    </aws_subnet>
  </resource>
</data>

Markdown:

# resource

## aws_subnet

### api-12

vpc_id: ${aws_instance.main-12.id}
availability_zone: us-east-1c

#### tags

Environment: development
Project: api-service
CostCenter: CC-1106

The Markdown format deserves special attention as we could have implemented it in slightly different ways.

For this test we opted to use heading levels (#, ##, ###, etc.) to represent nesting depth, with leaf values represented as “key: value” pairs.
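
For illustration, here is a minimal Python sketch of one way to produce this rendering (illustrative only - not the exact conversion code used in the experiment):

import re

def to_markdown(node, depth=1):
    # Heading depth encodes nesting level; leaf values become "key: value" lines.
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines += ["", f"{'#' * depth} {key}", ""]
            lines += to_markdown(value, depth + 1)
        else:
            lines.append(f"{key}: {value}")
    return lines

sample = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "vpc_id": "${aws_instance.main-12.id}",
                "availability_zone": "us-east-1c",
                "tags": {
                    "Environment": "development",
                    "Project": "api-service",
                    "CostCenter": "CC-1106",
                },
            }
        }
    }
}

# Collapse runs of blank lines so the output matches the sample shown above.
print(re.sub(r"\n{3,}", "\n\n", "\n".join(to_markdown(sample))).strip())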

Methodology

Data

We calibrated the amount of data per model to achieve roughly 40%-60% accuracy to support discrimination between formats.

Questions

  • 1,000 questions per format, testing the ability to retrieve specific nested values.

  • Questions followed patterns such as “What is the value of resource.aws_subnet.api-12.tags.Environment?” or “How many items are in resource.aws_security_group.main-45.ingress?”
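
As an illustration of how these dotted-path questions map onto the data, here is a small Python sketch (the resolve_path helper is ours, purely for illustration, not part of the experiment code):

def resolve_path(data, dotted_path):
    # Walk the nested dict one segment at a time, e.g.
    # "resource.aws_subnet.api-12.tags.Environment" -> "development".
    node = data
    for segment in dotted_path.split("."):
        node = node[segment]
    return node

config = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "availability_zone": "us-east-1c",
                "tags": {"Environment": "development", "Project": "api-service"},
            }
        }
    }
}

print(resolve_path(config, "resource.aws_subnet.api-12.tags.Environment"))  # development
# Count-style questions ("How many items are in ...?") reduce to len() of the resolved node.
print(len(resolve_path(config, "resource.aws_subnet.api-12.tags")))         # 2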

Prompting and Evaluation

Each question was posed using this prompt template:

Here is some {FORMAT} data:

{data_string}


Question: {question}

Please provide a concise answer based only on the data provided above.

The LLM’s response was checked for the expected answer using simple substring matching.
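
Putting the template and the check together, here is a minimal Python sketch of the scoring step (the commented-out model call is a placeholder for whichever provider API is in use, not a real client function):

PROMPT_TEMPLATE = """Here is some {FORMAT} data:

{data_string}


Question: {question}

Please provide a concise answer based only on the data provided above."""

yaml_document = """resource:
  aws_subnet:
    api-12:
      tags:
        Environment: development"""

prompt = PROMPT_TEMPLATE.format(
    FORMAT="YAML",
    data_string=yaml_document,
    question="What is the value of resource.aws_subnet.api-12.tags.Environment?",
)

# response = call_model(prompt)   # provider-specific API call, omitted here

def is_correct(response, expected_answer):
    # Simple substring match, per the evaluation described above.
    return str(expected_answer) in response

print(is_correct("The value is development.", "development"))  # True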

Results

GPT-5 Nano

Format     Accuracy   95% CI            Tokens    Data Size
YAML       62.1%      [59.1%, 65.1%]    42,477    142.6 KB
Markdown   54.3%      [51.2%, 57.4%]    38,357    114.6 KB
JSON       50.3%      [47.2%, 53.4%]    57,933    201.6 KB
XML        44.4%      [41.3%, 47.5%]    68,804    241.1 KB

GPT-5 Nano showed the strongest format preference, with YAML outperforming XML by 17.7 percentage points. Accuracy was significantly better with YAML than with all other formats.

The model struggled with XML, the most verbose format.

Llama 3.2 3B Instruct

Format     Accuracy   95% CI            Tokens    Data Size
JSON       52.7%      [49.6%, 55.8%]    35,808    124.6 KB
XML        50.7%      [47.6%, 53.8%]    42,453    149.2 KB
YAML       49.1%      [46.0%, 52.2%]    26,263    87.7 KB
Markdown   48.0%      [44.9%, 51.1%]    23,692    70.4 KB

Llama 3.2 3B Instruct showed remarkable format agnosticism: all four formats landed within roughly five percentage points of each other, with overlapping confidence intervals.

Gemini 2.5 Flash Lite

Format     Accuracy   95% CI            Tokens     Data Size
YAML       51.9%      [48.8%, 55.0%]    156,296    439.5 KB
Markdown   48.2%      [45.1%, 51.3%]    137,708    352.2 KB
JSON       43.1%      [40.1%, 46.2%]    220,892    623.8 KB
XML        33.8%      [30.9%, 36.8%]    261,184    745.7 KB

Gemini 2.5 Flash Lite showed the same pattern as GPT-5 Nano: YAML performed best, XML performed worst.

The XML result was particularly poor, trailing YAML by 18.1 percentage points.

Analysis

YAML: Good Default

YAML’s strong performance for GPT-5 Nano and Gemini 2.5 Flash Lite is noteworthy. Several factors may contribute:

  1. Visual hierarchy through indentation makes structure immediately apparent
  2. Minimal syntax overhead (no closing tags)
  3. Common in training data (configuration files, CI/CD, Kubernetes)

XML: Avoid

XML’s verbose structure required the most tokens and produced the poorest accuracy with every model except Llama 3.2 3B Instruct, where the difference between XML and the other formats was not significant.

Possible explanations for this weak showing:

  1. Visual noise: The repetitive <tag></tag> pattern may interfere with content recognition
  2. Token inefficiency: More tokens = more opportunities for attention to diffuse
  3. Training data distribution: XML may be less represented in modern training corpora

Markdown: Most Token-Efficient

Markdown achieved the best token efficiency across all models:

  • GPT-5 Nano: 34% fewer than JSON, 10% fewer than YAML
  • Llama 3.2 3B Instruct: 34% fewer than JSON, 10% fewer than YAML
  • Gemini 2.5 Flash Lite: 38% fewer than JSON, 12% fewer than YAML

Accuracy with Markdown was generally good, making it an interesting option if you’re looking to optimize cost or latency whilst retaining accuracy.
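
If you want a rough estimate of the token cost of your own data in each format, a sketch along these lines works. We use tiktoken’s o200k_base encoding here purely as a stand-in - each provider tokenizes differently, and the XML and Markdown renderers would be custom, like the to_markdown helper sketched earlier:

import json

import tiktoken   # pip install tiktoken
import yaml       # pip install pyyaml

data = {
    "resource": {
        "aws_subnet": {
            "api-12": {
                "vpc_id": "${aws_instance.main-12.id}",
                "availability_zone": "us-east-1c",
                "tags": {"Environment": "development",
                         "Project": "api-service",
                         "CostCenter": "CC-1106"},
            }
        }
    }
}

# o200k_base is only a proxy; the models in this post each use their own tokenizer.
enc = tiktoken.get_encoding("o200k_base")

renderings = {
    "JSON": json.dumps(data, indent=2),
    "YAML": yaml.dump(data, sort_keys=False),
    # "XML" / "Markdown" would use custom renderers such as to_markdown above.
}

for fmt, text in renderings.items():
    print(f"{fmt}: {len(enc.encode(text))} tokens")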

JSON: Poor Except With Llama

Accuracy with JSON was poor with GPT-5 Nano and Gemini 2.5 Flash Lite, suggesting it should be avoided with those models.

It was the best performer for Llama 3.2 3B Instruct but the difference was not significant.

These results suggest you’d be best defaulting to YAML or Markdown rather than JSON for nested data, unless you’re using a Llama model.

Model Preferences

  • GPT-5 Nano: Strong format sensitivity, clear preferences, best with YAML
  • Llama 3.2 3B Instruct: Format agnostic, consistent performance across formats
  • Gemini 2.5 Flash Lite: Similar format preferences to GPT-5 Nano

Implications

  • Format choice for nested data can matter significantly - test different formats if nested data makes up a large share of what you feed your LLM
  • YAML is a good default if you don’t know which format your specific model prefers
  • Consider Markdown if cost is key - for dense data structures you’ll use about 10% fewer tokens than with YAML
  • Avoid XML for large-scale nested data in LLM contexts

Limitations & Further Research

  • Model Capability: We chose models that we knew we’d be able to stress with reasonable resources. It would be interesting to test more powerful models.
  • Breadth of Providers: We tested models from three different providers. It would be good to test models from other providers, such as Claude, Qwen, and DeepSeek.
  • Data Domain: We tested dummy data from a single domain (Terraform-like configurations). It would be good to run similar tests with data from other domains.
  • Question Type: We tested the LLM by asking fairly simple questions about the data. It’s possible that relative performance between formats might be different when asking more complicated or otherwise different types of question.
  • Amount of Data: We tested the relative performance of different formats with large amounts of nested data. It’s possible that relative performance with smaller amounts of data doesn’t follow the same pattern.

Conclusion

Data format significantly affects language model performance on nested data structures, with effects varying by model. YAML emerged as the strongest format for two out of the three models we tested, while XML consistently underperformed.

If your system feeds significant amounts of nested data into an LLM, then test different formats if you can. Expect models from different providers to respond differently to changes in format.

In the absence of model-specific data, YAML is a good default if accuracy is your priority. Markdown may be preferred if you care more about cost.
