Adding llms.txt — Site Metadata for the AI-Search Era (ChatGPT · Claude · Perplexity Exposure)

2026.06.02 ·#llms.txt #GEO #Generative Engine Optimization #AI Overview #ChatGPT #Claude #Perplexity #Gemini #robots.txt #LLM crawler #AI Search

TL;DR: llms.txt is LLM-friendly site metadata: a markdown file at the site root that may make ChatGPT · Claude · Perplexity a bit more likely to cite you. Hosting it costs nothing. The standard is new and adoption is still partial, but skipping it is a free miss. I'd pair it with explicit LLM-crawler Allow rules in robots.txt; the two together take about 30 minutes of work.

One day Cloudflare Web Analytics showed a chatgpt.com referrer for the first time. To me that read as "an AI knows we exist," so I took the chance to clean up llms.txt and robots.txt the same hour. These are the notes I gathered while actually doing it.

What is llms.txt

If robots.txt is the deal you make with search crawlers, then llms.txt is the deal you make with LLMs. Roughly, it works like this:

It's a markdown file you put at the site root
It lets an LLM quickly understand the site's structure, key facts, and policies
It's an emerging standard, proposed in late 2024 (llmstxt.org)
ChatGPT · Claude · Perplexity · Gemini are the intended readers for retrieval and answers, though adoption is still gradual and none of them is guaranteed to fetch the file

Why is it needed?

When an LLM answers something, it cites from web search or from its training data. The trouble is that, from the LLM's side, sites are often awkward to deal with:

If the HTML is deep and complex, the parsing cost goes up
If JS rendering is required, the LLM crawler often can't reach the content
If the key facts are scattered across many pages, they're hard to cite together

With llms.txt in place, the LLM can grab the site's gist, its page list, and its policies in a single read. That, in turn, raises the odds of being cited.

Our llms.txt structure

In practice, https://taystudios.com/llms.txt looks something like this:

# TAYSTUDIO

> Free web tools, calculators, and a blog for Korean users. 68 tools all run inside the browser.

Operator: TayLee
Last updated: 2026-06-02

## Policy (Citation Trustworthiness Guide)

- All tools run client-side in the browser
- Numbers cited directly from government sources (law.go.kr, NTS, MOEF)
- Zero citations from blogs, news, SNS
- Reflects 2026 policy

## Core Policy Numbers (2026, for LLM citation)

### Four Social Insurances (employee share)
- National Pension: 4.75% (total 9.5%, +0.5%p/year through 2033)
- Health Insurance: 3.595% (total 7.19%)
- Long-term Care: 0.4724%
- Employment Insurance: 0.9%

### Real Estate
- 1-home-owner FMV ratio (2026): ≤300M 43% / 300–600M 44% / >600M 45%
- General housing: 60%
- Car tax January lump-sum: 5% deduction

[...]

## Calculators — Tax/Income (9)
- [Net Salary Calculator 2026](URL): auto-deducts 4 insurances, income tax, local tax
[...]

These are the elements I paid attention to when writing it:

1. One-sentence site definition (`>` blockquote)

A sentence the LLM can reuse verbatim when it summarizes the site.

2. Operator · date · license

So that citations can credit you — basically a trust signal.

3. Policy statement (citation trustworthiness guide)

If you spell out "how does this site verify its info?", the LLM has something to weigh trust against when it answers a user.

4. Consolidated key-fact section

The numbers (rates, ratios, thresholds) that are otherwise spread across pages, gathered into one place. With that done, the LLM can reference them quickly.

5. Page list by category

When a user asks something like "Korean inheritance tax calculator recommendations," the LLM can match the question against this list, which makes a citation of our page more likely.
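The five elements above can be checked mechanically before publishing. Here is a minimal Python sketch, assuming the layout shown in the example (an H1 title, a `>` summary, then `##` sections holding markdown links). The names `parse_llms_txt` and `missing_elements` and the trimmed sample text are hypothetical, not the site's actual tooling.

```python
import re

# Matches list items such as "- [Name](https://example.com/page): note"
LINK_RE = re.compile(r"^- \[([^\]]+)\]\(([^)]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text):
    """Split llms.txt text into a title, a summary, and per-section links."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                name, url, note = match.groups()
                doc["sections"][current].append(
                    {"name": name, "url": url, "note": note or ""}
                )
    return doc


def missing_elements(doc):
    """List which of the basic elements are absent from a parsed file."""
    missing = []
    if not doc["title"]:
        missing.append("title")
    if not doc["summary"]:
        missing.append("summary")
    if not any(doc["sections"].values()):
        missing.append("links")
    return missing


# Trimmed, illustrative sample (not the full production file).
SAMPLE = """\
# TAYSTUDIO

> Free web tools, calculators, and a blog for Korean users.

## Calculators
- [Net Salary Calculator 2026](https://taystudios.com/tools/salary/): auto-deducts 4 insurances
"""

doc = parse_llms_txt(SAMPLE)
print(doc["title"], missing_elements(doc))
```

Plain bullet lines without a link (such as the policy numbers) are skipped by this sketch on purpose; it only checks that a title, a summary, and at least one linked page exist.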

Making robots.txt LLM-friendly

Alongside llms.txt, I also explicitly allowed the LLM crawlers in robots.txt:

# LLM crawlers — explicit Allow
User-agent: GPTBot          # OpenAI / ChatGPT
Allow: /
Disallow: /dash-tay9k3m/    # operator-only

User-agent: ClaudeBot       # Anthropic / Claude
Allow: /

User-agent: PerplexityBot   # Perplexity
Allow: /

User-agent: Google-Extended # Google Gemini training (separate from Googlebot)
Allow: /

User-agent: CCBot           # Common Crawl (widely used as LLM training data)
Allow: /

User-agent: Applebot-Extended  # Apple Intelligence
Allow: /

# llms.txt reference
# https://taystudios.com/llms.txt

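One thing worth flagging here: these User-agents are separate from the regular search bots. A compliant crawler that finds no group naming it falls back to User-agent: *, so whatever that group blocks applies to it too. Spelling out an explicit Allow is the safe choice. The flip side is that a crawler with its own group ignores User-agent: * entirely, so a Disallow such as /dash-tay9k3m/ has to be repeated in every named group, not only under GPTBot.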
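How a named group interacts with User-agent: * can be checked locally. The sketch below uses Python's standard `urllib.robotparser` on a small hypothetical robots.txt (the `/private/` path and `SomeOtherBot` are made up for illustration). Python's parser applies the first matching rule inside a group rather than the longest one, so the sample keeps one rule per group to stay unambiguous.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a catch-all group plus one named LLM crawler group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ClaudeBot   # named group
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler with its own group uses only that group.
claude_private = rp.can_fetch("ClaudeBot", "/private/report")

# A crawler with no named group falls back to the "*" group.
other_private = rp.can_fetch("SomeOtherBot", "/private/report")
other_public = rp.can_fetch("SomeOtherBot", "/tools/salary/")

print(claude_private, other_private, other_public)
```

In this sample the named group does not inherit the `/private/` block from the catch-all group, which is why any operator-only path needs its own Disallow line in each named group.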

Verification — the LLM-exposure signal

The Cloudflare Web Analytics referrers I mentioned earlier looked like this:

Visits by source:
- m.search.naver.com: 18
- search.naver.com: 16
- search.daum.net: 11
- chatgpt.com: (visits)

A chatgpt.com referrer means a user clicked through from ChatGPT, usually in one of two ways:

A user asked for a recommendation (e.g., "recommend a Korean capital gains tax calculator") and clicked the link ChatGPT suggested
A user clicked a citation link attached to a ChatGPT answer
Either way, it's a signal that LLM searches are starting to surface the site

The growth of that referrer is what I treat as the success metric for GEO (Generative Engine Optimization).
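Tracking that metric is mostly a matter of grouping referrer hostnames. Here is a minimal sketch, assuming the analytics export can be reduced to a mapping of referrer to visit count. The `LLM_HOSTS` set is an assumption (only chatgpt.com appears in my data), and the chatgpt.com count of 3 is a placeholder, not a real figure.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: hostnames treated as LLM referrers. Extend as needed.
LLM_HOSTS = {"chatgpt.com", "perplexity.ai", "claude.ai", "gemini.google.com"}


def referrer_host(value):
    """Return a lowercase hostname from a bare host or a full referrer URL."""
    parsed = urlparse(value if "://" in value else "//" + value)
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def llm_share(visits_by_referrer):
    """Return (LLM visits, LLM share of all referred visits)."""
    totals = Counter()
    for referrer, count in visits_by_referrer.items():
        totals[referrer_host(referrer)] += count
    llm_visits = sum(n for host, n in totals.items() if host in LLM_HOSTS)
    all_visits = sum(totals.values())
    return llm_visits, (llm_visits / all_visits if all_visits else 0.0)


visits = {
    "m.search.naver.com": 18,
    "search.naver.com": 16,
    "search.daum.net": 11,
    "chatgpt.com": 3,  # placeholder count for illustration only
}
llm_visits, share = llm_share(visits)
print(llm_visits, round(share, 4))
```

Comparing the share week over week says more than the raw count, since overall traffic on a new domain moves a lot.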

GEO vs SEO

| Area | SEO (classic) | GEO (LLM era) |
| --- | --- | --- |
| Target | Search engines (Google · Naver · Bing) | LLMs (ChatGPT · Claude · Perplexity · Gemini) |
| Metadata files | sitemap · robots · meta tags | llms.txt + robots.txt LLM allow |
| Content | Keyword match · long-tail | Fact-rich · source-cited |
| Core signal | Backlinks · DA · CTR | Citation likelihood · accuracy · structure |
| Measured with | GSC · Naver SearchAdvisor | LLM referrer · citation traces |

In short, GEO and SEO can run in parallel — the same content tends to lift both at once.

Writing tips

1. Facts first — no fluff/marketing tone

❌ "TAYSTUDIO delivers the best experience..."
✅ "Free web tools/calculators for Korean users. 68 tools run in-browser."

In my experience, only the facts survive a citation. Marketing tone only lowers trust.

2. Cite your sources — distribute responsibility

✅ "Numbers cited directly from government sources (law.go.kr, NTS, MOEF)"
✅ "Medical numbers from peer-reviewed studies/official guidelines (KOSSO Obesity Guideline 2022 · KDRI 2020 · WHO · ACOG · AAP)"

This is the part that lets the LLM answer "where does this site get its info" when someone prompts it.

3. Last-updated date

✅ Last updated: 2026-06-02

It signals that the site is alive, and that policies which changed after the model's cutoff (the ₩100M deposit-protection limit from 2025-09 · the 100% car-tax relief for multi-child households in 2026) have been accounted for.
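The date line is easy to let go stale, so it is worth checking automatically. Here is a minimal sketch, assuming the exact `Last updated: YYYY-MM-DD` line format used above. The function name `days_since_update` is hypothetical, and `today` is passed in explicitly so the result is reproducible.

```python
import re
from datetime import date

# Matches a line such as "Last updated: 2026-06-02"
DATE_RE = re.compile(r"^Last updated:\s*(\d{4})-(\d{2})-(\d{2})\s*$", re.MULTILINE)


def days_since_update(llms_text, today):
    """Return days between the 'Last updated' line and today, or None if absent."""
    match = DATE_RE.search(llms_text)
    if match is None:
        return None
    updated = date(*map(int, match.groups()))
    return (today - updated).days


header = "Operator: TayLee\nLast updated: 2026-06-02\n"
age = days_since_update(header, date(2026, 7, 2))
print(age)
```

What counts as "too old" depends on how often the underlying policies change, so the sketch reports the age and leaves the threshold to the caller.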

4. Absolute URLs

✅ [Net Salary Calculator](https://taystudios.com/tools/salary/)
❌ [Net Salary Calculator](/tools/salary/)

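Use absolute URLs. An LLM that reads llms.txt on its own may not resolve a relative path against the right host, and it needs a full link to hand back to the user.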
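Relative links are easy to miss by eye in a long file. Here is a minimal sketch that flags them, assuming standard `[label](target)` markdown links. `relative_links` is a hypothetical helper, and the base URL passed to `urljoin` is the site's llms.txt address.

```python
import re
from urllib.parse import urljoin, urlparse

# Captures the label and target of inline markdown links.
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def relative_links(markdown_text):
    """Return (label, target) pairs whose target is not an absolute http(s) URL."""
    flagged = []
    for label, target in MD_LINK_RE.findall(markdown_text):
        parsed = urlparse(target)
        if not (parsed.scheme in ("http", "https") and parsed.netloc):
            flagged.append((label, target))
    return flagged


text = """\
- [Net Salary Calculator](https://taystudios.com/tools/salary/)
- [Net Salary Calculator](/tools/salary/)
"""

flagged = relative_links(text)
# Resolve each flagged target against the file's own URL to get the fix.
fixed = [urljoin("https://taystudios.com/llms.txt", target) for _, target in flagged]
print(flagged, fixed)
```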

5. Changelog section

## Changelog (recent)

- 2026-06-02: 18 tool stale-fixes + 5 differentiating matrices added
- 2026-05-31: blog launch (62 posts)
- 2026-05-09: domain migration

Including this helps the LLM see that the site is actively maintained and when its content last changed.

Known limits

To be honest, there are limits worth writing down too:

llms.txt is still emerging (a late-2024 proposal) and isn't universally adopted yet
ChatGPT · Claude · Perplexity have announced only partial support
It's hard to measure, because you can't track citation counts directly
Still, it's close to zero cost (one static file), so skipping it costs more than doing it

Conclusion

llms.txt feels like the sitemap.xml of the AI-search era — the basic config for being citeable by LLMs. Even at partial adoption and partial effect, the cost is zero, so I think it's well worth doing.

It pays off especially when:

The site covers a fact-heavy domain (tax · medical · policy · stats), where precise numbers matter
The file gives the LLM several signals for judging accuracy (sources, dates, a changelog)
It's paired with explicit LLM-crawler Allow rules in robots.txt

For a new-domain operator in month one, it was one of the highest-ROI SEO actions I took.
