Why Johnny LLM can’t read web page source code

First in a series of posts by John Morgan and Philip Greenspun on using AI models for HTML and CSS design.

The plan: compare ChatGPT, Gemini, Claude, and Grok to see which LLM does the best job of CSS redesign (and, where necessary, HTML redesign) of existing web pages.

All of these models failed at the simplest imaginable HTML/CSS question (the URLs were plain text, not hyperlinks, in the queries, but are included here for convenience):

Do all four of these HTML pages reference the same .css file? https://philip.greenspun.com/samantha/ https://philip.greenspun.com/samantha/samantha-X https://philip.greenspun.com/photography/ https://philip.greenspun.com/photography/learn/interior

All four pages actually do reference the same .css file, /margins-and-ads.css, and one of them also references /philg.css.

Sample actual page content, first from the interior photography page and then from the samantha-X chapter:

<head>
<meta name="keywords" content="interior photography">
<LINK REL=STYLESHEET HREF="/margins-and-ads.css" TYPE="text/css">

<title>How to Photograph Architecture (Interior)
</title>
</head>

<head>
<link rel=stylesheet href="/margins-and-ads.css" type="text/css">
<LINK REL=STYLESHEET HREF="/philg.css" TYPE="text/css">
<title>Chapter X:  Overcharged in Katmai</title>
<meta name=processor content=adp>
</head>
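
For reference, here is a minimal Python sketch (ours, not something any of the LLMs produced) of how to answer the question directly; it assumes plain HTTP fetches and case-insensitive matching of <link rel=stylesheet ...> tags like the ones above.

import re
import urllib.request

URLS = [
    "https://philip.greenspun.com/samantha/",
    "https://philip.greenspun.com/samantha/samantha-X",
    "https://philip.greenspun.com/photography/",
    "https://philip.greenspun.com/photography/learn/interior",
]

# Matches <link ... rel=stylesheet ... href=...> in any letter case, with or
# without quotes; assumes rel appears before href, as in the excerpts above.
LINK_RE = re.compile(
    r'<link[^>]*rel\s*=\s*"?stylesheet"?[^>]*href\s*=\s*"?([^"\s>]+)',
    re.IGNORECASE,
)

for url in URLS:
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    print(url, sorted(set(LINK_RE.findall(html))))

Run against the four URLs, every page should list /margins-and-ads.css, with samantha-X also listing /philg.css.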

from ChatGPT

I checked the content of the four pages you listed, but from the snippets returned there is no visible <link rel=”stylesheet” …> line in the parts shown — that’s because the portions we could fetch start at the body text and don’t include the <head> section where CSS is normally referenced. … If you want precise verification, I can fetch and show the full <head> source (including the exact <link …> line) for each of the four URLs — just let me know.

[after “Yes please fetch the full HEAD source for each one and verify.”]

I tried to fetch the full HTML <head> source of each of the four pages you listed — specifically to look for the <link rel=”stylesheet” …> tag that shows which CSS file (if any) each page references. However, the tool returns only the rendered body content and not the complete <head> section, so I couldn’t directly inspect the <link> tags for CSS in the fetched results.

from Claude

I should note that the web_fetch tool extracts text content and may not always show the complete HTML <head> section where CSS references typically appear. The pages I did fetch appear to be very minimalist in design, consistent with early-to-mid 1990s web design when these pages were originally created.

Without being able to see the complete HTML source including the <head> sections of all four pages, I cannot definitively confirm whether they all reference the same CSS file.

[prompted further]

Now I have all four pages. Let me examine them for CSS file references. Looking at the extracted HTML content from all four pages, I notice that none of them contain explicit <link> tags to external CSS files in the content that was extracted.

from Gemini

[Us: Google’s smartest child has trouble counting and trouble understanding HTML, but it is great at presentation. It says that three pages reference no external CSS and then shows a table featuring two files that reference CSS. It is dead wrong about the interior page, which references margins-and-ads.css. It either cross-contaminated its own thinking by looking at the samantha-X file, which does reference philg.css, or found a reference to the file in the BODY of https://philip.greenspun.com/panda/html.]

No, they do not all reference the same CSS file. In fact, three of the pages appear to use no external CSS at all, while one of them uses a specific stylesheet for that section of the site.

Based on the source code for each page:

  1. https://philip.greenspun.com/samantha/
    • CSS: None.
    • Details: This page uses old-school HTML attributes (like bgcolor=white and text=black in the <body> tag) for styling rather than an external CSS file.
  2. https://philip.greenspun.com/samantha/samantha-X
    • CSS: None.
    • Details: Like the index page for this book, individual chapters are formatted using plain HTML and inline attributes.
  3. https://philip.greenspun.com/photography/
    • CSS: References /philg.css
    • Details: This page contains a <link rel="stylesheet" href="/philg.css" type="text/css"> tag in the <head> .
  4. https://philip.greenspun.com/photography/learn/interior
    • CSS: References /philg.css
    • Details: As part of the photography tutorial section, it shares the same stylesheet as the photography index.

Limitation Rationale

Why are these LLMs so hobbled? Apparently the limitation is intentional: the LLM is supposed to see only what a human visitor would see. This helps it make sense of dynamically rendered sites and may also head off some security risks.

When prompted, ChatGPT replied that this limitation is due to its web content fetching tool not returning the full payload and instead querying “search indexes, reader-mode extracts, preprocessed HTML snapshots, [and] content extraction pipelines” that are “optimized for reading and summarizing, not inspection.” It explained that “in those pipelines, the <head> is usually stripped entirely or reduced to a title and a few meta fields [because] CSS links, JS, analytics, preload hints, etc. are irrelevant to text extraction [and] keeping them increases size and complexity with no benefit for search/reading.” If true, this suggests that the tool was not built with full HTML analysis in mind but rather optimized for the admittedly more common text extraction use case.
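
To make that concrete, here is a toy Python sketch of the kind of reader-mode extractor ChatGPT describes (purely illustrative; we have no idea what OpenAI’s actual pipeline looks like). Everything inside <head>, including the stylesheet <link>, is thrown away before the text ever reaches the model.

from html.parser import HTMLParser

class ReaderModeExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        # Keep only visible body text; the <link rel=stylesheet ...> tags and
        # everything else inside <head> are silently discarded.
        if not self.in_head and data.strip():
            self.chunks.append(data.strip())

page = ('<head><LINK REL=STYLESHEET HREF="/margins-and-ads.css" TYPE="text/css">'
        '<title>How to Photograph Architecture (Interior)</title></head>'
        '<body>Text of the chapter...</body>')
extractor = ReaderModeExtractor()
extractor.feed(page)
print(" ".join(extractor.chunks))  # prints only "Text of the chapter..."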

Next Steps

Since our future AI overlords won’t download a web page from the Internet and click “view source”, the next step was to see what these LLMs could do if we downloaded HTML and CSS files to our desktop computers and then uploaded the files back up to the LLMs along with our questions.
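
For anyone who wants to follow along, here is a sketch of that download step in Python (the local file names are our own invention, and we assume the shared stylesheet resolves at /margins-and-ads.css on the same host):

import pathlib
import urllib.request

FILES = {
    "samantha-index.html": "https://philip.greenspun.com/samantha/",
    "samantha-X.html": "https://philip.greenspun.com/samantha/samantha-X",
    "photography-index.html": "https://philip.greenspun.com/photography/",
    "interior.html": "https://philip.greenspun.com/photography/learn/interior",
    "margins-and-ads.css": "https://philip.greenspun.com/margins-and-ads.css",
}

outdir = pathlib.Path("llm-css-test")
outdir.mkdir(exist_ok=True)
for name, url in FILES.items():
    # Save the raw bytes so the unmodified <head> sections survive the trip.
    (outdir / name).write_bytes(urllib.request.urlopen(url).read())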

2 thoughts on “Why Johnny LLM can’t read web page source code”

  1. Why are you trying to use a chatbot for a coding task in 2026?

    With all due respect, use a coding agent. I recommend Codex, but Claude will likely work for this simple task. Everything else is far behind on our benchmarks.

    Download Codex CLI / App or Claude CLI / Desktop, ensure it has permission to curl + access network inside the sandbox, and it’ll work fine.

    Codex can crank out thousands of lines of bug-free high complexity systems programming code from a single well-designed prompt. Use ‘high’ reasoning mode if you’re trying to build anything real.

    Don’t worry though, you’re not alone: most working programmers haven’t yet realized there was a capabilities step change as of 5.2, i.e. December.

  2. Maybe the billionaire bosses of our AI overlords have read and implemented Brave New World:

    > The classes are Alpha (planned leaders), Beta, Gamma, Delta, and Epsilon (menial labourers of limited intelligence). Each caste is indoctrinated, largely by sleep-conditioning, to prefer their own class—epsilons are happy that they do not have the intellectual burden of alphas—and wears a uniform colour of clothing for easy identification. [1]

    Looks like the AIs you have access to are the Epsilons of the AI world, for public consumption. Alphas probably don’t access the whole file either; bosses are typically clueless about technical details. Betas, only available to the billionaires controlling the AIs, most likely have access to the HTTP headers and decades of privacy-invasive web telemetry. [Cue villain laughing maniacally and steepling his/her/xher fingers.]

    [1] https://en.wikipedia.org/wiki/Brave_New_World#Plot
