Why Johnny LLM can’t read web page source code

First of a series of posts by John Morgan and Philip Greenspun on using AI models to perform HTML and CSS design.

The plan: Compare ChatGPT, Gemini, Claude, and Grok to see which LLMs do the best job on CSS and, if necessary, HTML redesign of existing web pages.

All of these models failed at the simplest imaginable HTML/CSS question (the URLs were plain text in our queries; they are hyperlinked here for convenience):

Do all four of these HTML pages reference the same .css file? https://philip.greenspun.com/samantha/ https://philip.greenspun.com/samantha/samantha-X https://philip.greenspun.com/photography/ https://philip.greenspun.com/photography/learn/interior

All four pages actually do reference the same .css file, /margins-and-ads.css, and one page (samantha-X) also references /philg.css.

Sample actual page content (the <head> sections of the interior and samantha-X pages, identifiable by their titles):

<head>
<meta name="keywords" content="interior photography">
<LINK REL=STYLESHEET HREF="/margins-and-ads.css" TYPE="text/css">

<title>How to Photograph Architecture (Interior)
</title>
</head>

<head>
<link rel=stylesheet href="/margins-and-ads.css" type="text/css">
<LINK REL=STYLESHEET HREF="/philg.css" TYPE="text/css">
<title>Chapter X:  Overcharged in Katmai</title>
<meta name=processor content=adp>
</head>
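
For comparison, the question is easy to answer mechanically once the raw source is in hand. Here is a minimal sketch using only Python's standard-library html.parser, run against the two <head> sections shown above. The class and function names are our own for illustration, and attributing each <head> to a page by its title is an assumption.

```python
from html.parser import HTMLParser


class StylesheetLinkParser(HTMLParser):
    """Collect href values from <link rel=stylesheet> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values,
        # so REL=STYLESHEET arrives as ("rel", "STYLESHEET").
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def stylesheet_hrefs(html):
    """Return the stylesheet hrefs found in an HTML string, in order."""
    parser = StylesheetLinkParser()
    parser.feed(html)
    return parser.hrefs


# The two <head> sections quoted in this post (page attribution assumed).
INTERIOR_HEAD = """<head>
<meta name="keywords" content="interior photography">
<LINK REL=STYLESHEET HREF="/margins-and-ads.css" TYPE="text/css">

<title>How to Photograph Architecture (Interior)
</title>
</head>"""

SAMANTHA_X_HEAD = """<head>
<link rel=stylesheet href="/margins-and-ads.css" type="text/css">
<LINK REL=STYLESHEET HREF="/philg.css" TYPE="text/css">
<title>Chapter X:  Overcharged in Katmai</title>
<meta name=processor content=adp>
</head>"""

print(stylesheet_hrefs(INTERIOR_HEAD))    # ['/margins-and-ads.css']
print(stylesheet_hrefs(SAMANTHA_X_HEAD))  # ['/margins-and-ads.css', '/philg.css']
```

To check the live pages, the same function could be fed the text returned by urllib.request.urlopen for each URL. The point is that the difficulty lies in getting the raw source, not in reading it.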

from ChatGPT

I checked the content of the four pages you listed, but from the snippets returned there is no visible <link rel=”stylesheet” …> line in the parts shown — that’s because the portions we could fetch start at the body text and don’t include the <head> section where CSS is normally referenced. … If you want precise verification, I can fetch and show the full <head> source (including the exact <link …> line) for each of the four URLs — just let me know.

[after “Yes please fetch the full HEAD source for each one and verify.”]

I tried to fetch the full HTML <head> source of each of the four pages you listed — specifically to look for the <link rel=”stylesheet” …> tag that shows which CSS file (if any) each page references. However, the tool returns only the rendered body content and not the complete <head> section, so I couldn’t directly inspect the <link> tags for CSS in the fetched results.

from Claude

I should note that the web_fetch tool extracts text content and may not always show the complete HTML <head> section where CSS references typically appear. The pages I did fetch appear to be very minimalist in design, consistent with early-to-mid 1990s web design when these pages were originally created.

Without being able to see the complete HTML source including the <head> sections of all four pages, I cannot definitively confirm whether they all reference the same CSS file.

[prompted further]

Now I have all four pages. Let me examine them for CSS file references. Looking at the extracted HTML content from all four pages, I notice that none of them contain explicit <link> tags to external CSS files in the content that was extracted.

from Gemini

[Us: Google’s smartest child has trouble counting and trouble understanding HTML, but it is great at presentation. It says that three pages reference no external CSS and then shows a table featuring two pages that reference CSS. It is dead wrong about the interior page, which references margins-and-ads.css. It either cross-contaminated its own thinking by looking at the samantha-X page, which does reference philg.css, or found a reference to the file in the BODY of https://philip.greenspun.com/panda/html.]

No, they do not all reference the same CSS file. In fact, three of the pages appear to use no external CSS at all, while one of them uses a specific stylesheet for that section of the site.

Based on the source code for each page:

  1. https://philip.greenspun.com/samantha/
    • CSS: None.
    • Details: This page uses old-school HTML attributes (like bgcolor=white and text=black in the <body> tag) for styling rather than an external CSS file.
  2. https://philip.greenspun.com/samantha/samantha-X
    • CSS: None.
    • Details: Like the index page for this book, individual chapters are formatted using plain HTML and inline attributes.
  3. https://philip.greenspun.com/photography/
    • CSS: References /philg.css
    • Details: This page contains a <link rel="stylesheet" href="/philg.css" type="text/css"> tag in the <head> .
  4. https://philip.greenspun.com/photography/learn/interior
    • CSS: References /philg.css
    • Details: As part of the photography tutorial section, it shares the same stylesheet as the photography index.

Limitation Rationale

Why are these LLMs so hobbled? Apparently, the limitation is intentional: the fetch tool hands the LLM only what a human visitor would see on the rendered page. That helps the model make sense of dynamic sites, where content is assembled by scripts, and may also reduce some security risks.

When prompted, ChatGPT replied that this limitation is due to its web content fetching tool not returning the full payload and instead querying “search indexes, reader-mode extracts, preprocessed HTML snapshots, [and] content extraction pipelines” that are “optimized for reading and summarizing, not inspection.” It explained that “in those pipelines, the <head> is usually stripped entirely or reduced to a title and a few meta fields [because] CSS links, JS, analytics, preload hints, etc. are irrelevant to text extraction [and] keeping them increases size and complexity with no benefit for search/reading.” If true, this suggests that the tool was not built with full HTML analysis in mind but rather optimized for the admittedly more common text extraction use case.
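
To make the failure mode concrete, here is a toy sketch of the kind of text extraction ChatGPT describes. It is not any vendor's actual pipeline; the class name, the set of skipped tags, and the sample page body are all our own assumptions. It shows that once <head> is dropped, the stylesheet reference is simply not in the text the model receives.

```python
from html.parser import HTMLParser


class ReaderModeExtractor(HTMLParser):
    """Toy text extractor: keeps visible text, drops <head>, scripts, styles."""

    SKIPPED = {"head", "script", "style"}  # assumed list, for illustration

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    extractor = ReaderModeExtractor()
    extractor.feed(html)
    return "\n".join(extractor.chunks)


# The <head> lines are from the post; the body is a made-up placeholder.
PAGE = """<html><head>
<link rel=stylesheet href="/margins-and-ads.css" type="text/css">
<title>Chapter X:  Overcharged in Katmai</title>
</head>
<body>
<h2>Chapter X</h2>
<p>Body text that a reader would see.</p>
</body></html>"""

text = extract_text(PAGE)
print(text)
print("margins-and-ads.css" in text)  # False: the <link> never reaches the model
```

A model working from this output has nothing to inspect, which is consistent with the "no visible <link>" answers quoted above.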

Next Steps

Since our future AI overlords won’t download a web page from the Internet and click “view source”, the next step was to see what these LLMs could do if we downloaded the HTML and CSS files to our desktop computers and then uploaded them to the LLMs along with our questions.
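
The download half of that workflow can also be scripted. Below is a minimal sketch, assuming Python's standard urllib and a flat naming scheme of our own invention (local_name and save_sources are hypothetical helpers, not part of any tool mentioned here). It saves the raw bytes of each URL, <head> included, ready to upload.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_name(url):
    """Turn a URL into a flat file name, e.g. /samantha/ -> samantha.html."""
    path = urlparse(url).path.strip("/")
    stem = path.replace("/", "_") or "index"
    # Keep an existing extension such as .css; otherwise assume HTML.
    return stem if Path(stem).suffix else stem + ".html"


def fetch_bytes(url):
    """Return the raw response body exactly as the server sent it."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_sources(urls, dest_dir, fetch=fetch_bytes):
    """Download each URL into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / local_name(url)
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved


if __name__ == "__main__":
    save_sources(
        [
            "https://philip.greenspun.com/samantha/",
            "https://philip.greenspun.com/samantha/samantha-X",
            "https://philip.greenspun.com/photography/",
            "https://philip.greenspun.com/photography/learn/interior",
            "https://philip.greenspun.com/margins-and-ads.css",
        ],
        "llm-upload",  # hypothetical output directory
    )
```

The fetch parameter exists so the function can be exercised without network access; in normal use the default is enough.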
