# LLMs are still surprisingly bad at some simple tasks

Terence Eden explores the surprising shortcomings of large language models (LLMs) when asked a straightforward factual question: which TLDs have the same name as valid HTML5 elements?

---

## Summary of the Issue

The question is simple: compare two lists (valid HTML5 element names and top-level domains, TLDs). The author had previously done this manually, confirming that some TLDs exactly match valid HTML5 elements.

- Expectation: LLMs should easily perform this task.
- Reality: LLMs perform poorly, missing matches, including incorrect answers, or providing irrelevant data.

---

## Evaluation of Three Major LLMs

### ChatGPT (OpenAI)

- Presented six TLDs matching HTML5 elements (e.g., .audio, .link, .menu).
- Mistakes: omitted several correct matches and included .code, which does not exist as a TLD (there is .codes, which is different).
- Result: incomplete and inaccurate.

### Google Gemini

- Provided dozens of HTML5 elements without cross-referencing actual TLDs, returning a list of HTML element tags instead of TLDs.
- Result: no relevant answer; failed completely.

### Claude (Anthropic)

- Gave a partially correct list including .audio, .video, .data, .link, .menu, .style, and .select.
- Missed additional correct matches and added speculative items such as .app and .art, which are not HTML5 elements.
- Result: partially correct, but padded with irrelevant answers.

---

## Author's Perspective on AI and LLMs

- Finding the intersection of two lists is a basic task expected of even a moderately smart intern (a deterministic sketch of the comparison appears at the end of this summary).
- The author is skeptical of the hype around LLMs and AI: output often appears plausible but is mostly "garbage" on detailed factual tasks.
- People's differing experiences with AI reflect their familiarity with the domain.
- AI's plausibility exploits human cognitive biases such as the Barnum Effect, generating seemingly correct but often inaccurate content.
- The author calls for a new term, similar to Gell-Mann Amnesia, to describe how AI seems convincing to outsiders but falls apart when scrutinized by experts.

---

## Reader Comments Highlights

- The author addresses suggestions that the prompt was at fault: the task is factual, not ambiguous.
- The author rejects accusations of hating progress, or conspiracy theories that he is paid by industry to denounce AI.
- Commenters agree that LLMs need critical examination along the lines of Gell-Mann Amnesia applied to AI.

---

## Key Takeaways

- LLMs currently fail at some simple, verifiable data-comparison tasks.
- Results differ among the major AI services; none is fully reliable on this simple challenge.
- Hype about AI abilities should be tempered with critical evaluation.
- Users and developers need awareness of AI limitations, especially in domains requiring precision.

---

Source: Terence Eden's Blog - LLMs are still surprisingly bad at some simple tasks
Published: September 21, 2025 | ~650 words | 2 comments
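
---

## Appendix: Reproducing the Comparison Deterministically

To illustrate how mechanical the underlying task is, here is a minimal Python sketch of the comparison the LLMs fumbled. The IANA TLD list URL is real and publicly documented; the `HTML_ELEMENTS` set below is a hand-picked illustrative subset, not the full list, which would need to be populated from the element index in the WHATWG HTML Living Standard.

```python
# Fetch the authoritative TLD list from IANA and intersect it
# with a set of HTML element names.
from urllib.request import urlopen

IANA_TLDS = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# Illustrative subset of valid HTML element names (lowercase).
# A complete answer requires the full list from the WHATWG spec.
HTML_ELEMENTS = {
    "a", "audio", "video", "data", "link", "menu", "style",
    "select", "nav", "map", "time", "code", "main", "article",
}

def matching_tlds() -> set[str]:
    with urlopen(IANA_TLDS) as response:
        lines = response.read().decode("ascii").splitlines()
    # The first line of the IANA file is a version comment
    # starting with '#'; the rest are one uppercase TLD per line.
    tlds = {line.strip().lower() for line in lines if not line.startswith("#")}
    return tlds & HTML_ELEMENTS

if __name__ == "__main__":
    for name in sorted(matching_tlds()):
        print(f".{name}")
```

Whatever element-name set is plugged in, the intersection is exact and reproducible, which is precisely the kind of verifiable baseline the author argues LLM output should be checked against.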