You can't parse XML with regex. Let's do it anyways

A deep dive into why parsing XML or HTML with regular expressions (regex) is generally discouraged, yet sometimes practical.

---

Introduction

It's a common mantra: "You cannot parse HTML/XML with regex," because of their nested, complex structure. The author acknowledges this but explores scenarios where using regex might be justifiable or useful, especially for scraping.

---

What is XML?

XML is a markup language designed for storing, transmitting, and reconstructing data. Key properties:

- Markup language: Defines a strict, specific document structure (unlike JSON or TOML).
- Machine-readable: Designed to be parsed into a tree.
- Human-readable: Can be understood and inspected without special tools.

However, XML is horribly complex:

- The XML 1.0 specification is 59 pages, not including extensions.
- This complexity can lead to security vulnerabilities (like XML external entity attacks).
- Novices often attempt regex parsing because they are unfamiliar with XML's full depth.

---

XML Parsing Example: A Simple Stack-Based Parser in Bash

The blog presents a simplified stack-based parser concept:

- It uses a stack to track nested XML tags.
- It demonstrates how a shell script can "walk" an XML-like tree and extract data based on queries.
- Actual parsing is more complicated due to XML's many special cases.

Key takeaway: Building even a naive parser from scratch is non-trivial.

---

How Humans Parse XML

Raw XML looks like a compact, unreadable string. Humans "pretty-format" XML to visualize its hierarchical relationships. Despite its apparent simplicity, XML is fundamentally a string that must be understood structurally to be interpreted properly.

---

HTML is Like XML, But Quirkier

HTML is the dominant web markup language but is less strict than XML. Browsers tolerate malformed HTML (e.g., missing closing tags) to improve accessibility. The HTML standard (the "living standard") is extensive and complex (~1500 pages), and much of it defines edge cases.
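To make the stack-based parser idea from the XML parsing section concrete, here is a minimal sketch in POSIX shell. It is not the blog's script. The function name `xml_get`, the slash-separated query syntax, and the sample document are all assumptions made for illustration. The "stack" is kept as a path string: an opening tag appends a segment, a closing tag removes the last one. The sketch assumes well-formed input and ignores entities, CDATA, comments containing `>`, and mismatched closing tags, which is exactly the kind of special case a real parser has to handle.

```shell
# Hypothetical helper: print the text of every element whose path equals the query.
# $1 = XML-like string, $2 = query path such as /config/server/port
# Note: POSIX sh has no local variables, so these names leak into the caller.
xml_get() {
    rest=$1
    query=$2
    path=
    while [ -n "$rest" ]; do
        case $rest in
            '<'*'>'*)
                # Cut the next tag off the front of the input.
                tag=${rest#<}
                tag=${tag%%>*}
                rest=${rest#*>}
                case $tag in
                    '?'*|'!'*|*/) ;;             # declaration, doctype/comment, self-closing: stack unchanged
                    /*) path=${path%/*} ;;       # closing tag: pop (name is not checked)
                    *)  path=$path/${tag%% *} ;; # opening tag: push the name, dropping attributes
                esac
                ;;
            '<'*)
                return 1                         # '<' without a matching '>': give up
                ;;
            *)
                # Text up to the next tag.
                text=${rest%%<*}
                rest=${rest#"$text"}
                if [ "$path" = "$query" ]; then
                    printf '%s\n' "$text"
                fi
                ;;
        esac
    done
    return 0
}

# Sample document (made up for this example).
doc='<config><server><host>example.org</host><port>8080</port></server></config>'
xml_get "$doc" /config/server/port   # prints 8080
```

Even this toy version needs separate branches for declarations, self-closing tags, and truncated input, which illustrates the takeaway that a naive parser is non-trivial.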
---

XHTML: The Strict HTML Alternative

XHTML combines HTML with XML's strictness. It was proposed around 2000 but never widely adopted. XHTML5 exists and is usable but remains niche.

---

Parsing HTML with Regex: When and Why

Purists say regex should never be used on HTML/XML. However, scraping practicalities force compromises.

Benefits of Using Regex for Scraping

- Development Speed: Writing a quick regex is faster than creating complex DOM queries. This is particularly useful when dealing with deeply nested or obfuscated markup.
- Adaptability: Regex can extract data from inconsistently structured HTML, since it's easier to anchor on unique text snippets than on precise selectors. Example: scraping a train station schedule where the markup can change slightly but the textual anchors remain.
- Simplicity in Specific Tasks: Some tasks (like extracting key-value pairs) are straightforward with regex but cumbersome with selectors.

---

Regex Tips

- Decide if regex is right for your job (it is best for scraping irregular HTML).
- Avoid fully parsing the tree with regex; focus on data extraction.
- Use PCRE (Perl-Compatible Regular Expressions) for powerful features like non-greedy matching (.*?).
- If PCRE isn't available, emulate non-greedy behavior with more complex substitutions.
- Match the tightest possible character sets.
- Anchor regex to unique text, not markup.
- Use whitespace wisely to simplify matching.
- Request pages with tools like curl instead of browsers for automation.
- Expect your scraping to break eventually; build fail-safes and notifications.

---

Example Scraper

The post provides a fully annotated Bash script that scrapes OpenRCT2's download page. It demonstrates how to:

- Download HTML.
- Split content based on anchor phrases.
- Use grep and sed to extract versions, platforms,