A love letter to the CSV format (2024)

Published: Tuesday, 9 January 2024
Authors: Guillaume Plique (Research Engineer), Robin de Mourat (Research Designer)
Available: Markdown version on GitHub

---

Overview

This article defends the CSV (Comma-Separated Values) format against frequent claims that it has been made obsolete by newer data formats such as Parquet, newline-delimited JSON, or MessagePack. Rather than dismissing CSV, the authors outline its enduring strengths and explain why it remains a vital tool for data serialization.

---

Why CSV Endures

Simplicity

CSV’s specification is very simple: commas separate values, new lines separate rows, and a few rules govern quoting. The format is intuitive enough that a programmer might invent it independently.

Open and Collective Standard

No single owner or comprehensive standard governs CSV. RFC 4180 exists, but it is controversial and not definitive. CSV is a free, community-driven format whose rules rest on implicit consensus.

Plain Text Format

Like JSON, YAML, or XML, CSV is plain text rather than binary. It is human-readable, can be viewed and edited with any text editor, and can be processed directly with simple tools.

Efficient Streaming

CSV files can be read row by row without loading the entire file into memory, so very large datasets can be processed with minimal RAM. Columnar formats like Parquet, by contrast, require more complex buffering or random access. Columnar formats do excel at column-specific operations, which suits data analysis tools like R or pandas.

Appendable

Adding rows to a CSV file is trivial and efficient: new rows are simply appended to the end of the file. Columnar formats are designed for fast column operations, not row-wise appends.

Dynamic Typing Flexibility

CSV is dynamically typed: values carry no type information, so each programming language or tool is free to interpret them as it sees fit. This aids data interoperability, though it requires care to avoid misinterpreting data.
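
The streaming, appending, and dynamic-typing points above can be illustrated with a minimal Python sketch. It is not taken from the article: the file name, the sample rows, and the helper names `append_rows` and `stream_total` are assumptions made up for illustration, and only the standard `csv` module is used.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append rows to an existing CSV file without reading or rewriting it."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def stream_total(path, column):
    """Sum one column while holding only a single row in memory at a time."""
    total = 0.0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Every cell arrives as a string; the consumer chooses the type.
            total += float(row[column])
    return total


# Hypothetical sample data, written to a temporary file for the demo.
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["item", "price"], ["apple", "1.50"], ["pear", "2.25"]])

append_rows(path, [["plum", "0.75"]])  # existing rows are left untouched
total = stream_total(path, "price")
print(total)  # 4.5
```

The explicit `float(...)` call is where the dynamic typing shows: nothing in the file says `price` is a number, so a different consumer could just as well read the same column as text or as a decimal type.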

Succinctness

CSV writes its headers only once, which avoids the redundancy of JSON or XML, where keys are repeated for every record. The format stays concise and imposes minimal overhead.

Reversible and Parsable in Reverse

Remarkably, a CSV file reversed byte by byte is still a valid CSV file. This works because a double quote is escaped by doubling it, and the resulting sequence is a palindrome that reads the same in both directions. As a result, the last rows of a very large CSV file can be read efficiently without scanning the whole file, for example to resume an aborted process or to inspect the most recent entries.

Excel’s Difficulty with CSV is Ironic

The article jokes that Excel’s imperfect handling of CSV is itself a sign of the format’s strength and ubiquity.

---

Conclusion

CSV is far from dead. Despite criticism and the rise of more complex data formats optimized for specific use cases, CSV’s simplicity, openness, human-readability, streaming efficiency, and flexibility ensure it remains a "seemingly unkillable staple" of data serialization.

---

Additional Information

Médialab Sciences Po continues to support CSV and related tools, contributing to open-source projects like xsv. The article is part of Médialab’s ongoing engagement with digital tools and data research.

---

This article offers an insightful, affectionate defense of CSV that highlights overlooked features and practical advantages crucial for many data workflows.