
Data Formatting & Processing Basics: CSV, JSON, XML, Excel Explained

CSV, JSON, XML, and Excel each serve a distinct role in data engineering. CSV is the simplest tabular exchange format; JSON is the standard for web APIs; XML handles document-centric and namespace-rich data; Excel is the de facto spreadsheet for business workflows. Knowing which to use — and how to validate, convert, and transform each — is a core practical skill.

Part of the Data Engineering & Processing collection.
TL;DR
  • CSV — flat tabular rows separated by commas or tabs. Best for bulk data export, spreadsheet imports, and ETL pipelines. Fragile with commas in values unless quoted correctly.
  • JSON — hierarchical key-value format that maps directly to objects in most languages. The default choice for REST APIs, configs, and structured data exchange.
  • XML — document-oriented markup with namespaces, attributes, and schema support (XSD). Still common in SOAP APIs, enterprise integrations, and document formats like SVG and OOXML.
  • Excel (.xlsx) — rich spreadsheet format with multiple sheets, formulas, and formatting. The dominant format for business data delivery; requires a parser library to process programmatically.
  • Validation ≠ formatting ≠ conversion. Validation checks that data conforms to a format. Formatting changes how it looks. Conversion changes the format entirely.
  • Most data issues in real pipelines are not algorithmic — they are format wrangling: mismatched delimiters, trailing commas, encoding mismatches, null representations, and nested-to-flat transformations.

What "Data Formatting" and "Data Processing" Mean

The terms are often used interchangeably, but they describe different operations:

Formatting

Changing the visual presentation of data without altering its meaning or structure. Pretty-printing minified JSON, adding consistent indentation, normalising whitespace.

Example: {"a":1} → pretty-printed JSON with 2-space indent
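In Python, this round trip is just parse and re-serialise — a minimal sketch using only the standard library:

```python
import json

minified = '{"a":1}'
# Parsing and re-serialising changes only the presentation, not the data
pretty = json.dumps(json.loads(minified), indent=2)
print(pretty)
# The round trip is lossless: both strings parse to the same object
assert json.loads(pretty) == json.loads(minified)
```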

Validation

Checking that data conforms to a specified format or schema. Well-formedness checks (is this valid XML?) and schema checks (does this CSV have the right columns?) are both forms of validation.

Example: verifying all rows in a CSV have the same number of columns and correct types
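A column-count check along these lines can be sketched with Python's csv module (the function name and error format here are illustrative, not a standard API):

```python
import csv
import io

raw = "id,name,amount\n1,Alice,1200.50\n2,Bob\n"  # second data row is short

def validate_csv(text, expected_cols):
    """Return a list of error messages; an empty list means the CSV passed."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != expected_cols:
        errors.append(f"header mismatch: {header}")
    # Data rows start at physical line 2 (line 1 is the header)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(expected_cols):
            errors.append(f"line {lineno}: expected {len(expected_cols)} fields, got {len(row)}")
    return errors

print(validate_csv(raw, ["id", "name", "amount"]))
```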

Conversion

Transforming data from one format to another — JSON to CSV, XML to JSON, Excel to CSV. Conversion implies a structural change, not just cosmetic.

Example: flattening nested JSON into rows for a CSV spreadsheet
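For records that are already flat, conversion to CSV is a direct key-to-column mapping — a minimal sketch with Python's standard library:

```python
import csv
import io
import json

records = json.loads('[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]')

buf = io.StringIO()
# DictWriter maps each record's keys onto the declared column order
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```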

Data processing is a broader term covering any operation on data: filtering, sorting, aggregating, cleaning, transforming, or loading it into another system. In practice, most preprocessing work before a pipeline run consists of the three operations above: validate the input, format it consistently, convert it to the target format.

Common Data Formats: An Overview

CSV — Comma-Separated Values

CSV is a plain-text format for tabular data. Each line represents a row; fields within a row are separated by a delimiter (usually a comma, sometimes a tab or pipe). The first row is typically a header row with column names.

CSV has no native types — everything is a string. Parsers infer types, which means a column of 01234 may lose its leading zero when read as an integer. Fields that contain the delimiter must be quoted; fields that contain quotes must escape them — and inconsistent quoting is the #1 source of CSV parse failures.

Example

id,name,city,amount
1,Alice,New York,1200.50
2,Bob,"Portland, OR",980.00

Where it's used: database exports, BI tool exports, spreadsheet exchange, ETL pipelines, ML dataset distribution, bank statements.

Watch out for: commas in values (quoting required), different line endings (CRLF vs LF), BOM characters from Excel, and inconsistent null representations (empty string, NULL, N/A, \N).
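A proper CSV parser handles the quoting for you. Feeding the example above through Python's csv module shows the quoted field surviving intact — and, because every field comes back as a string, a leading zero would survive too until you explicitly cast:

```python
import csv
import io

raw = 'id,name,city,amount\n1,Alice,New York,1200.50\n2,Bob,"Portland, OR",980.00\n'

rows = list(csv.reader(io.StringIO(raw)))
# The comma inside the quoted field is data, not a delimiter
print(rows[2])
```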

JSON — JavaScript Object Notation

JSON is a lightweight, human-readable format for structured data. It supports objects (key-value maps), arrays, strings, numbers, booleans, and null — a richer type system than CSV. It maps directly to native data structures in JavaScript, Python, Go, and most other languages.

Example

{
  "id": 1,
  "name": "Alice",
  "address": { "city": "New York", "zip": "10001" },
  "tags": ["vip", "active"]
}

Where it's used: REST API request/response bodies, configuration files (package.json, tsconfig.json), NoSQL document stores (MongoDB, Firestore), log shipping (JSON Lines).

Watch out for: trailing commas (not valid in JSON, valid in JavaScript), single-quoted strings (invalid — must use double quotes), comments (not supported in JSON), large numbers losing precision (JavaScript's 64-bit float limit).
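The first two pitfalls are easy to demonstrate: a strict parser such as Python's json module rejects both outright (the helper function here is illustrative):

```python
import json

def is_valid_json(text):
    """Return True if the text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"a": 1}'))    # valid
print(is_valid_json('{"a": 1,}'))   # trailing comma: rejected
print(is_valid_json("{'a': 1}"))    # single quotes: rejected
```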

XML — Extensible Markup Language

XML is a document markup language designed for structured, mixed-content data. Unlike JSON, XML supports attributes on elements, namespaces for disambiguation, comments, and processing instructions. It has a rich validation ecosystem (XSD schemas, DTDs, Schematron) and query languages (XPath, XQuery, XSLT).

Example

<?xml version="1.0" encoding="UTF-8"?>
<user id="1">
  <name>Alice</name>
  <address>
    <city>New York</city>
    <zip>10001</zip>
  </address>
</user>

Where it's used: SOAP web services, RSS/Atom feeds, SVG, OOXML (.xlsx, .docx internally), Android layouts, Maven/Gradle build configs, XBRL financial data, enterprise middleware (BPEL, WSDL).

Watch out for: unclosed tags, case-sensitive element names, character encoding declarations, and special characters that must be escaped (a literal & must be written as &amp;).
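Parsing the example above with Python's xml.etree.ElementTree shows the pieces JSON lacks — attribute access and path-based element lookup — and the automatic escaping of special characters on serialisation:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" encoding="UTF-8"?>
<user id="1">
  <name>Alice</name>
  <address><city>New York</city><zip>10001</zip></address>
</user>"""

root = ET.fromstring(doc)
print(root.get("id"))                  # attribute, not child element
print(root.find("address/city").text)  # path-based lookup

# & is escaped as &amp; on serialisation, and unescaped on parse
note = ET.fromstring("<note>AT&amp;T</note>")
print(note.text)  # AT&T
```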

For a deep dive on JSON vs XML trade-offs, see the JSON vs XML comparison.

Excel (.xlsx / .xls)

Excel workbooks are binary or ZIP-based archives containing one or more sheets with typed cells, formulas, formatting, charts, and pivot tables. The modern .xlsx format is actually a ZIP archive of XML files (OOXML). Reading an Excel file programmatically requires a parser library (SheetJS in the browser, openpyxl in Python, EPPlus in .NET).

Where it's used: financial reports, analyst data delivery, configuration tables, client-supplied data, BI tool exports, HR systems.

Watch out for: dates stored as serial numbers (Excel epoch starts Jan 1, 1900), leading zeros dropped on numeric-looking strings, formula results vs raw values, hidden columns or sheets, merged cells that break row/column logic.
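The date-serial pitfall can be handled with plain datetime arithmetic. This sketch uses the conventional 1899-12-30 anchor, which compensates for Excel's deliberate bug of treating 1900 as a leap year (so it is correct for serials from 61 onward, i.e. dates after Feb 1900):

```python
from datetime import datetime, timedelta

# Anchor chosen so that serial arithmetic lines up with real dates
# despite the phantom Feb 29, 1900 in Excel's calendar.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel date serial number to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

print(excel_serial_to_date(25569))  # serial 25569 is the Unix epoch, 1970-01-01
```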

Validation, Formatting, and Conversion: Clear Distinctions

  • Validation — checks conformance to a format or schema; produces pass/fail plus errors. Changes structure: no. Changes format: no.
  • Formatting / pretty-printing — changes presentation (indentation, whitespace, sorting); output is semantically equivalent. Changes structure: no. Changes format: no.
  • Conversion — transforms data from one format to another (JSON → CSV); may lose fidelity. Changes structure: sometimes. Changes format: yes.
  • Transformation — restructures data within the same format (flatten nested JSON, filter rows); same format, different shape. Changes structure: yes. Changes format: no.
  • Cleaning — fixes data quality issues: removes duplicates, normalises nulls, trims whitespace, standardises casing. Changes structure: sometimes. Changes format: no.

In practice, a single pipeline stage often chains these: validate the input CSV, clean null values, convert to JSON, then flatten to a tabular structure. Each step is a distinct operation with distinct failure modes.

Typical Data Workflows

SaaS Export → CSV Cleanup → Import

The most common data pipeline for non-engineering teams. A CRM, analytics platform, or business intelligence tool exports data to CSV. Before the data is usable elsewhere:

  1. Validate the CSV structure — check delimiter, encoding, expected columns.
  2. Clean nulls — decide whether to drop rows with missing values or fill them.
  3. Deduplicate on a key column (e.g., customer ID).
  4. Rename or extract columns to match the target schema.
  5. Convert to the target format (JSON for an API, SQL INSERT for a database seed).
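Steps 2–5 can be sketched in a few lines of standard-library Python. The column names (customer_id, full_name) and the target schema here are hypothetical, chosen only to illustrate the shape of the pipeline:

```python
import csv
import io
import json

NULLS = {"", "NULL", "N/A", "\\N"}  # common null spellings, per the article

raw = ("customer_id,full_name,amount\n"
       "1,Alice,1200.50\n"
       "2,Bob,N/A\n"
       "1,Alice,1200.50\n")  # duplicate of row 1

rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Clean nulls: here we drop any row containing a null-like value
rows = [r for r in rows if not any(v in NULLS for v in r.values())]

# 3. Deduplicate on the key column, keeping the first occurrence
seen, deduped = set(), []
for r in rows:
    if r["customer_id"] not in seen:
        seen.add(r["customer_id"])
        deduped.append(r)

# 4. Rename columns to match the target schema
renamed = [{"id": r["customer_id"], "name": r["full_name"], "amount": r["amount"]}
           for r in deduped]

# 5. Convert to the target format (JSON here)
print(json.dumps(renamed))
```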

Tools: CSV Format Validator · CSV Null Value Handler · CSV Deduplicator · CSV Column Renamer · CSV to JSON

JSON API Response → Format → Diff → Flatten → CSV

Common when debugging API integrations, reviewing webhook payloads, or feeding an API response into a spreadsheet or columnar data store:

  1. Format the raw JSON for readability (especially if it arrived minified).
  2. Diff two versions of the payload to identify changes between environments or releases.
  3. Flatten nested objects into dot-notation keys so the structure is tabular.
  4. Export to CSV for spreadsheet analysis or bulk import.
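The flattening step (3) can be sketched as a small recursive function — one of several reasonable conventions; here nested objects become dot-notation keys and list items are addressed by index:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys; list items get numeric indices."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, f"{path}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    out.update(flatten(item, f"{path}.{i}."))
                else:
                    out[f"{path}.{i}"] = item
        else:
            out[path] = value
    return out

payload = json.loads('{"id": 1, "address": {"city": "New York"}, "tags": ["vip", "active"]}')
print(flatten(payload))
```

Once every record is a flat dict, the CSV export in step 4 is the same DictWriter pattern used for any tabular conversion.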

Tools: JSON Formatter · JSON Diff · JSON Flattener · JSON to CSV

XML Feed → Validate → Convert to JSON/CSV

Typical for RSS/Atom feed processing, SOAP response handling, or legacy enterprise integration:

  1. Validate well-formedness — catch parse errors before processing (missing closing tags, illegal characters).
  2. Use XPath to extract the specific nodes or attributes you need.
  3. Convert to JSON for modern processing, or to CSV for bulk tabular analysis.
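Steps 2 and 3 together, sketched with ElementTree (which supports a limited subset of XPath for node selection); the feed content and URLs are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

feed = """<rss><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(feed)
# .//item selects every <item> element anywhere under the root
items = [{"title": item.findtext("title"), "link": item.findtext("link")}
         for item in root.findall(".//item")]
print(json.dumps(items, indent=2))
```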

Tools: XML Validator · XPath Validator · XML to JSON

Choosing the Right Format

Use CSV when…

• Data is tabular with uniform rows and a fixed schema

• The consumer is a spreadsheet, database import, or ML pipeline

• File size matters and nesting is not required

• Human editability in Excel or Google Sheets is needed

Use JSON when…

• Building a REST or GraphQL API

• Data has nested structure (objects within objects)

• Config files need to be human-readable and machine-parseable

• Working in JavaScript/TypeScript (native serialisation)

Use XML when…

• Integrating with a legacy SOAP service or enterprise bus

• Data is document-centric (mixed text and markup)

• You need namespace support or XSD schema validation

• Consuming/publishing RSS or Atom feeds

Use Excel when…

• Handing off data to non-technical stakeholders

• Multiple related tables need to live in one file (sheets)

• Formulas, charts, or pivot tables are part of the deliverable

• The receiving system only accepts .xlsx (common in enterprise software)

Common Mistakes

Not validating before converting

Converting an invalid file produces garbage output silently. Always validate first so you get a meaningful error rather than a corrupt result.

Assuming CSV is "just commas"

CSV has quoting rules, encoding requirements, and delimiter variants. A CSV with addresses or prices that contain commas will break any naive parser that ignores quoting.

Using JSON comments

JSON does not support comments. Adding // or /* */ makes the file invalid JSON. Use JSONC or a different config format (YAML, TOML) if you need inline documentation.

Treating JSON numbers as arbitrary precision

JavaScript represents all numbers as 64-bit floats. Integers larger than 2^53 lose precision when parsed with JSON.parse(). Use strings or a BigInt-aware library for large IDs.
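The collapse is easy to verify. Python's own json module parses integers exactly, so the precision loss only bites when a float-based consumer (JavaScript, or Python code that casts to float) handles the value:

```python
import json

# A 64-bit float cannot distinguish adjacent integers above 2**53
big = 2**53
print(float(big) == float(big + 1))  # the two values collapse to one

# Python parses the integer exactly; JSON.parse in JavaScript would not
parsed = json.loads('{"id": 9007199254740993}')["id"]  # 2**53 + 1
print(parsed)
```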

Ignoring character encoding

CSV files from Windows software often use Windows-1252 encoding. If you process them as UTF-8, accented characters become garbage. Check encoding before parsing.
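A one-byte example shows the failure mode: the Windows-1252 encoding of an accented character is not a valid UTF-8 sequence, so a strict UTF-8 decode fails outright (lenient decoders substitute garbage instead):

```python
# 'é' in Windows-1252 is the single byte 0xE9, which is illegal as UTF-8
data = "café".encode("cp1252")

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)

# Decoding with the correct codec recovers the text
print(data.decode("cp1252"))
```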

Converting XML to JSON naively

XML has attributes and mixed content (text + child elements) that have no direct JSON equivalent. A naive converter may silently lose attributes or produce inconsistent structures for arrays with one vs multiple elements.

Try the Tools

All tools run entirely in your browser. No data is uploaded. Browse all data tools →
