from_jsonl_to_parquet.py

Purpose: Convert JSONL event files to Parquet format

Use case: Preparing data for analysis with pandas, polars, or other data science tools

Features:

  • Merges multiple JSONL files into a single Parquet output
  • Adds UNIX timestamp columns in seconds (*_s) alongside the original microsecond columns (*_us)
  • Adds a bme280_valid column that flags rows whose readings fall within the BME280 hardware spec
  • Configurable compression (snappy, gzip, zstd, none)
  • Preserves all dynamic fields from event data
  • Output filename is fixed to run.parquet in the input directory
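
The two derived columns can be pictured with a small standard-library sketch. This is an illustration, not the script's actual code. The sensor field names (temperature, pressure, humidity) and both helper functions are hypothetical, and the seconds are assumed to be floats. The limits are the BME280 datasheet operating ranges, which may not be the exact bounds the script checks.

```python
# Assumed field names; the real event schema may differ.
# Ranges are the BME280 datasheet operating ranges.
BME280_LIMITS = {
    "temperature": (-40.0, 85.0),  # degrees Celsius
    "pressure": (300.0, 1100.0),   # hPa
    "humidity": (0.0, 100.0),      # % relative humidity
}


def add_seconds_columns(row: dict) -> dict:
    """Return a copy of row with a *_s value for every numeric *_us value."""
    out = dict(row)
    for key, value in row.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key.endswith("_us") and is_number:
            # Microseconds to seconds; float output is an assumption.
            out[key[:-3] + "_s"] = value / 1_000_000
    return out


def bme280_valid(row: dict) -> bool:
    """True only if every BME280 reading is present and within its range."""
    for field, (low, high) in BME280_LIMITS.items():
        value = row.get(field)
        if value is None or not (low <= value <= high):
            return False
    return True


event = {
    "received_us": 1_766_275_200_500_000,
    "temperature": 21.4,
    "pressure": 1013.2,
    "humidity": 45.0,
}
row = add_seconds_columns(event)
row["bme280_valid"] = bme280_valid(row)
print(row["received_s"], row["bme280_valid"])
```

The original *_us value stays in the row, matching the "alongside" behaviour described in the feature list.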

Required input: JSONL event files produced by get_events.py or get_runs.py
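
As a rough picture of how several JSONL files might be merged before writing, the sketch below globs a directory with the default pattern, parses each non-empty line as JSON, and collects the union of keys so that fields present in only some events are not dropped. The helpers read_events and all_fields are hypothetical names for illustration, and sorted file order is an assumption. The real script may read and order files differently.

```python
import json
import tempfile
from pathlib import Path


def read_events(read_from, pattern="events*.jsonl"):
    """Load every JSON line from files matching pattern, in sorted file order."""
    rows = []
    for path in sorted(Path(read_from).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append(json.loads(line))
    return rows


def all_fields(rows):
    """Union of keys across rows, in first-seen order."""
    fields = {}
    for row in rows:
        for key in row:
            fields.setdefault(key, None)
    return list(fields)


# Demo with throwaway files; notes.jsonl does not match the default pattern.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "events_001.jsonl").write_text(
        '{"id": 1, "received_us": 10}\n', encoding="utf-8"
    )
    Path(tmp, "events_002.jsonl").write_text(
        '{"id": 2, "humidity": 45.0}\n', encoding="utf-8"
    )
    Path(tmp, "notes.jsonl").write_text('{"id": 99}\n', encoding="utf-8")
    rows = read_events(tmp)
    fields = all_fields(rows)
```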

Usage:

# Default pattern (events*.jsonl)
uv run examples/from_jsonl_to_parquet.py 20251221_run126/

# Custom pattern
uv run examples/from_jsonl_to_parquet.py 20251221_run126/ --pattern "*.jsonl"

# Higher compression
uv run examples/from_jsonl_to_parquet.py 20251221_run126/ --compression gzip

# With verbose output
uv run examples/from_jsonl_to_parquet.py 20251221_run126/ --verbose

# Overwrite existing output
uv run examples/from_jsonl_to_parquet.py 20251221_run126/ --overwrite

CLI Options:

Option                                  Default        Description
READ_FROM                               (required)     Directory containing JSONL files
--pattern                               events*.jsonl  Glob pattern to filter input files
--compression                           snappy         Compression codec (snappy, gzip, zstd, none)
--add-timestamps / --no-add-timestamps  on             Add *_s columns from *_us fields
--overwrite                             off            Overwrite existing output file
--verbose / --quiet                     --quiet        Show or suppress status messages
--log-level                             error          Log level (debug/info/error)

Output file: READ_FROM/run.parquet

  • All original fields preserved (31+ columns)
  • Additional *_s timestamp columns (received_s, sent_s, detected_s, gnss_time_s)
  • Additional bme280_valid column (bool)
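
For the analysis step, a minimal sketch of loading the output with pandas could look like this. It assumes pandas plus a Parquet engine (pyarrow or fastparquet) are installed. The helpers output_path, load_run, and valid_rows are hypothetical and not part of the script.

```python
from pathlib import Path


def output_path(read_from):
    """The converter writes a fixed filename inside the input directory."""
    return Path(read_from) / "run.parquet"


def load_run(read_from, columns=None):
    """Load run.parquet into a DataFrame, optionally restricted to some columns."""
    import pandas as pd  # lazy import; needs pyarrow or fastparquet installed

    return pd.read_parquet(output_path(read_from), columns=columns)


def valid_rows(frame):
    """Keep only rows flagged as within the BME280 spec."""
    return frame[frame["bme280_valid"]]
```

Calling load_run("20251221_run126/", columns=["received_s", "bme280_valid"]) and passing the result to valid_rows would give the in-spec rows with their receive time in seconds. Reading only the needed columns keeps memory use down on large runs.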