# Best Practices

In this section we outline some suggested best practices to follow when using `earthmover`, based on our experience with the tool. Many of these are adapted from best practices for dbt, which `earthmover` resembles, although `earthmover` operates on dataframes rather than database tables.
## Project structure

A typical `earthmover` project might have a structure like this:

```
project/
├── README.md
├── sources/
│   ├── source_file_1.csv
│   ├── source_file_2.csv
│   └── source_file_3.csv
├── earthmover.yaml
├── output/
│   ├── output_file_1.jsonl
│   └── output_file_2.xml
├── seeds/
│   ├── crosswalk_1.csv
│   └── crosswalk_2.csv
└── templates/
    ├── json_template_1.jsont
    ├── json_template_2.jsont
    ├── xml_template_1.xmlt
    └── xml_template_2.xmlt
```
If your project lives in a repository, consider one of the following to avoid committing source data and output:

1. include a `.gitignore` or similar file in your project which excludes the `sources/` and `output/` directories from being committed to the repository
2. remove the `sources/` and `output/` directories from your project and update `earthmover.yaml`'s `sources` and `destinations` to reference another location outside the `project/` directory

When dealing with sensitive source data, you may have to comply with security protocols, such as referencing sensitive data from a network storage location rather than copying it to your own computer. In this situation, option 2 above is a good choice.
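A minimal `.gitignore` for the first approach might simply exclude those two directories:

```
sources/
output/
```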
To facilitate operationalization, we recommend using relative paths from the location of the `earthmover.yaml` file and using parameters to pass dynamic filenames to `earthmover`, instead of hard-coding them. For example, rather than

```yaml
config:
  output_dir: /path/to/outputs/
  ...
sources:
  source_1:
    file: /path/to/inputs/source_file_1.csv
    header_rows: 1
  source_2:
    file: /path/to/inputs/source_file_2.csv
    header_rows: 1
  seed_1:
    file: /path/to/seeds/seed_1.csv
  ...
destinations:
  output_1:
    source: $transformations.transformed_1
    ...
  output_2:
    source: $transformations.transformed_2
    ...
```
consider instead

```yaml
config:
  output_dir: ${OUTPUT_DIR}
  ...
sources:
  source_1:
    file: ${INPUT_FILE_1}
    header_rows: 1
  source_2:
    file: ${INPUT_FILE_2}
    header_rows: 1
  seed_1:
    file: ./seeds/seed_1.csv
  ...
destinations:
  output_1:
    source: $transformations.transformed_1
    ...
  output_2:
    source: $transformations.transformed_2
    ...
```
and then invoking `earthmover` with the parameters:

```bash
earthmover -p '{ "OUTPUT_DIR": "/path/to/outputs/", \
  "INPUT_FILE_1": "/path/to/source_file_1.csv", \
  "INPUT_FILE_2": "/path/to/source_file_2.csv" }'
```

This also allows partial runs; for example

```bash
earthmover -p '{ "OUTPUT_DIR": "/path/to/outputs/", \
  "INPUT_FILE_1": "/path/to/source_file_1.csv" }'
```

would produce only `output_1` if `source_2` had `required: False` (since `INPUT_FILE_2` is missing).
## Development practices

While YAML is a data format, it is best to treat the `earthmover` YAML configuration as code, meaning you should

- version it!
- avoid putting credentials and other sensitive information in the configuration; rather, specify such values as parameters
- keep your YAML DRY by using Jinja macros and YAML anchors and aliases
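For example, YAML anchors and aliases let you define repeated settings once and reuse them (a sketch; the file names and `header_rows` value are illustrative):

```yaml
# Define shared source settings once with an anchor (&csv_defaults)
# and reuse them via a merge key (<<: *csv_defaults).
sources:
  source_1: &csv_defaults
    file: ./sources/source_file_1.csv
    header_rows: 1
  source_2:
    <<: *csv_defaults
    file: ./sources/source_file_2.csv
```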
Remember that code is poetry: it should be beautiful! To that end

- Carefully choose concise, good names for your `sources`, `transformations`, and `destinations`.
  - Good names for `sources` could be based on their source file/table (e.g. `students` for `students.csv`)
  - Good names for `transformations` indicate what they do (e.g. `students_with_mailing_addresses`)
  - Good names for `destinations` could be based on the destination file (e.g. `student_mail_merge.xml`)
- Add good, descriptive comments throughout your YAML explaining any assumptions or non-intuitive operations (including complex Jinja).
- Likewise put Jinja comments in your templates, explaining any complex logic and structures.
- Keep YAML concise by consolidating `transformation` operations where possible. Many operations like `add_columns`, `map_values`, and others can operate on multiple `columns` in a dataframe.
- At the same time, avoid doing too much at once in a single `transformation`; splitting multiple `join` operations into separate transformations can make debugging easier.
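As a sketch of such consolidation, a single `map_values` operation can recode several columns at once instead of one operation per column (the transformation name, column names, and mapping below are hypothetical; check the earthmover operation reference for the exact option names):

```yaml
transformations:
  cleaned_students:
    source: $sources.source_1
    operations:
      # One map_values operation covering two columns at once,
      # rather than two near-identical operations:
      - operation: map_values
        columns:
          - is_enrolled
          - is_active
        mapping:
          "Y": "True"
          "N": "False"
```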
## Debugging practices

When developing your transformations, it can be helpful to

- specify `config` » `log_level: DEBUG` and `transformation` » `operation` » `debug: True` to verify the columns and shape of your data after each `operation`
- turn on `config` » `show_stacktrace: True` to get more detailed error messages
- avoid name-sharing for a `source`, a `transformation`, and/or a `destination` - this is allowed but can make debugging confusing
- install pygraphviz and turn on `config` » `show_graph: True`, then visually inspect your transformations in `graph.png` for structural errors
- use a linter/validator to validate the formatting of your generated data
You can remove these settings once your `earthmover` project is ready for operationalization.
## Operationalization practices

Typically `earthmover` is used when the same (or similar) data transformations must be done repeatedly. (A one-time data transformation task may be more easily done with SQLite or a similar tool.) When deploying/operationalizing `earthmover`, whether with a simple scheduler like cron or an orchestration tool like Airflow or Dagster, consider

- specifying conditions you `expect` your sources to meet, so `earthmover` will fail on source data errors
- specifying `config` » `log_level: INFO` and monitoring logs for phrases like "`distinct_rows` operation removed NN duplicate rows" and "`filter_rows` operation removed NN rows"
- using the structured run output flag and shipping the output somewhere it can be queried or drive a monitoring dashboard
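A source-level `expect` block might look like the sketch below; the column names and conditions here are hypothetical, and each condition is an expression that every row of the source must satisfy:

```yaml
sources:
  source_1:
    file: ${INPUT_FILE_1}
    header_rows: 1
    # earthmover fails the run if any row violates these conditions
    # (column names are hypothetical examples):
    expect:
      - student_id != ''
      - grade_level != ''
```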