# Best Practices

In this section we outline some suggested best practices to follow when using `earthmover`, based on our experience with the tool. Many of these are adapted from best practices for dbt, which `earthmover` resembles, although `earthmover` operates on dataframes rather than database tables.
## Project structure

A typical `earthmover` project might have a structure like this:

```
project/
├── README.md
├── sources/
│   ├── source_file_1.csv
│   ├── source_file_2.csv
│   └── source_file_3.csv
├── earthmover.yaml
├── output/
│   ├── output_file_1.jsonl
│   └── output_file_2.xml
├── seeds/
│   ├── crosswalk_1.csv
│   └── crosswalk_2.csv
└── templates/
    ├── json_template_1.jsont
    ├── json_template_2.jsont
    ├── xml_template_1.xmlt
    └── xml_template_2.xmlt
```
If your project lives in a repository, consider one of the following to avoid committing source data and output:

1. include a `.gitignore` or similar file in your project which excludes the `sources/` and `output/` directories from being committed to the repository
2. remove the `sources/` and `output/` directories from your project and update `earthmover.yaml`'s `sources` and `destinations` to reference another location outside the `project/` directory

When dealing with sensitive source data, you may have to comply with security protocols, such as referencing sensitive data from a network storage location rather than copying it to your own computer. In this situation, option 2 above is a good choice.
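A minimal `.gitignore` for the first approach might simply exclude those two directories:

```
sources/
output/
```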
To facilitate operationalization, we recommend using relative paths from the location of the `earthmover.yaml` file and using parameters to pass dynamic filenames to `earthmover`, instead of hard-coding them. For example, rather than

```yaml
config:
  output_dir: /path/to/outputs/
  ...
sources:
  source_1:
    file: /path/to/inputs/source_file_1.csv
    header_rows: 1
  source_2:
    file: /path/to/inputs/source_file_2.csv
    header_rows: 1
  seed_1:
    file: /path/to/seeds/seed_1.csv
  ...
destinations:
  output_1:
    source: $transformations.transformed_1
    ...
  output_2:
    source: $transformations.transformed_2
    ...
```
consider instead

```yaml
config:
  output_dir: ${OUTPUT_DIR}
  ...
sources:
  source_1:
    file: ${INPUT_FILE_1}
    header_rows: 1
  source_2:
    file: ${INPUT_FILE_2}
    header_rows: 1
  seed_1:
    file: ./seeds/seed_1.csv
  ...
destinations:
  output_1:
    source: $transformations.transformed_1
    ...
  output_2:
    source: $transformations.transformed_2
    ...
```
and then invoking `earthmover` with the parameters:

```bash
earthmover -p '{ "OUTPUT_DIR": "/path/to/outputs/", \
  "INPUT_FILE_1": "/path/to/source_file_1.csv", \
  "INPUT_FILE_2": "/path/to/source_file_2.csv" }'
```

This also allows partial runs; for example

```bash
earthmover -p '{ "OUTPUT_DIR": "/path/to/outputs/", \
  "INPUT_FILE_1": "/path/to/source_file_1.csv" }'
```

would produce only `output_1` if `source_2` had `required: False` (since `INPUT_FILE_2` is missing).
## Development practices

While YAML is a data format, it is best to treat the `earthmover` YAML configuration as code, meaning you should

- version it!
- avoid putting credentials and other sensitive information in the configuration; rather, specify such values as parameters
- keep your YAML DRY by using Jinja macros and YAML anchors and aliases
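For example, YAML anchors and aliases let you define repeated settings once and reuse them (a sketch; the file names and `header_rows` value are illustrative):

```yaml
# Define shared source settings once with an anchor (&csv_defaults)
# and reuse them via a merge key (<<: *csv_defaults).
sources:
  source_1: &csv_defaults
    file: ./sources/source_file_1.csv
    header_rows: 1
  source_2:
    <<: *csv_defaults
    file: ./sources/source_file_2.csv
```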
Remember that code is poetry: it should be beautiful! To that end

- Carefully choose concise, good names for your `sources`, `transformations`, and `destinations`.
  - Good names for `sources` could be based on their source file/table (e.g. `students` for `students.csv`)
  - Good names for `transformations` indicate what they do (e.g. `students_with_mailing_addresses`)
  - Good names for `destinations` could be based on the destination file (e.g. `student_mail_merge.xml`)
- Add good, descriptive comments throughout your YAML explaining any assumptions or non-intuitive operations (including complex Jinja).
- Likewise put Jinja comments in your templates, explaining any complex logic and structures.
- Keep YAML concise by consolidating `transformation` operations where possible. Many operations like `add_columns`, `map_values`, and others can operate on multiple `columns` in a dataframe.
- At the same time, avoid doing too much at once in a single `transformation`; splitting multiple `join` operations into separate transformations can make debugging easier.
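As a sketch of such consolidation, a single `map_values` operation can recode several columns at once instead of one operation per column (the transformation name, column names, and mapping below are hypothetical; check the earthmover operation reference for the exact option names):

```yaml
transformations:
  cleaned_students:
    source: $sources.source_1
    operations:
      # One map_values operation covering two columns at once,
      # rather than two near-identical operations:
      - operation: map_values
        columns:
          - is_enrolled
          - is_active
        mapping:
          "Y": "True"
          "N": "False"
```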
## Debugging practices

When developing your transformations, it can be helpful to

- specify `config` » `log_level: DEBUG` and `transformation` » `operation` » `debug: True` to verify the columns and shape of your data after each `operation`
- turn on `config` » `show_stacktrace: True` to get more detailed error messages
- avoid name-sharing for a `source`, a `transformation`, and/or a `destination` - this is allowed but can make debugging confusing
- install pygraphviz and turn on `config` » `show_graph: True`, then visually inspect your transformations in `graph.png` for structural errors
- use a linter/validator to validate the formatting of your generated data
You can remove these settings once your `earthmover` project is ready for operationalization.
## Operationalization practices

Typically `earthmover` is used when the same (or similar) data transformations must be done repeatedly. (A one-time data transformation task may be more easily done with SQLite or a similar tool.) When deploying/operationalizing `earthmover`, whether with a simple scheduler like cron or an orchestration tool like Airflow or Dagster, consider

- specifying conditions you `expect` your sources to meet, so `earthmover` will fail on source data errors
- specifying `config` » `log_level: INFO` and monitoring logs for phrases like "`distinct_rows` operation removed NN duplicate rows" and "`filter_rows` operation removed NN rows"
- using the structured run output flag and shipping the output somewhere it can be queried or drive a monitoring dashboard
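A source-level `expect` block might look like the sketch below; the column names and conditions here are hypothetical, and each condition is an expression that every row of the source must satisfy:

```yaml
sources:
  source_1:
    file: ${INPUT_FILE_1}
    header_rows: 1
    # earthmover fails the run if any row violates these conditions
    # (column names are hypothetical examples):
    expect:
      - student_id != ''
      - grade_level != ''
```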