Changelog
This page tracks releases of earthmover, with a summary of what was changed, fixed, or added in each new version.
Unreleased changes
- feature: Add melt and pivot as dataframe operations
2025 releases
v0.4.6
(Released 2025-08-14)
- bugfix: rename_columns operation to an existing column name could result in two columns of the same name; now this results in an error
- feature: Add support for non-UTF8 file encodings for fixed-width inputs
v0.4.5
(Released 2025-07-11)
- bugfix: update MANIFEST.in to fix earthmover init
- feature: New year, new docs!
- feature: wildcard matching for columns
- feature: multi-directional sorting
- feature: make SqlSources hashable (work with state-tracking)
v0.4.4
(Released 2025-03-06)
v0.4.3
(Released 2025-01-23)
- feature: allow a colspec_file config with column info for fixedwidth inputs
- feature: error messages for keep_columns and drop_columns now specify the columns
2024 releases
v0.4.2
(Released 2024-11-15)
- feature: interpolate params into destination templates
- feature: lowercase columns
- fix: optional fields recursion
- fix: earthmover deps fails when not all params are passed
- fix: make all pandas/dask config conditional on >3.10
v0.4.1
(Released 2024-11-15)
- feature: allow specifying colspecs for fixed-width files
- feature: allow config params to be passable at the CLI and have a parameter_default
- feature: refactor Source columns list logic as a select instead of a rename
- bugfix: [earthmover deps failed to find nested local packages](https://github.com/edanalytics/earthmover/pull/134)
- bugfix: relative paths not resolved correctly when using project composition
- bugfix: --results-file required a directory prefix
- bugfix: some functionality was broken for Python versions < 3.10
v0.4.0
(Released 2024-10-16)
- feature: add support for Python 3.12, with corresponding updates to core dataframe dependencies
- feature: add --set flag for overriding values within earthmover.yml from the command line
v0.3.8
(Released 2024-09-06)
- bugfix: Jinja in destination header failed if dataframe is empty
v0.3.7
(Released 2024-09-04)
- feature: implementing a limit_rows operation
- feature: add support for a require_rows boolean or non-negative int on any node
- feature: add support for Jinja in a destination node header and footer
- bugfix: union fails with duplicate columns
v0.3.6
(Released 2024-08-07)
- feature: add json_array_agg function to group_by operation
- feature: select all columns using "*" in modify_columns operation
- internal: set working directory to the location of the earthmover.yaml file
- documentation: add information on earthmover init and earthmover clean to the README
- bugfix: fix bug with earthmover clean that could have removed earthmover.yaml files
v0.3.5
(Released 2024-07-12)
- feature: add earthmover init command to initialize a new sample project in the expected bundle structure
- internal: expand test run to include the new debug and flatten operations, as well as a nested JSON source file
- internal: improve customization in write behavior in new file destinations
- bugfix: Fix bug when writing null values in FileDestination
v0.3.4
(Released 2024-06-26)
- hotfix: Fix bug when writing out JSON in FileDestination
v0.3.3
(Released 2024-06-18)
- hotfix: Resolve incompatible package dependencies
- hotfix: Fix type casting of nested JSON for destination templates
v0.3.2
(Released 2024-06-14)
- feature: Add DebugOperation for logging data head, tail, columns, or metadata mid-run
- feature: Add FlattenOperation for splitting and exploding string columns into values
- feature: Add optional 'fill_missing_columns' field to UnionOperation to fill disjunct columns with nulls, instead of raising an error (default False)
- feature: Add git_auth_timeout config when entering Git credentials during package composition
- feature: Add earthmover clean command that removes local project artifacts
- feature: only output compiled template during earthmover compile
- feature: Render full row into JSON lines when template is undefined in FileDestination
- internal: Move FileSource size-checking and FtpSource FTP-connecting from compile to execute
- internal: Move template-file check from compile to execute in FileDestination
- internal: Allow filepaths to be passed to an optional FileSource, and check for file before creating empty dataframe
- internal: Build an empty dataframe if an empty folder is passed to an optional FileSource
FileSource - internal: fix some examples in README
- internal: remove GitPython dependency
- bugfix: fix bug in FileDestination where linearize: False resulted in BOM characters
- bugfix: fix bug where nested JSON would be loaded as a stringified Python dictionary
- bugfix: Ensure command list in help menu and log output is always consistent
- bugfix: fix bug in ModifyColumnsOperation where __row_data__ was not exposed in Jinja templating
v0.3.1
(Released 2024-04-26)
- internal: allow any ordering of Transformations during graph-building in compile
- internal: only create a /packages dir when earthmover deps succeeds
v0.3.0
(Released 2024-04-17)
- feature: add project composition using packages keyword in template file (see README)
- feature: add installation extras for optional libraries, and improve error logging to notify which is missing
- feature: GroupByWithRankOperation cumulatively sums record counts by group-by columns
- feature: setting log_level: DEBUG in template configs or setting debug: True for a node displays the head of the node mid-run
- feature: add optional_fields key to all Sources to add optional empty columns when missing from schema
- feature: add optional ignore_errors and exact_match boolean flags to DateFormatOperation
- internal: force-cast a dataframe to string-type before writing as a Destination
- internal: remove attempted directory-hashing when a source is a directory (i.e., Parquet)
- internal: refactor project to standardize import paths for Node and Operation
- internal: add Node.full_name attribute and Node.set_upstream_source() method
- internal: unify graph-building into compilation
- internal: refactor compilation and execution code for cleanliness
- internal: unify Node.compile() into initialization to ease Node development
- internal: Remove unused group_by_with_count and group_by_with_agg operations
v0.2.1
(Released 2024-04-08)
- feature: adding fromjson() function to Jinja
- feature: fix docs typos
- feature: SortRowsOperation sorts the dataset by columns
2023 releases
v0.2.0
(Released 2023-09-11)
- breaking change: remove source as Operation config and move to Transformation; this simplifies templates and reduces memory usage
- breaking change: version: 2 required in Earthmover YAML files
- feature: SnakeCaseColumnsOperation converts all columns to snake_case
- feature: show_progress can be turned on globally in config or locally in any Source, Transformation, or Destination to display a progress bar
- feature: repartition can be turned on in any applicable Node to alter Dask partition-sizes post-execute
- feature: improve performance when writing Destination files
- feature: improved Earthmover YAML-parsing and config-retrieval
- internal: rename YamlEnvironmentJinjaLoader to JinjaEnvironmentYamlLoader for better transparency of use
- internal: simplify Earthmover.build_graph()
- internal: unify Jinja rendering into a single util function, instead of redeclaring across project
- internal: unify Node.verify() into Node.execute() for improved code legibility
- internal: improve attribute declarations across project
- internal: improve type-hinting and doc-strings across project
- bugfix: refactor SqlSource to be compatible with SQLAlchemy 2.x
v0.1.6
(Released 2023-07-11)
- bugfix: fixing a bug to create the results_file directory if needed
- bugfix: process a copy of each node's data at each step, to avoid modifying original node data which downstream nodes may rely on
v0.1.5
(Released 2023-06-13)
- bugfix: fixing a bug to skip hashing missing optional source files
- feature: adding a tmp_dir config so we can tell Dask where to store data it spills to disk
- feature: adding a --results-file option to produce structured run metadata
- feature: adding a skip exit code
v0.1.4
(Released 2023-05-12)
- bugfix: config.state_file was being ignored when specified
- bugfix: further issues with multi-line config.macros - the resolution here (hopefully the last one!) is to pre-load macros (so they can be injected into run-time Jinja contexts) and then just allow the Jinja to render any macro definitions down to nothing in the config YAML... you do have to be careful with Jinja linebreak suppression, since the config could render down to YAML which will fail with an error about no sources defined.
- bugfix: charset issues when reading / writing non-UTF8 files - this should be resolved by enforcing every file read/write to specify UTF8 encoding
v0.1.3
(Released 2023-05-05)
- feature: implement ability to call {{ md5(column) }} in Jinja throughout earthmover, with a framework for other Python functions to be added in the future
- bugfix: fix multi-line macros issue
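As a rough illustration of what an md5 helper exposed to Jinja might compute, here is a minimal sketch using Python's standard hashlib. The function body, its signature, and the UTF-8 encoding are assumptions made for this example; earthmover's own implementation may handle types and encodings differently.

```python
import hashlib


def md5(value: str, encoding: str = "utf-8") -> str:
    """Return the hexadecimal MD5 digest of a string value.

    Hypothetical helper for illustration only, not earthmover's code.
    """
    return hashlib.md5(value.encode(encoding)).hexdigest()


# Hash a sample column value (assumed to already be a string).
digest = md5("hello")
print(digest)
```

A function like this could be registered in a Jinja environment's globals so that templates can call it by name.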
v0.1.2
(Released 2023-05-02)
- bugfix: fix continued issues with environment variable expansion under Windows by changing from os.path.expandvars() to native Python string.Template implementation
- bugfix: change how earthmover loads config.macros from YAML to prevent issues with multi-line macro definitions
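The difference between the two expansion approaches can be sketched with the standard library alone. The snippet below uses ntpath (the Windows flavour of os.path, importable on any platform) next to string.Template; the variable name EM_DATA_DIR, its value, and the sample line are hypothetical, and this is not earthmover's actual loading code.

```python
import ntpath
import os
from string import Template

# Hypothetical variable name and value, set only for this demonstration.
os.environ["EM_DATA_DIR"] = "/tmp/data"

# A YAML-like line in which the variable sits inside single quotes.
line = "file: '$EM_DATA_DIR/students.csv'"

# ntpath.expandvars() leaves text inside single quotes unexpanded.
windows_result = ntpath.expandvars(line)

# string.Template has no special handling of quotes, so the variable is substituted.
template_result = Template(line).safe_substitute(os.environ)

print(windows_result)
print(template_result)
```

safe_substitute() is used here so that unknown variables are left in place rather than raising; whether earthmover uses substitute() or safe_substitute() is not stated in this changelog.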
v0.1.1
(Released 2023-03-27)
- bugfix: a single quote in the config YAML could prevent environment variable expansion from working since os.path.expandvars() does not expand variables within single quotes in Python under Windows
v0.1.0
(Released 2023-03-23)
- feature: added parse-time Jinja templating to YAML configuration
Potentially breaking change: if your config YAML contains add_columns or modify_columns operations with Jinja expressions, these will now be parsed at YAML load time. To preserve the Jinja for runtime parsing, wrap the expressions with {%raw%}...{%endraw%}. See YAML parsing for further information.
- feature: removed dependency on matplotlib, which is only required if your YAML specifies config.show_graph: True... now if you try to show_graph without matplotlib installed, you'll get an error prompting you to install matplotlib
v0.0.7
(Released 2023-02-23)
- feature: added str_min() and str_max() functions for group by operation
v0.0.6
(Released 2023-02-17)
- feature: pass __row_data__ dict into Jinja templates for easier dynamic column referencing
- bugfix: parameter / env var interpolation into YAML keys, not just values
- refactor error handling key assertion methods
- refactor YAML loader line number context handling
2022 releases
v0.0.5
(Released 2022-12-16)
- trim nodes not connected to a destination from DAG
- ensure all source datatypes return a Dask dataframe
- update optional source functionality to require columns list, and pass an empty dataframe through the DAG
v0.0.4
(Released 2022-10-27)
- support running in Google Colab
v0.0.3
(Released 2022-10-27)
- support for Python 3.7
v0.0.2
(Released 2022-09-22)
- initial release