[ uh-pach-ee hoo-dee ] n. Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development. It helps manage business requirements such as the data lifecycle more efficiently and improves data quality.
Hudi maintains a timeline of all activity performed on a dataset so that it can provide instantaneous views of that dataset. Hudi organizes a dataset into a directory structure under a basepath, very similar to Hive tables. The dataset is broken up into partitions, which are folders containing the files for that partition. Each partition is uniquely identified by its partition path, which is relative to the basepath. Within each partition, records are distributed across multiple files. Each file is identified by a file id together with the commit that produced it. When records are updated, multiple files share the same file id but are written at different commits.
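
The layout described above can be modeled in a few lines of plain Python. This is a minimal sketch for illustration, not Hudi's API: the `DataFile` class, the `latest_files` helper, the `/data/trips` basepath, and the `<file_id>_<commit>.parquet` naming are all hypothetical, and commits are assumed to be timestamp strings that sort chronologically.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DataFile:
    """One file in a partition (hypothetical model, not a Hudi class)."""
    partition_path: str  # relative to the basepath, e.g. "2024/01/15"
    file_id: str         # shared by every version of the same file
    commit: str          # commit that produced this version (sortable timestamp assumed)

    def full_path(self, basepath: str) -> str:
        # Simplified naming scheme, assumed here for illustration only.
        name = f"{self.file_id}_{self.commit}.parquet"
        return str(PurePosixPath(basepath) / self.partition_path / name)


def latest_files(files):
    """For each (partition path, file id), keep the version from the newest commit."""
    latest = {}
    for f in files:
        key = (f.partition_path, f.file_id)
        if key not in latest or f.commit > latest[key].commit:
            latest[key] = f
    return latest


files = [
    DataFile("2024/01/15", "f1", "20240115093000"),
    DataFile("2024/01/15", "f1", "20240116101500"),  # update: same file id, later commit
    DataFile("2024/01/15", "f2", "20240115093000"),
    DataFile("2024/01/16", "f3", "20240116101500"),
]

latest = latest_files(files)
for f in latest.values():
    print(f.full_path("/data/trips"))
```

Here the file id `f1` appears twice in partition `2024/01/15` because an update rewrote it at a later commit. Picking the newest commit for each file id is one simple way to picture how a current view of the dataset can be derived from the files on disk.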