Suppose I have 10 files on a Linux file system, generated by a popular CDC tool. All 10 files belong to the same table, so they share the same field structure.
Now I want to develop a job that deduplicates across these files and generates an output containing the latest and greatest record for every primary key found in them.
Please note that I do not want to dedup yesterday's data against today's data. Instead, I want to dedup across the set of files generated for today.
The reason is that in a Hadoop data lake we generally store the daily data in the raw data zone, with the business date as a Hive partition. This way we always have the latest and greatest version of the changed/incremental data for that business date.
My understanding is that SyncSort JOIN jobs always follow a left side, right side, output pattern.
In my case I have only one side, and that side is just an HDFS folder. The job should dedup the files in that folder and store the final output under the business date (derived from the file name or from a field within the row).
How can we achieve this?
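To make the requirement concrete, here is a minimal Python sketch of the logic I am after. It is not DMX code, only an illustration. The field names `id`, `change_ts`, and `business_date`, the CSV layout, and the helper names are all assumptions made for the example, and timestamps are assumed to be ISO-8601 strings so that plain string comparison orders them correctly.

```python
import csv
import glob
import os
from collections import defaultdict


def read_folder(folder, pattern="*.csv"):
    """Yield every row from every matching file in one folder (the single 'side')."""
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, newline="") as fh:
            yield from csv.DictReader(fh)


def dedup_latest(rows, key_field="id", order_field="change_ts"):
    """Keep, for each primary key, the row with the greatest order_field.

    Field names are hypothetical. On a tie, the row read later wins.
    """
    latest = {}
    for row in rows:
        key = row[key_field]
        if key not in latest or row[order_field] >= latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


def partition_by_date(rows, date_field="business_date"):
    """Group the deduplicated rows by business date (one group per Hive partition)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[date_field]].append(row)
    return dict(parts)
```

With these helpers, `partition_by_date(dedup_latest(read_folder("/some/folder")))` would return one list of rows per business date, each holding a single row per primary key. What I am looking for is the equivalent single-input flow in a DMX job.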
Kishore Veleti A.V.K.