Hi SyncSort team,
We are evaluating the following use case in SyncSort. Please help us with our understanding.
We have a table T1_Current in Hive (database name, say, current_data_db).
We have another table T1_New in Hive (database name, say, newdata_db).
Every day we get new data, and it is stored in the newdata_db.T1_New table.
Our requirement is to merge the data in newdata_db.T1_New into current_data_db.T1_Current based on a primary key.
We want the output of this SyncSort task/job to land in Hive with the following properties:
1. Avro file format
2. Partitioned on a date field in the dataset (both the current and the new-data tables mentioned above have the same date column)
3. Bucketed on the primary key in the T1_Current table, with, say, 64 buckets
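To make the intended layout concrete, here is a minimal sketch in plain Python (not SyncSort DTL) of how each record would map to a date partition and one of 64 buckets. The field names `id` and `load_date` are hypothetical, and the key is assumed to be an integer; the hash that Hive actually applies depends on the column type and Hive version, so this only illustrates the idea.

```python
from collections import defaultdict

NUM_BUCKETS = 64  # bucket count from requirement 3


def bucket_for(primary_key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an integer key to a bucket number in [0, num_buckets).

    Assumption: the key is an integer and is used as its own hash.
    Masking the sign bit keeps the result non-negative.
    """
    return (primary_key & 0x7FFFFFFF) % num_buckets


def layout(records):
    """Group records by (partition date, bucket number)."""
    grouped = defaultdict(list)
    for rec in records:
        # "load_date" and "id" are hypothetical column names.
        grouped[(rec["load_date"], bucket_for(rec["id"]))].append(rec)
    return dict(grouped)


if __name__ == "__main__":
    rows = [
        {"id": 1, "load_date": "2016-01-01"},
        {"id": 65, "load_date": "2016-01-01"},
        {"id": 2, "load_date": "2016-01-02"},
    ]
    for (date, bucket), recs in sorted(layout(rows).items()):
        print(date, bucket, [r["id"] for r in recs])
```

In this sketch, ids 1 and 65 fall into the same bucket of the same partition because 65 mod 64 is 1.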
We have questions about how to achieve the above output through a SyncSort task/job.
Here are our assumptions:
1. A SyncSort job can output only one file at a time. We do not know whether it can output to many files, which is why we are making this assumption.
2. A SyncSort job can take multiple input files. We are planning to use the DTL language to build the whole task/job, so that we can define as many INPUT files as we need.
3. We would like the SyncSort job to process its INPUT files in a defined sequence. The reason is that the T1_New table contains files such as Jan1_2016.data, Jan2_2016.data and Jan3_2016.data, and in T1_Current we want each primary key to have only its latest record version. When we consolidate all the T1_New Hive table data files (Jan1_2016, Jan2_2016 ... Feb9_2016) in a single job, we would like SyncSort to process the input files in a "user-defined sequence".
We believe that currently we cannot specify sequence numbers for INPUT files in SyncSort, yet T1_Current must hold only the latest version of each record. To work around this, we are thinking of processing one file at a time, each in a separate job, with the jobs executed in sequence.
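The behaviour we are after can be sketched in plain Python (again, not SyncSort DTL): the new files are applied to the current data strictly in the order we specify, and a later record for the same primary key replaces an earlier one. The function name `merge_latest` and the field name `id` are hypothetical and used only for illustration.

```python
def merge_latest(current, new_batches):
    """Return one record per primary key, keeping the latest version.

    current:     records already in T1_Current
    new_batches: list of batches (one per T1_New file), given in the
                 order in which they must be applied, oldest first
    """
    # "id" is a hypothetical primary key column.
    merged = {rec["id"]: rec for rec in current}
    for batch in new_batches:
        for rec in batch:
            # A later batch overwrites any earlier version of the key.
            merged[rec["id"]] = rec
    return list(merged.values())


if __name__ == "__main__":
    current = [{"id": 1, "value": "old"}, {"id": 2, "value": "keep"}]
    jan1 = [{"id": 1, "value": "jan1"}, {"id": 3, "value": "jan1"}]
    jan2 = [{"id": 1, "value": "jan2"}]
    for rec in merge_latest(current, [jan1, jan2]):
        print(rec)
```

Running one job per file in sequence, as described above, would amount to calling this merge once per batch and feeding each result into the next call.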
Please let us know if you have any recommendations for the above.
Kishore Veleti A.V.K.