Join this group if you use the Hadoop version of DMX.

Hi SyncSort team

We are evaluating this use case in SyncSort; please help us with our understanding.

We have a table T1_Current in Hive (database name, say, current_data_db).

We have another table T1_New in Hive (database name, say, newdata_db).

Every day we get new data, which is stored in the newdata_db.T1_New table.

Our requirement is to merge the data in newdata_db.T1_New and current_data_db.T1_Current based on a primary key.

We want the output of this SyncSort task/job to be stored in Hive with the following properties:

1. Avro file format

2. Partitioned on a date field in the dataset (both the current and new data tables mentioned above have the same date column)

3. Bucketed in the T1_Current table on, say, the primary key, with, say, 64 buckets
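
To make requirements 2 and 3 concrete, here is a minimal Python sketch of how each record would map to a date partition and one of 64 buckets. It is only an illustration: the field names `pk` and `load_date` are hypothetical, and CRC32 is a stand-in hash, not the hash function Hive itself applies to a bucketing column.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 64  # bucket count stated in requirement 3


def bucket_for(primary_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a bucket number in [0, num_buckets).

    CRC32 is used only for illustration; Hive uses its own hash.
    """
    return zlib.crc32(primary_key.encode("utf-8")) % num_buckets


def layout(records):
    """Group records by (partition date, bucket number).

    Each record is a dict with hypothetical keys "pk" and "load_date".
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["load_date"], bucket_for(rec["pk"]))].append(rec)
    return dict(groups)
```

The point of the sketch is that every row with the same primary key always falls in the same bucket, while the date column alone decides the partition.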

We have questions about how to achieve the above output through a SyncSort task/job.

Here are our assumptions:

1. A SyncSort job can output only one file at a time. We do not know whether it can output to multiple files, which is why we make this assumption.

2. A SyncSort job can take multiple input files. We are planning to use the DTL language to build this whole task/job, so that we can define as many INPUT files as we need.

3. We would like the SyncSort job to process its INPUT files in a defined sequence. The reason is that the T1_New table holds files such as Jan1_2016.data, Jan2_2016.data, Jan3_data.data, and in T1_Current we want each primary key to have only its latest record version. When we consolidate all the T1_New Hive table data files (Jan1_2016, Jan2_2016 ... Feb9_2016) in a single job, we would like SyncSort to process the input files in a "user-defined sequence".

As far as we can tell, we currently cannot specify sequence numbers for INPUT files in SyncSort, yet T1_Current may hold only the latest version of each record. To work around this, we are thinking of processing one file at a time, each in a separate job, with the jobs executed in sequence.
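
The "latest version per primary key" rule described above can be sketched in Python as follows. This is not SyncSort or DTL code, just an illustration of the intended merge semantics; the function name `merge_latest`, the `pk` field, and the `(sequence, rows)` batch shape are all hypothetical.

```python
def merge_latest(current, new_batches):
    """Return one row per primary key, keeping the most recent version.

    current      -- list of dicts already in T1_Current
    new_batches  -- list of (sequence, rows) pairs, in any order;
                    sequence can be a number or a date
    """
    # Start from the existing rows, indexed by primary key.
    merged = {row["pk"]: row for row in current}
    # Apply batches in ascending sequence so later batches overwrite
    # earlier ones for the same key.
    for _, rows in sorted(new_batches, key=lambda batch: batch[0]):
        for row in rows:
            merged[row["pk"]] = row
    return list(merged.values())
```

Because the batches are sorted on an explicit sequence value before being applied, the result does not depend on the order in which the input files happen to be listed or read.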

Please let us know if you have any recommendations for the above.

Thanks,

Kishore Veleti A.V.K.

Replies

  • Hi Kishore,

    DMX tasks do support multiple sources and multiple targets. In the Task Editor, you can add more files by choosing Add Source/Target. In DTL, you can have multiple /INFILE and /OUTFILE options.

    Also note that there are SourceName() and SourceFullName() functions available, which identify the source file that a record came from. You could potentially derive a key from the filename by creating a value that uses these functions to parse the filename.
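
As an illustration of deriving a sequence key from a filename, here is a Python sketch of the parsing logic only. It is not DMX or DTL syntax, and it does not call SourceName(); it assumes the file name follows the `Jan1_2016.data` pattern from the question, and the names `sequence_key` and `NAME_PATTERN` are hypothetical.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Matches names such as "Jan1_2016.data", with or without a leading path.
NAME_PATTERN = re.compile(r"([A-Za-z]{3})(\d{1,2})_(\d{4})\.data$")


def sequence_key(source_name: str) -> date:
    """Turn a source file name into a sortable date."""
    match = NAME_PATTERN.search(source_name)
    if match is None:
        raise ValueError(f"unrecognized file name: {source_name}")
    month, day, year = match.groups()
    month = month.capitalize()
    if month not in MONTHS:
        raise ValueError(f"unrecognized month in: {source_name}")
    return date(int(year), MONTHS[month], int(day))
```

Sorting on a key like this, rather than on the raw name, puts Feb9_2016 after Jan2_2016. A name that does not follow the pattern, such as Jan3_data.data, raises an error instead of being silently misordered.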
