There are several ways to remove duplicates.
- If the source is DBMS, you can use the property in Source Qualifier to select the distinct records.Or you can also use the SQL Override to perform the same.
- You can use, Aggregator and select all the ports as key to get the distinct values. After you pass all the required ports to the Aggregator, select all those ports , those you need to select for de-duplication. If you want to find the duplicates based on the entire columns, select all the ports as group by key.
The Mapping will look like this.
- You can use Sorter and use the Sort Distinct Property to get the distinct values. Configure the sorter in the following way to enable this.
- You can use, Expression and Filter transformation, to identify and remove duplicate if your data is sorted. If your data is not sorted, then, you may first use a sorter to sort the data and then apply this logic:
- Bring the source into the Mapping designer.
- Let’s assume the data is not sorted. We are using a sorter to sort the data. The Key for sorting would be Employee_ID.
Configure the Sorter as mentioned below.
- Use one expression transformation to flag the duplicates. We will use the variable ports to identify the duplicate entries, based on Employee_ID.
- Use a filter transformation, only to pass IS_DUP = 0. As from the previous expression transformation, we will have IS_DUP =0 attached to only records, which are unique. If IS_DUP > 0, that means, those are duplicate entries.
- Add the ports to the target. The entire mapping should look like this.
v. When you change the property of the Lookup transformation to use the Dynamic Cache, a new port is added to the transformation. NewLookupRow.
The Dynamic Cache can update the cache, as and when it is reading the data.
If the source has duplicate records, you can also use Dynamic Lookup cache and then router to select only the distinct one.