Regarding approach 2:
Your second approach is pretty much how NAT works. Every TCP/UDP client on the local network has up to 65535 usable ports (port 0 is reserved) and a private IP. The router knows only a single public IP. Since two clients may both use source port 300, it cannot simply replace the private IP with the public one, as that would cause collisions. So it replaces the IP and "translates" the port (NAT: Network Address Translation). On the way back, it translates the port back and replaces the public IP with the private one again before forwarding the packet. You'd be doing nothing other than that.

Routers keep that mapping in memory, and they are not too slow when doing NAT: companies with hundreds of computers are sometimes NATed to the Internet and the slowdown is hardly noticeable in most cases. You say you want up to a thousand transactions a second, but how many clients will there be? That mainly determines how much memory you need to hold the mappings. If there are not too many clients, you could keep the mapping in a sorted table in memory; in that case speed will be the smallest problem (the table growing too big and the server running out of memory is the bigger one).
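To make that concrete, here is a minimal sketch of such a translation table (names like `IdTranslator` are my invention, not from your question; it assumes 32-bit original IDs being mapped to free 16-bit local IDs and back, the same way a router maps ports):

```python
class IdTranslator:
    """NAT-style table: maps 32-bit original IDs to free 16-bit local IDs and back."""

    MAX_LOCAL = 0xFFFF  # 65535 usable local IDs; 0 is skipped, like port 0

    def __init__(self):
        self._to_local = {}  # 32-bit original -> 16-bit translated
        self._to_orig = {}   # 16-bit translated -> 32-bit original
        self._next = 1       # next candidate local ID

    def translate(self, orig_id):
        """Return the 16-bit local ID for orig_id, allocating one if needed."""
        if orig_id in self._to_local:
            return self._to_local[orig_id]
        if len(self._to_local) >= self.MAX_LOCAL:
            raise RuntimeError("no free local IDs left")
        # Find a free slot (simple linear probe; fine while the table is small).
        while self._next in self._to_orig:
            self._next = (self._next % self.MAX_LOCAL) + 1
        local = self._next
        self._to_local[orig_id] = local
        self._to_orig[local] = orig_id
        return local

    def translate_back(self, local_id):
        """Reverse lookup on the way back, like the router rewriting the reply."""
        return self._to_orig[local_id]

    def release(self, orig_id):
        """Drop the mapping once the transaction has expired."""
        local = self._to_local.pop(orig_id)
        del self._to_orig[local]
```

With only 20 or so live IDs per client, both dictionaries stay tiny and lookups are effectively constant time; the `release` call is what keeps the table from growing without bound.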
What is a bit unclear to me is that you once say
Fortunately the client only has about 20 or so instances of things with these IDs - let's call them packages - at any given time and it only needs to have them unique amongst local siblings.
but then you say
Some IDs less than 65535 may still be in play on a given client at any time due to non-expiration.
I guess what you probably meant by the second statement is that when a client requests ID 65536, it might still hold IDs below 65535, and those can be as low as (say) 20. The client doesn't process IDs in strictly ascending order, right? So just because it now requested 65536, you cannot conclude that it holds no IDs in the range 1-1000. It might actually still keep references to 20, 90, 2005 and 41238 and yet go past 65535; is that what you meant?
I personally like your second approach more than the third one, as it is easier to avoid collisions in every case and translating the number back is a plain, simple operation. I doubt your third approach can work in the long run. Okay, you might use a byte to store how often you subtracted 2^16 from the number, but one byte can count at most 255 subtractions, so the largest number you can represent is just under 256 * 2^16 = 2^24. What will you do when numbers go above that? Switch to a different algorithm that does not subtract, but does what instead? Divide? Shift bits? In that case you lose granularity, meaning the algorithm can no longer hit every possible number (it will make large jumps).

If it were so easy to apply a magic translation function to a 32-bit number to make 16 bits out of it (plus one extra byte) and then just transform it back, every compression method in the world would use it, as it could, no matter what the 32-bit number was, always compress it down to 24 bits (16 bits + one byte). That would be magic. It is not possible to pack 32 bits into 24 bits together with all the logic needed to transform them back. You will need some external storage, which brings us back to your second approach. That is the only approach that will work, and it will work for every number in the 32-bit range.
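To make the counting argument concrete, here is a small sketch of the third approach as I understand it (the `encode`/`decode` names are mine; it assumes the subtraction count is stored in one extra byte):

```python
def encode(n):
    """Subtract 2**16 until the remainder fits in 16 bits, storing the
    count of subtractions in one extra byte. A byte holds at most 255,
    so anything at or above 256 * 2**16 == 2**24 cannot be encoded."""
    count, remainder = divmod(n, 2 ** 16)
    if count > 0xFF:
        raise OverflowError("subtraction count no longer fits in one byte")
    return count, remainder

def decode(count, remainder):
    """Reverse the encoding: add the subtractions back."""
    return count * 2 ** 16 + remainder
```

Everything below 2^24 round-trips fine, but `encode(2 ** 24)` already overflows the count byte, which is exactly the pigeonhole limit: 16 bits + one byte is 24 bits, and 24 bits can only distinguish 2^24 values, not 2^32.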