Any journals on application layer communication?

knowlesy · 23 Mar 2010 at 17:09

I'm trying to build a packet-sniffing program for my dissertation that will identify which applications are in use on a network or a specific machine (via filtering) by capturing TCP and UDP packets and analysing the payload inside them...

However, I need references and previous research on how applications send information at the application layer in TCP or UDP packets, so I can pick up on their "signature", record those specific packets and identify the types of programs in use.

Or can anyone tell me what specific information in a TCP packet to look for in the hex, and where the data sits in the datagram?

Sorry mods if it's in the wrong section, but I thought that since I'm developing it, it would be best suited here....
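On the "location of data in the datagram" part, a minimal sketch in Python, assuming a raw IPv4 packet with the Ethernet header already stripped. The function name `tcp_payload` is made up for illustration. The only facts it relies on are the standard header layouts: the low nibble of the first IPv4 byte is the header length in 32-bit words, byte 9 is the protocol number, and the high nibble of TCP byte 12 is the TCP header length in 32-bit words.

```python
import struct

def tcp_payload(ip_packet: bytes) -> bytes:
    """Return the TCP payload of a raw IPv4 packet (no Ethernet header)."""
    version_ihl = ip_packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    ip_header_len = (version_ihl & 0x0F) * 4   # IHL is counted in 32-bit words
    # Total Length covers header + data; slicing by it drops any link-layer padding.
    total_len = struct.unpack("!H", ip_packet[2:4])[0]
    if ip_packet[9] != 6:                      # protocol number 6 = TCP
        raise ValueError("not a TCP segment")
    tcp = ip_packet[ip_header_len:total_len]
    tcp_header_len = (tcp[12] >> 4) * 4        # Data Offset, also in 32-bit words
    return tcp[tcp_header_len:]
```

Everything after those two variable-length headers is application data, which is where any signature matching would start.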

NickK · 23 Mar 2010 at 17:29

Well, considering that the number of applications that can use the payload is effectively unlimited, and each packet could come from any part of the communication, it's a tall order.

If it's acting completely transparently, you would also have the issue that not all traffic may be routed past your application due to load balancing etc.

I would first work on reassembling the session and then work on the payload.

Usually sniffers understand specific messaging. Ethereal would be an example of what you're doing:

A: There are currently 759 supported protocols and media, listed below. Descriptions can be found in the ethereal(1) man page.

knowlesy · 23 Mar 2010 at 17:43

Agreed. I'm looking more at specific everyday applications that an average user would use (e.g. P2P, email, browser, IM, VoIP). Software a user would run, such as uTorrent or Outlook, each has its own specific application protocol.

But still, how would I identify the application? For example, in Wireshark the User-Agent header in HTTP packets identifies the browser. I know this because I've discovered it myself in Wireshark... however, I need references to back this up.

Also, regardless of transparency, I have assumed the use of hubs or (IIRC it's called) port mirroring to capture the traffic to filter, as well as the use of promiscuous mode.
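The User-Agent observation can be sketched roughly like this in Python, assuming the payload is the start of an unencrypted HTTP/1.x request. The function name `http_user_agent` is hypothetical, and the header value is whatever the client chooses to send, so it is a hint rather than proof of the application.

```python
def http_user_agent(payload: bytes):
    """Return the User-Agent value from an HTTP request payload, or None."""
    # Headers end at the first blank line; ignore any body after it.
    head = payload.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")
    for line in lines[1:]:                     # lines[0] is the request line
        name, sep, value = line.partition(b":")
        # Header names are case-insensitive.
        if sep and name.strip().lower() == b"user-agent":
            return value.strip().decode("latin-1")
    return None
```

Usage would be something like `http_user_agent(b"GET / HTTP/1.1\r\nUser-Agent: ExampleBrowser/1.0\r\n\r\n")`, where `ExampleBrowser/1.0` is a made-up value.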

NickK · 23 Mar 2010 at 18:11

Ahh, Wireshark rings a bell too.. the fly in the ointment is SSL, as the payload is only decrypted at the endpoints, above the TCP/IP level, so a sniffer on the wire sees only ciphertext. If you're packet sniffing then IPsec may also get in the way..

Well, you will need to locate the protocol specifications for each protocol you want to eavesdrop on. Then you need to create a 'fingerprint' scanner for that protocol so when each packet goes through you can piece together the payloads into the protocol messages.

The Tekelec PASM system would match bytes at specific addresses against its library of protocol message templates to narrow down the message types. You can then use heuristics to narrow down the type even further according to the templates.
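A minimal sketch of the "match bytes at specific addresses" idea in Python. This is not the Tekelec implementation, just an illustration: the `TEMPLATES` table and `match_templates` function are made up, and the handful of signatures are well-known leading bytes (the BitTorrent handshake prefix, the TLS handshake record header followed by a ClientHello type byte, the SSH identification string, an HTTP GET request line).

```python
# Each template is a list of (offset, expected_bytes) checks that must all pass.
TEMPLATES = {
    "bittorrent": [(0, b"\x13BitTorrent protocol")],
    "tls-client-hello": [(0, b"\x16\x03"), (5, b"\x01")],
    "ssh": [(0, b"SSH-")],
    "http-get": [(0, b"GET ")],
}

def match_templates(payload: bytes, templates=TEMPLATES):
    """Return the names of all templates whose byte checks match the payload."""
    hits = []
    for name, checks in templates.items():
        # Slicing past the end of the payload yields a short slice, which
        # simply fails the comparison instead of raising.
        if all(payload[off:off + len(pat)] == pat for off, pat in checks):
            hits.append(name)
    return hits
```

A payload matching more than one template, or none, is where the heuristics would come in.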

So, as I said before, you need to work out:
a) converting packets into a stream
b) analysing the stream efficiently to quickly match templates using known byte locations etc.
c) then perhaps attempting to decode the stream by offering it to each of the template decoders until you find the best match at a detailed level.
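Steps (a) and (c) could be sketched along these lines in Python. All names here (`reassemble`, `http_decoder`, `bittorrent_decoder`, `best_match`) are hypothetical, the decoders are toy scorers rather than real protocol parsers, and the reassembly assumes one direction of one TCP connection and ignores sequence-number wraparound.

```python
def reassemble(segments, initial_seq):
    """Step (a): join (seq, data) TCP segments into contiguous bytes.

    Stops at the first gap; duplicates and overlaps are trimmed.
    """
    stream = bytearray()
    next_seq = initial_seq
    for seq, data in sorted(segments, key=lambda s: s[0]):
        if seq + len(data) <= next_seq:
            continue                      # retransmission of bytes we already have
        if seq > next_seq:
            break                         # missing segment: stop at the gap
        stream += data[next_seq - seq:]   # keep only the new bytes
        next_seq = seq + len(data)
    return bytes(stream)

def http_decoder(stream):
    """Score 1.0 if the first line looks like an HTTP/1.x request line."""
    parts = stream.split(b"\r\n", 1)[0].split(b" ")
    return 1.0 if len(parts) == 3 and parts[2].startswith(b"HTTP/") else 0.0

def bittorrent_decoder(stream):
    """Score 1.0 if the stream opens with the BitTorrent handshake prefix."""
    return 1.0 if stream.startswith(b"\x13BitTorrent protocol") else 0.0

DECODERS = {"http": http_decoder, "bittorrent": bittorrent_decoder}

def best_match(stream, decoders=DECODERS):
    """Step (c): offer the stream to every decoder and keep the best score."""
    scores = {name: decode(stream) for name, decode in decoders.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] > 0 else None
```

For example, `best_match(reassemble(segments, initial_seq))` returns a protocol name or `None`. Real decoders would return graded scores so that partial matches can be ranked.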

If you maintain state, you can bundle all that information into the 'session' and slowly fill in the blanks, such as the HTTP fields or XML in use. Once you have a match on the templates you can also exploit it: if it's a TCP session, the protocol is likely session-based, so for that connection you can reduce the scanner complexity dynamically to make it faster (for example, use only the one protocol decoder you've already identified).

There's no single repository for specifications; each vendor implements standards differently (including bespoke optional enhancements), and the specifications differ between versions of a protocol (and some protocols don't identify which version!). Lastly, systems can spoof..
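A rough Python sketch of keeping per-session state so the scanner only runs until a connection has been identified. `flow_key` and `SessionTable` are made-up names, and `classify` stands for whatever scanner you plug in (any callable taking bytes and returning a protocol name or `None`). As a simplification it keeps a single buffer per session; a real tool would likely keep one per direction.

```python
def flow_key(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Direction-independent key: both directions of a connection map to it."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

class SessionTable:
    def __init__(self, classify):
        self.classify = classify          # callable: bytes -> name or None
        self.sessions = {}

    def feed(self, key, payload):
        """Add payload to the session and return its protocol, if known."""
        session = self.sessions.setdefault(
            key, {"protocol": None, "buffer": bytearray(), "fields": {}}
        )
        if session["protocol"] is None:
            # Still unidentified: buffer the data and run the full scanner.
            session["buffer"] += payload
            session["protocol"] = self.classify(bytes(session["buffer"]))
        # Once identified, the scanner is skipped for this connection.
        return session["protocol"]
```

The `fields` dict is left empty here; it is where details such as HTTP headers would be filled in as they are seen.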

On the matter of capacity. If you're rolling this out into realtime large scale IP systems with high traffic loads you will need a front and a back system. The problem is simple - you need all the information for a 'stream' to be as close as possible to your processing component. This means that you need to direct IP traffic towards the same node or at least the same cluster. The front's sole job is to route the traffic so that like IP traffic is consolidated, so when the back node analyses it there's zero (or very little) inter-node swapping of packets to analyse.
Probably a bit overkill for your project... but not when your company deals in multi-Gigabit IP bandwidth..
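One common way for a front system to keep like traffic together is to hash the connection endpoints and pick a back node from the hash. A minimal Python sketch, assuming a fixed number of back nodes; `node_for_flow` is a hypothetical name and this is only one possible scheme, not how any particular product does it.

```python
import zlib

def node_for_flow(src_ip, src_port, dst_ip, dst_port, n_nodes):
    """Map a connection to a back-node index, the same for both directions."""
    # Sort the endpoints so A->B and B->A produce the same key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    # crc32 is stable across processes, unlike Python's built-in hash() on str.
    return zlib.crc32(key) % n_nodes
```

Every packet of a given connection then lands on the same node, so the back node can reassemble the stream without fetching packets from its neighbours.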

knowlesy · 23 Mar 2010 at 19:08

Thanks NickK, you've given me a better perspective on how to go about it.