Sunday, 30 December 2007

Removing Duplicate Lines in a File

Like many people I use Squid to:

- save bandwidth through caching

- control access to certain sites, e.g. adware / tracking web sites, by applying ACLs

A key third purpose for me is to act as a concentrator / connector to an upstream service, Scansafe. Scansafe has the usual lists of web site categories, but the really useful (and desirable) element is that it does in-stream analysis of the pages the user requests.

This is critical because, in the new Web 2.0 world where anyone can upload content, lists simply won't work. In addition, lists do not help where a trustworthy site hosts links / content that has been compromised, e.g. the India Times.

One outcome of this is that I have two sources of bad stuff I need to block: analysis of the Squid logs, and analysis of the Scansafe real-time reports.

So what I do is copy them into the same file, and then use the following command to remove the duplicates:

awk '!x[$0]++' file > file.new
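
For anyone wondering how that works: awk keeps an associative array x keyed on the whole input line ($0). The expression !x[$0]++ is true only the first time a given line is seen (the stored count is still zero), so awk's default action of printing the line fires once per unique line, and the original order is preserved. A more long-winded equivalent, just to show what is going on, would be:

awk '{ if (!(seen[$0])) { print; seen[$0] = 1 } }' file > file.new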

I got this command from nir_s on unix.com.
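
Putting the whole thing together, the workflow is roughly as follows (the file names here are only illustrative, not the actual ones I use):

cat squid-bad-sites.txt scansafe-bad-sites.txt > blocklist.tmp
awk '!x[$0]++' blocklist.tmp > blocklist.txt
rm blocklist.tmp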
