Uniq, comm and join: 3 Linux commands for the CLI

Apr 29 2012

These small commands are not widely known, but I think they can work wonders when you have to handle text files from the terminal: comparing two files, operating on the lines inside a single file, or merging two files according to some criterion.

In this article I’ll show you the most common options for these commands and some practical examples on how to use them.

Uniq

This command filters adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
With no options, each run of matching lines is collapsed into its first occurrence.

Useful options:

-c, --count
prefix lines by the number of occurrences

-d, --repeated
only print duplicate lines

-D, --all-repeated[=delimit-method]
print all duplicate lines
delimit-method={none(default),prepend,separate} Delimiting is done with blank lines

-f, --skip-fields=N
avoid comparing the first N fields

-i, --ignore-case
ignore differences in case when comparing

-s, --skip-chars=N
avoid comparing the first N characters

-w, --check-chars=N
compare no more than N characters in lines
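
As a rough sketch of how a few of these options combine, the lines below run uniq on some made-up log lines (the timestamps and messages are invented for illustration): -f 1 skips the leading timestamp when comparing, and -i additionally ignores case.

```shell
# Hypothetical log lines: a timestamp followed by a message.
log=$(printf '%s\n' '10:01 disk full' '10:02 disk full' '10:03 Disk Full' '10:04 link down')

# -f 1 skips the first field (the timestamp) when comparing adjacent lines.
by_message=$(printf '%s\n' "$log" | uniq -f 1)
echo "$by_message"

# Adding -i also treats "disk full" and "Disk Full" as the same message.
by_message_nocase=$(printf '%s\n' "$log" | uniq -f 1 -i)
echo "$by_message_nocase"
```

In both cases the line that survives is the first one of each run, so the timestamp shown is that of the first occurrence.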

So let’s see some examples with a file like this (file.txt):

aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

Uniq compares only adjacent lines, so first you must order the content of the file with the sort command and send its output to uniq through a pipe:

root@xubuntu-home:/tmp/article# sort file.txt |uniq
12 34
aa
aa bb
aa bb cc
aa bb fe
cc bb
fe fe fe
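
To see why the sort step matters, here is a small sketch that recreates the sample file in a scratch directory (the directory and variable names are just for illustration) and counts the lines that survive with and without sorting.

```shell
# Recreate the sample file used in this article in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb' '12 34' > "$workdir/file.txt"

# No two identical lines are adjacent here, so uniq alone removes nothing.
unsorted_count=$(uniq "$workdir/file.txt" | wc -l | tr -d ' ')

# After sorting, duplicates sit next to each other and are collapsed.
sorted_count=$(LC_ALL=C sort "$workdir/file.txt" | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted_count lines, sort | uniq: $sorted_count lines"
rm -r "$workdir"
```

LC_ALL=C is set only to make the sort order predictable across systems; it is not required for uniq to work.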

In the output uniq has removed 2 duplicate lines from the original file. If you want to see the number of occurrences of each line, you can use the -c option:

root@xubuntu-home:/tmp/article# sort file.txt |uniq -c
      1 12 34
      1 aa
      2 aa bb
      1 aa bb cc
      1 aa bb fe
      2 cc bb
      1 fe fe fe

The -c option is really useful and can be used to count a number of things, such as:

– Processes per user: ps hax -o user | sort | uniq -c

      2 avahi
      1 colord
      1 daemon
     60 linuxaria
      1 messagebus
      2 postfix
     82 root
      1 rtkit
      1 syslog

– Number and type of active connections: netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c

     49 ESTABLISHED
      6 LISTEN
     25 TIME_WAIT
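
A common follow-up to counts like these is to rank them. The sketch below uses an invented list of user names in place of real ps output and pipes the result of uniq -c through sort -rn, so the most frequent entry comes first.

```shell
# Hypothetical input: one user name per line, as `ps hax -o user` would print.
users=$(printf '%s\n' root linuxaria root postfix root linuxaria)

# Count each name, then sort numerically on the count, highest first.
ranked=$(printf '%s\n' "$users" | sort | uniq -c | sort -rn)
echo "$ranked"

# Keep only the name of the most frequent entry.
top_user=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $2}')
echo "most frequent: $top_user"
```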

Now back to our file: if we want to compare only the first 2 characters of each line, we can use sort file.txt |uniq -w 2 and the output will be:

12 34
aa
cc bb
fe fe fe
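
The -d option from the list above can be tried on the same sample file. This is only a sketch: it rebuilds file.txt in a scratch directory and prints the lines that occur more than once.

```shell
# Recreate the sample file used in this article in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb' '12 34' > "$workdir/file.txt"

# -d prints one copy of each line that appears more than once.
dups=$(LC_ALL=C sort "$workdir/file.txt" | uniq -d)
echo "$dups"

rm -r "$workdir"
```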

Comm

The comm command is a utility used to compare two files for common and distinct lines.
comm reads two files as input, regarded as lines of text, and writes a single output with three columns. The first two columns contain the lines unique to the first and second file, respectively. The last column contains the lines common to both.

As with uniq, the lines must be sorted for comm to work correctly, so we’ll use the sort command here too.

From COMM(1) man page, the options available are:

-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files

So if we have these input files:

# cat file.txt
aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

# cat file2.txt
aa bb cc
aa bb dd
aa 22
cc bb 33
fe fe fe fe
aa bb fe
aa bb
cc bb 11

To find only the lines that are common to both files, we first sort each file into a temporary file:

#sort file.txt > file.txt.sorted
#sort file2.txt > file2.txt.sorted

Now we can compare them:

#comm -12 file.txt.sorted file2.txt.sorted
aa bb
aa bb cc
aa bb fe

With process substitution (available in shells such as bash) we can do all this in one line and get the same result:

# comm -12 <(sort file.txt) <(sort file2.txt)

Note that without the -12 option we would get an output like this one:

# comm <(sort file.txt) <(sort file2.txt)
12 34
aa
	aa 22
		aa bb
aa bb
		aa bb cc
	aa bb dd
		aa bb fe
cc bb
cc bb
	cc bb 11
	cc bb 33
fe fe fe
	fe fe fe fe
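
The other option combinations work the same way. As a sketch, the commands below rebuild the two sample files in a scratch directory (the paths and variable names are illustrative) and use -23 and -13 to keep only the lines unique to one file.

```shell
workdir=$(mktemp -d)
# The two sample files from this article, sorted into temporary files.
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb' '12 34' \
  | LC_ALL=C sort > "$workdir/file.txt.sorted"
printf '%s\n' 'aa bb cc' 'aa bb dd' 'aa 22' 'cc bb 33' 'fe fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb 11' \
  | LC_ALL=C sort > "$workdir/file2.txt.sorted"

# -23 hides columns 2 and 3: only lines unique to the first file remain.
only_first=$(LC_ALL=C comm -23 "$workdir/file.txt.sorted" "$workdir/file2.txt.sorted")
# -13 hides columns 1 and 3: only lines unique to the second file remain.
only_second=$(LC_ALL=C comm -13 "$workdir/file.txt.sorted" "$workdir/file2.txt.sorted")

printf 'Only in file.txt:\n%s\n\nOnly in file2.txt:\n%s\n' "$only_first" "$only_second"
rm -r "$workdir"
```

sort and comm are run with the same LC_ALL=C setting here so that both agree on what "sorted" means.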

Join

The join command takes as input two text files and a number of options. With no options, it looks for pairs of lines from the two files that have the same first field (a sequence of non-blank characters) and outputs a line composed of that field followed by the rest of the two lines.

The options specify which character to use instead of blanks to separate the fields of a line, which field to use when looking for matching lines, and whether to output lines that do not match. The output can be saved to another file, rather than printed, using redirection.
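
A minimal sketch of the default behaviour, using two invented files (names.txt and cities.txt are made up for this illustration, and both are already sorted on their first field). The second command adds -a 1, which also prints the lines of the first file that have no match.

```shell
workdir=$(mktemp -d)
# Hypothetical data: an id followed by a name, and an id followed by a city.
printf '%s\n' '1 alice' '2 bob' '3 carol' > "$workdir/names.txt"
printf '%s\n' '1 rome' '3 milan' '4 turin' > "$workdir/cities.txt"

# Default: join on the first field, print only the lines that pair up.
joined=$(join "$workdir/names.txt" "$workdir/cities.txt")
echo "$joined"

# -a 1: also print lines from the first file that have no partner.
joined_all=$(join -a 1 "$workdir/names.txt" "$workdir/cities.txt")
echo "$joined_all"

rm -r "$workdir"
```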

A good example would be getting a username and its default login shell from /etc/passwd, together with the group name from /etc/group.

The numeric group ID is the 4th field in /etc/passwd and the 3rd field in /etc/group, so we’ll use these 2 fields to join the information:

$ join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /etc/passwd /etc/group
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
lp:lp:/bin/sh
mail:mail:/bin/sh
news:news:/bin/sh
proxy:proxy:/bin/sh

Explanation of the command:

-t ":" use the character : as the field separator
-1 4 -2 3 join on field 4 of file 1 and on field 3 of file 2
-o 1.1 2.1 1.7 output field 1 of file 1, field 1 of file 2 and field 7 of file 1
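
As a reader points out in the comments, join expects both inputs to be sorted on the join field, otherwise it may warn that a file "is not in sorted order" and skip matches. The sketch below shows that approach on small invented stand-ins for /etc/passwd and /etc/group (the entries are made up, so the output does not depend on the machine); the -o list is written with commas, which should be equivalent to the space-separated form used above.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-ins for /etc/passwd and /etc/group.
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' \
  'lp:x:7:7:lp:/var/spool/lpd:/bin/sh' \
  'hplip:x:103:7:HPLIP system user:/var/run/hplip:/bin/false' > "$workdir/passwd"
printf '%s\n' 'root:x:0:' 'daemon:x:1:' 'lp:x:7:hplip' 'mail:x:8:' > "$workdir/group"

# Sort each file on the field that will be used as the join key.
LC_ALL=C sort -t : -k 4,4 "$workdir/passwd" > "$workdir/passwd.sorted"
LC_ALL=C sort -t : -k 3,3 "$workdir/group"  > "$workdir/group.sorted"

# Join on the group ID: user name, group name, login shell.
result=$(LC_ALL=C join -t : -1 4 -2 3 -o 1.1,2.1,1.7 \
  "$workdir/passwd.sorted" "$workdir/group.sorted")
echo "$result"

rm -r "$workdir"
```

Note that the keys are sorted as text, not as numbers (so 103 would come before 7), because that is the ordering join itself uses when comparing fields.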

6 Responses to “Uniq, comm and join 3 Linux command for the CLI”

@Ste says:

Monday April 30th, 2012 at 01:45 PM

Hi, just pointing out that the line
#sort file.txt.sorted file2.txt.sorted

should be:
#comm file.txt.sorted file2.txt.sorted

- linuxari says:
  
  Monday April 30th, 2012 at 05:42 PM
  
  Thanks a lot,
  
  A mistake in transcribing the commands 🙂
  
scribe6324 says:

Monday April 30th, 2012 at 07:32 PM

With #sort file.txt.sorted file2.txt.sorted

Am I supposed to get the result below with the above command?
aa bb
aa bb cc
aa bb fe

If so, this is what I am getting:
12 34
aa
aa 22
aa bb
aa bb
aa bb
aa bb cc
aa bb cc
aa bb dd
aa bb fe
aa bb fe
cc bb
cc bb
cc bb 11
cc bb 33
fe fe fe
fe fe fe fe

I had to do the below for it to work:
comm -12 < /usr/bin/sort "file1.sorted" < /usr/bin/sort "file2.sorted"
aa bb
aa bb cc
aa bb fe

I am getting this when I run the join command
Running join command
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
sys:sys:/bin/sh
join: file 1 is not in sorted order
lp:lp:/bin/sh
mail:mail:/bin/sh
join: file 2 is not in sorted order
news:news:/bin/sh
uucp:uucp:/bin/sh
proxy:proxy:/bin/sh
www-data:www-data:/bin/sh
backup:backup:/bin/sh
list:list:/bin/sh
irc:irc:/bin/sh
gnats:gnats:/bin/sh
nobody:nogroup:/bin/sh
libuuid:libuuid:/bin/sh
syslog:syslog:/bin/false
messagebus:messagebus:/bin/false
haldaemon:haldaemon:/bin/false

- linuxari says:
  
  Monday April 30th, 2012 at 08:41 PM
  
  There was a typo in the article. The command is not #sort file.txt.sorted file2.txt.sorted; the correct one is: #comm -12 file.txt.sorted file2.txt.sorted
  
  For the join command, I did not get that message on my Ubuntu 12.04. To suppress that output you can use --nocheck-order, or you can first sort both /etc/passwd and /etc/group.
  
  Best regards
  
  - scribe6324 says:
    
    Monday April 30th, 2012 at 10:14 PM
    
    Ok kool, good to know it was a typo, it is clearer now.
    
    First time coming across the comm command.
    
    This is what is happening with the join command on Ubuntu 10.04
    
    I tried this:
    
    sort /etc/passwd > /tmp/passwd.sorted
    sort /etc/group > /tmp/group.sorted
    
    join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
    
    But got the below results:
    
    join: file 1 is not in sorted order
    join: file 2 is not in sorted order
    games:games:/bin/sh
    gdm:gdm:/bin/false
    gnats:gnats:/bin/sh
    haldaemon:haldaemon:/bin/false
    hplip:lp:/bin/false
    mail:mail:/bin/sh
    man:man:/bin/sh
    messagebus:messagebus:/bin/false
    news:news:/bin/sh
    nobody:nogroup:/bin/sh
    
    But still had to do the below, like you suggested, to not get the above result:
    
    join --nocheck-order -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
    
    games:games:/bin/sh
    gdm:gdm:/bin/false
    gnats:gnats:/bin/sh
    haldaemon:haldaemon:/bin/false
    hplip:lp:/bin/false
    mail:mail:/bin/sh
    man:man:/bin/sh
    messagebus:messagebus:/bin/false
    news:news:/bin/sh
    nobody:nogroup:/bin/sh
    
    rm /tmp/passwd.sorted
    rm /tmp/group.sorted
    
    Thanks for this good article by the way, I sure am doing some learning.
    
    - Geoff says:
      
      Wednesday May 2nd, 2012 at 08:47 AM
      
      The input files need to be sorted using the same keys that will be used for the join:
      
      sort -t : -k 4,4 /etc/passwd > /tmp/passwd.sorted
      sort -t : -k 3,3 /etc/group > /tmp/group.sorted
      join -t : -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
      
      You will likely get a lot more output from this than you did with wrongly sorted input files. The point of the “not in sorted order” messages is to warn you that you are probably not getting the output you wanted. It’s a bad idea to suppress the messages unless you really know what you are doing and why.
      

Linuxaria
