Apr 29 2012
 

These three small commands are not widely known, but I think they can work miracles when you have to handle text files from the terminal: comparing them, operating on the lines of a single file, or merging two files according to some criteria.

In this article I’ll show you the most common options for these commands and some practical examples on how to use them.

Uniq



This command filters adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.

Useful options:

-c, --count
prefix lines by the number of occurrences

-d, --repeated
only print duplicate lines

-D, --all-repeated[=delimit-method]
print all duplicate lines
delimit-method={none(default),prepend,separate} Delimiting is done with blank lines

-f, --skip-fields=N
avoid comparing the first N fields

-i, --ignore-case
ignore differences in case when comparing

-s, --skip-chars=N
avoid comparing the first N characters

-w, --check-chars=N
compare no more than N characters in lines

So let’s see some examples with a file like this (file.txt):

aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

uniq compares only adjacent lines, so you must first order the content of the file with the sort command and send its output to uniq through a pipe:

root@xubuntu-home:/tmp/article# sort file.txt |uniq
12 34
aa
aa bb
aa bb cc
aa bb fe
cc bb
fe fe fe

In this output uniq has removed 2 lines from the original file. If you want to see the number of occurrences of each line, you can use the -c option:

root@xubuntu-home:/tmp/article# sort file.txt |uniq -c
      1 12 34
      1 aa
      2 aa bb
      1 aa bb cc
      1 aa bb fe
      2 cc bb
      1 fe fe fe

The -c option is really useful and can be used to count all sorts of things, such as:

- Processes per user: ps hax -o user | sort | uniq -c

      2 avahi
      1 colord
      1 daemon
     60 linuxaria
      1 messagebus
      2 postfix
     82 root
      1 rtkit
      1 syslog

- Number and type of active connections: netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c

     49 ESTABLISHED
      6 LISTEN
     25 TIME_WAIT
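
A common extension of this counting idiom is to rank the results by piping them through sort a second time. The following is a minimal sketch that assumes a hypothetical list of login shells instead of real system data; the variable names are illustrative only.

```shell
# Hypothetical sample data: one login shell per line
shells=$(printf '%s\n' /bin/bash /bin/sh /bin/bash /bin/false /bin/bash /bin/sh)

# Count each distinct line, then sort the counts numerically in reverse
# so that the most frequent value comes first
ranking=$(printf '%s\n' "$shells" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranking"
```

Appending | head -n 1 to the pipeline would keep only the most frequent value.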

But now back to our file: if we want to compare only the first 2 characters of each line we can use sort file.txt | uniq -w 2, and the output will be:

12 34
aa
cc bb
fe fe fe
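
Two of the other options listed above, -d and -f, can be tried in the same way. This is only a sketch: the log lines and variable names are made up for illustration, and LC_ALL=C is set just to make the sort order predictable.

```shell
# Use the C locale so that sort orders lines byte by byte
export LC_ALL=C

# -d: print one copy of each line that occurs more than once (input must be sorted)
repeated=$(printf '%s\n' 'aa bb' 'cc bb' 'aa bb' '12 34' 'cc bb' | sort | uniq -d)
printf '%s\n' "$repeated"

# Hypothetical log: a timestamp field followed by a message
log=$(printf '%s\n' '10:01 disk full' '10:02 disk full' '10:03 login ok' '10:04 disk full')

# -f 1: skip the first field (the timestamp) when comparing, so that
# consecutive lines with the same message collapse into the first one
collapsed=$(printf '%s\n' "$log" | uniq -f 1)
printf '%s\n' "$collapsed"
```

The log is deliberately not sorted here: the goal is to collapse only consecutive repeats of the same message, so the last "disk full" line survives.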

Comm

The comm command is a utility that is used to compare two files for common and distinct lines.
comm reads two files as input, regarded as lines of text. comm outputs one file, which contains three columns. The first two columns contain lines unique to the first and second file, respectively. The last column contains lines common to both.

Like uniq, comm expects its input lines to be sorted, so we’ll use the sort command here as well.

From the comm(1) man page, the available options are:

-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files

So if we have these input files:

# cat file.txt
aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

# cat file2.txt
aa bb cc
aa bb dd
aa 22
cc bb 33
fe fe fe fe
aa bb fe
aa bb
cc bb 11

To find only the lines that are common to both files, we first sort each file into a temporary file:

# sort file.txt > file.txt.sorted
# sort file2.txt > file2.txt.sorted

Now we can compare them:

# comm -12 file.txt.sorted file2.txt.sorted
aa bb
aa bb cc
aa bb fe

With process substitution we can do all this with one line and get the same result:

# comm -12 <(sort file.txt) <(sort file2.txt)

Note that without the -12 option we would get all three columns, like this:

# comm <(sort file.txt) <(sort file2.txt)
12 34
aa
	aa 22
		aa bb
aa bb
		aa bb cc
	aa bb dd
		aa bb fe
cc bb
cc bb
	cc bb 11
	cc bb 33
fe fe fe
	fe fe fe fe
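
The other combinations of -1, -2 and -3 work the same way and give the lines that exist in only one of the files. Here is a small sketch with two made-up files in a temporary directory; the file names and contents are assumptions for the example, and temporary files are used instead of process substitution so that it also runs in a plain POSIX shell.

```shell
# Use the C locale so that sort and comm agree on the ordering
export LC_ALL=C

workdir=$(mktemp -d)
printf '%s\n' 'aa bb cc' 'aa bb' 'cc bb' '12 34' | sort > "$workdir/a.sorted"
printf '%s\n' 'aa bb cc' 'aa bb' 'cc bb 11'      | sort > "$workdir/b.sorted"

# -23: suppress columns 2 and 3, leaving the lines found only in the first file
only_a=$(comm -23 "$workdir/a.sorted" "$workdir/b.sorted")

# -13: suppress columns 1 and 3, leaving the lines found only in the second file
only_b=$(comm -13 "$workdir/a.sorted" "$workdir/b.sorted")

printf 'only in a:\n%s\n' "$only_a"
printf 'only in b:\n%s\n' "$only_b"

rm -rf "$workdir"
```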


Join

The join command takes as input two text files and a number of options. If no option is given, it looks for pairs of lines from the two files that have the same first field (a sequence of non-blank characters) and outputs a line composed of that first field followed by the rest of the two lines. Like comm, join expects both files to be sorted, in this case on the join field.

The options specify which character to use instead of blanks to separate the fields of a line, which field to use when looking for matching lines, and whether to output lines that do not match. The output can be saved to another file with redirection instead of being printed.
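
The default behaviour can be tried on two tiny files. This is just a sketch: users.txt and shells.txt are made-up names with made-up contents, already sorted on the first field. The -a 1 option additionally prints the lines of the first file that have no match.

```shell
# Use the C locale so that join's idea of sorted order is predictable
export LC_ALL=C

workdir=$(mktemp -d)
printf '%s\n' '1 alice' '2 bob' '3 carol'  > "$workdir/users.txt"
printf '%s\n' '1 /bin/bash' '3 /bin/zsh'   > "$workdir/shells.txt"

# Default: join on the first field, keep only the lines that pair up
matched=$(join "$workdir/users.txt" "$workdir/shells.txt")

# -a 1: also print the unpairable lines from the first file
with_unmatched=$(join -a 1 "$workdir/users.txt" "$workdir/shells.txt")

printf '%s\n' "$matched"
printf '%s\n' "$with_unmatched"

rm -rf "$workdir"
```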

A good example would be getting a username and its default login shell from /etc/passwd, together with the group name from /etc/group.

The numeric group ID is the 4th field in /etc/passwd and the 3rd field in /etc/group, so we'll use these 2 fields to join the information:

$ join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /etc/passwd /etc/group
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
lp:lp:/bin/sh
mail:mail:/bin/sh
news:news:/bin/sh
proxy:proxy:/bin/sh

Explanation of the command:

-t ":" use the character : as field separator
-1 4 -2 3 join on field 4 of file 1 and on field 3 of file 2
-o 1.1 2.1 1.7 show as output field 1 of file 1, field 1 of file 2 and field 7 of file 1
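
Since join compares the join field as text, both inputs should be sorted on that field, and as text rather than as numbers. The sketch below shows this on made-up passwd and group excerpts in a temporary directory instead of the real system files; the entries and file names are assumptions for the example.

```shell
# Use the C locale so that sort and join agree on the ordering
export LC_ALL=C

workdir=$(mktemp -d)

# Hypothetical excerpts in /etc/passwd and /etc/group format
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' \
  'mail:x:8:8:mail:/var/mail:/bin/sh' \
  'lp:x:7:7:lp:/var/spool/lpd:/bin/sh' \
  'uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh' > "$workdir/passwd"
printf '%s\n' 'root:x:0:' 'daemon:x:1:' 'lp:x:7:' 'mail:x:8:' 'adm:x:4:' 'uucp:x:10:' > "$workdir/group"

# Sort each file on its join field only (field 4 and field 3 respectively).
# The sort is textual, so GID 10 comes before GID 7, which is what join expects.
sort -t : -k 4,4 "$workdir/passwd" > "$workdir/passwd.sorted"
sort -t : -k 3,3 "$workdir/group"  > "$workdir/group.sorted"

joined=$(join -t : -1 4 -2 3 -o 1.1,2.1,1.7 "$workdir/passwd.sorted" "$workdir/group.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

The adm group has no matching user in the sample, so it does not appear in the output.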


  6 Responses to “Uniq, comm and join 3 Linux command for the CLI”

  1. Hi, just pointing out that the line
    #sort file.txt.sorted file2.txt.sorted

    should be:
    #comm file.txt.sorted file2.txt.sorted

  2. With #sort file.txt.sorted file2.txt.sorted

    Am I supposed to get the result below with the above command?
    aa bb
    aa bb cc
    aa bb fe

    If so, this is what I am getting:
    12 34
    aa
    aa 22
    aa bb
    aa bb
    aa bb
    aa bb cc
    aa bb cc
    aa bb dd
    aa bb fe
    aa bb fe
    cc bb
    cc bb
    cc bb 11
    cc bb 33
    fe fe fe
    fe fe fe fe

    I had to do the following for it to work:
    comm -12 <(/usr/bin/sort "file1.sorted") <(/usr/bin/sort "file2.sorted")
    aa bb
    aa bb cc
    aa bb fe

    This is what I get when I run the join command:
    Running join command
    root:root:/bin/bash
    daemon:daemon:/bin/sh
    bin:bin:/bin/sh
    sys:sys:/bin/sh
    join: file 1 is not in sorted order
    lp:lp:/bin/sh
    mail:mail:/bin/sh
    join: file 2 is not in sorted order
    news:news:/bin/sh
    uucp:uucp:/bin/sh
    proxy:proxy:/bin/sh
    www-data:www-data:/bin/sh
    backup:backup:/bin/sh
    list:list:/bin/sh
    irc:irc:/bin/sh
    gnats:gnats:/bin/sh
    nobody:nogroup:/bin/sh
    libuuid:libuuid:/bin/sh
    syslog:syslog:/bin/false
    messagebus:messagebus:/bin/false
    haldaemon:haldaemon:/bin/false

    • There was a typo in the article. The command is not: #sort file.txt.sorted file2.txt.sorted; the correct command is: #comm -12 file.txt.sorted file2.txt.sorted

      As for the join command, I did not get that message on my Ubuntu 12.04. To suppress that output you can use --nocheck-order, or you can first sort both /etc/passwd and /etc/group.

      Best regards

      • Ok kool, good to know it was a typo, it is clearer now.

        First time coming across the comm command.

        This is what is happening with the join command on Ubuntu 10.04

        I tried this:

        sort /etc/passwd > /tmp/passwd.sorted
        sort /etc/group > /tmp/group.sorted

        join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted

        But got the below results:

        join: file 1 is not in sorted order
        join: file 2 is not in sorted order
        games:games:/bin/sh
        gdm:gdm:/bin/false
        gnats:gnats:/bin/sh
        haldaemon:haldaemon:/bin/false
        hplip:lp:/bin/false
        mail:mail:/bin/sh
        man:man:/bin/sh
        messagebus:messagebus:/bin/false
        news:news:/bin/sh
        nobody:nogroup:/bin/sh

        But I still had to do the following, as you suggested, to avoid the messages above:

        join --nocheck-order -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted

        games:games:/bin/sh
        gdm:gdm:/bin/false
        gnats:gnats:/bin/sh
        haldaemon:haldaemon:/bin/false
        hplip:lp:/bin/false
        mail:mail:/bin/sh
        man:man:/bin/sh
        messagebus:messagebus:/bin/false
        news:news:/bin/sh
        nobody:nogroup:/bin/sh

        rm /tmp/passwd.sorted
        rm /tmp/group.sorted

        Thanks for this good article by the way, I sure am doing some learning.

        • The input files need to be sorted using the same keys that will be used for the join:

          sort -t : -k 4,4 /etc/passwd > /tmp/passwd.sorted
          sort -t : -k 3,3 /etc/group > /tmp/group.sorted
          join -t : -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted

          You will likely get a lot more output from this than you did with wrongly sorted input files. The point of the “not in sorted order” messages is to warn you that you are probably not getting the output you wanted. It’s a bad idea to suppress the messages unless you really know what you are doing and why.
