These small commands are not widely known, but I think they can work wonders when you have to work on text files from the terminal and need to compare them, operate on the lines of a single file, or merge two files according to some criteria.
In this article I’ll show you the most common options for these commands and some practical examples on how to use them.
Uniq
This command filters adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
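To see what "adjacent" means in practice, here is a minimal sketch. The file name demo.txt, its content and the temporary directory are made up for illustration.

```shell
# Work in a throwaway directory with hypothetical sample data.
tmpdir=$(mktemp -d)
printf 'aa\nbb\naa\naa\n' > "$tmpdir/demo.txt"

# Only the two neighbouring "aa" lines are merged; the first "aa" survives
# because "bb" sits between it and the others.
unsorted=$(uniq "$tmpdir/demo.txt")
echo "$unsorted"
# aa
# bb
# aa

# After sorting, all three "aa" lines are adjacent and collapse into one.
sorted=$(sort "$tmpdir/demo.txt" | uniq)
echo "$sorted"
# aa
# bb

rm -r "$tmpdir"
```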
Useful options:
-c, --count
prefix lines by the number of occurrences
-d, --repeated
only print duplicate lines
-D, --all-repeated[=delimit-method]
print all duplicate lines
delimit-method={none(default),prepend,separate} Delimiting is done with blank lines
-f, --skip-fields=N
avoid comparing the first N fields
-i, --ignore-case
ignore differences in case when comparing
-s, --skip-chars=N
avoid comparing the first N characters
-w, --check-chars=N
compare no more than N characters in lines
So let's see some examples with a file like this (file.txt):
aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34
uniq compares only adjacent lines, so first you must use the sort command to order the content of the file, and then send its output to uniq through a pipe:
root@xubuntu-home:/tmp/article# sort file.txt | uniq
12 34
aa
aa bb
aa bb cc
aa bb fe
cc bb
fe fe fe
In the output, uniq has removed 2 lines from the original file. If you want to see the number of occurrences of each line, you can use the -c option:
root@xubuntu-home:/tmp/article# sort file.txt | uniq -c
      1 12 34
      1 aa
      2 aa bb
      1 aa bb cc
      1 aa bb fe
      2 cc bb
      1 fe fe fe
The -c option is really useful and can be used to count a number of things, such as:
– Processes per user: ps hax -o user | sort | uniq -c
      2 avahi
      1 colord
      1 daemon
     60 linuxaria
      1 messagebus
      2 postfix
     82 root
      1 rtkit
      1 syslog
– Number and type of active connections: netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c
     49 ESTABLISHED
      6 LISTEN
     25 TIME_WAIT
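A common follow-up to uniq -c is to rank the counts with a second sort. The sketch below assumes a made-up list of login shells as input; the variable names are only illustrative.

```shell
# Hypothetical input: one login shell per line.
shells=$(printf '/bin/bash\n/bin/sh\n/bin/bash\n/bin/false\n/bin/bash\n/bin/sh\n')

# Count each distinct line, then order by count, highest first.
# awk strips the padding that uniq -c puts in front of each count.
ranking=$(printf '%s\n' "$shells" | sort | uniq -c | sort -rn | awk '{print $1, $2}')
echo "$ranking"
# 3 /bin/bash
# 2 /bin/sh
# 1 /bin/false
```

The first sort groups identical lines so that uniq can count them; the second one (sort -rn) orders the result numerically in reverse.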
But now back to our file: if we want to compare only the first 2 characters of each line, we can use sort file.txt | uniq -w 2
and the output will be:
12 34
aa
cc bb
fe fe fe
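The same file can also be used to try -d from the list above, together with -u, which is not listed above but is a standard uniq option that prints only the lines that are not repeated. This is a sketch: the temporary directory is an assumption, and LC_ALL=C is set only to make the sort order reproducible.

```shell
export LC_ALL=C   # assumption: fixed locale so the sort order is predictable
tmpdir=$(mktemp -d)
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb' '12 34' > "$tmpdir/file.txt"

# -d: print one copy of each line that occurs more than once.
dups=$(sort "$tmpdir/file.txt" | uniq -d)
echo "$dups"
# aa bb
# cc bb

# -u: print only the lines that occur exactly once.
singles=$(sort "$tmpdir/file.txt" | uniq -u)
echo "$singles"
# 12 34
# aa
# aa bb cc
# aa bb fe
# fe fe fe

rm -r "$tmpdir"
```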
Comm
The comm command is a utility that is used to compare two files for common and distinct lines.
comm reads two files as input, regarded as lines of text. comm outputs one file, which contains three columns. The first two columns contain lines unique to the first and second file, respectively. The last column contains lines common to both.
Like uniq, comm expects the lines to be sorted, so with this command too we'll use the sort command.
From the COMM(1) man page, the options available are:
-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files
So if we have these input files:
# cat file.txt
aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34
# cat file2.txt
aa bb cc
aa bb dd
aa 22
cc bb 33
fe fe fe fe
aa bb fe
aa bb
cc bb 11
To find only the lines that are common to both files, we first sort each file into a temporary file:
# sort file.txt > file.txt.sorted
# sort file2.txt > file2.txt.sorted
Now we can compare them:
# comm -12 file.txt.sorted file2.txt.sorted
aa bb
aa bb cc
aa bb fe
With process substitution (a bash feature) we can do all this in one line and get the same result:
# comm -12 <(sort file.txt) <(sort file2.txt)
Note that without the -12 option we would get all three columns, like this:
# comm <(sort file.txt) <(sort file2.txt)
12 34
aa
	aa 22
		aa bb
aa bb
		aa bb cc
	aa bb dd
		aa bb fe
cc bb
cc bb
	cc bb 11
	cc bb 33
fe fe fe
	fe fe fe fe
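The other option combinations work the same way: -23 leaves only the lines unique to the first file, and -13 only those unique to the second. Here is a sketch using the two sample files from this section; the temporary directory is an assumption, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
export LC_ALL=C   # assumption: fixed locale so sort and comm use the same order
tmpdir=$(mktemp -d)
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb' '12 34' | sort > "$tmpdir/file.txt.sorted"
printf '%s\n' 'aa bb cc' 'aa bb dd' 'aa 22' 'cc bb 33' 'fe fe fe fe' \
              'aa bb fe' 'aa bb' 'cc bb 11' | sort > "$tmpdir/file2.txt.sorted"

# -23 hides columns 2 and 3: what remains is unique to the first file.
only_first=$(comm -23 "$tmpdir/file.txt.sorted" "$tmpdir/file2.txt.sorted")
echo "$only_first"

# -13 hides columns 1 and 3: what remains is unique to the second file.
only_second=$(comm -13 "$tmpdir/file.txt.sorted" "$tmpdir/file2.txt.sorted")
echo "$only_second"

rm -r "$tmpdir"
```

Note that the first file contains "aa bb" twice and the second only once, so one "aa bb" is reported as common and the other as unique to the first file.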
Join
The join command takes as input two text files and a number of options. If no options are given, this command looks for pairs of lines from the two files that have the same first field (a sequence of non-space characters), and outputs a line composed of the first field followed by the rest of the two lines.
The options specify which character to use instead of space to separate the fields of a line, which field to use when looking for matching lines, and whether to output lines that do not match. The output can be saved to a file with redirection instead of being printed.
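The default behaviour can be sketched with two small files. The file names ids.txt and shells.txt and their content are invented for this illustration; both are already sorted on the first field.

```shell
export LC_ALL=C   # assumption: fixed locale so the inputs count as sorted
tmpdir=$(mktemp -d)
# Hypothetical data, already sorted on the first field.
printf '%s\n' 'alice 1001' 'bob 1002' 'carol 1003' > "$tmpdir/ids.txt"
printf '%s\n' 'alice /bin/bash' 'carol /bin/sh' 'dave /bin/zsh' > "$tmpdir/shells.txt"

# No options: match on the first field, fields separated by blanks.
matched=$(join "$tmpdir/ids.txt" "$tmpdir/shells.txt")
echo "$matched"
# alice 1001 /bin/bash
# carol 1003 /bin/sh

# -a 1 also prints the lines of file 1 that found no partner in file 2.
with_unpaired=$(join -a 1 "$tmpdir/ids.txt" "$tmpdir/shells.txt")
echo "$with_unpaired"
# alice 1001 /bin/bash
# bob 1002
# carol 1003 /bin/sh

rm -r "$tmpdir"
```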
A good example is getting each username and its default login shell from /etc/passwd, together with its group name from /etc/group.
The numeric group ID is the 4th field in /etc/passwd and the 3rd field in /etc/group, so we'll use these 2 fields to join the information:
$ join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /etc/passwd /etc/group
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
lp:lp:/bin/sh
mail:mail:/bin/sh
news:news:/bin/sh
proxy:proxy:/bin/sh
Explanation of the command:
-t ":"
use the character : as field separator
-1 4 -2 3
join on field 4 of file 1 and on field 3 of file 2
-o 1.1 2.1 1.7
show as output field 1 of file 1, field 1 of file 2 and field 7 of file 1
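join expects both inputs to be sorted on the field used for the join, so a reproducible version of this example sorts each file on its key first. The sketch below uses hypothetical miniature copies of /etc/passwd and /etc/group rather than the real files, and writes the -o list in its comma-separated form.

```shell
export LC_ALL=C   # assumption: fixed locale so sort and join use the same order
tmpdir=$(mktemp -d)
# Hypothetical miniature versions of /etc/passwd and /etc/group.
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' \
  'mail:x:8:8:mail:/var/mail:/bin/sh' \
  'hplip:x:104:7:HPLIP:/var/run/hplip:/bin/false' > "$tmpdir/passwd"
printf '%s\n' 'root:x:0:' 'daemon:x:1:' 'lp:x:7:' 'mail:x:8:' > "$tmpdir/group"

# Sort each file on the field that join will compare.
sort -t : -k 4,4 "$tmpdir/passwd" > "$tmpdir/passwd.sorted"
sort -t : -k 3,3 "$tmpdir/group" > "$tmpdir/group.sorted"

# Field 4 of file 1 against field 3 of file 2; print user, group name, shell.
result=$(join -t : -1 4 -2 3 -o 1.1,2.1,1.7 "$tmpdir/passwd.sorted" "$tmpdir/group.sorted")
echo "$result"
# root:root:/bin/bash
# daemon:daemon:/bin/sh
# hplip:lp:/bin/false
# mail:mail:/bin/sh

rm -r "$tmpdir"
```

The sort here is a plain text sort, not a numeric one, because join compares the keys as strings.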
Hi, just letting you know that the line
#sort file.txt.sorted file2.txt.sorted
should be:
#comm file.txt.sorted file2.txt.sorted
Thanks a lot,
that was a mistake in transcribing the commands 🙂
With #sort file.txt.sorted file2.txt.sorted
am I supposed to get the result below with the above command?
aa bb
aa bb cc
aa bb fe
If so, this is what I am getting:
12 34
aa
aa 22
aa bb
aa bb
aa bb
aa bb cc
aa bb cc
aa bb dd
aa bb fe
aa bb fe
cc bb
cc bb
cc bb 11
cc bb 33
fe fe fe
fe fe fe fe
I had to do the below for it to work:
comm -12 < /usr/bin/sort "file1.sorted" < /usr/bin/sort "file2.sorted"
aa bb
aa bb cc
aa bb fe
I am getting this when I run the join command:
Running join command
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
sys:sys:/bin/sh
join: file 1 is not in sorted order
lp:lp:/bin/sh
mail:mail:/bin/sh
join: file 2 is not in sorted order
news:news:/bin/sh
uucp:uucp:/bin/sh
proxy:proxy:/bin/sh
www-data:www-data:/bin/sh
backup:backup:/bin/sh
list:list:/bin/sh
irc:irc:/bin/sh
gnats:gnats:/bin/sh
nobody:nogroup:/bin/sh
libuuid:libuuid:/bin/sh
syslog:syslog:/bin/false
messagebus:messagebus:/bin/false
haldaemon:haldaemon:/bin/false
There was a typo in the article. It is not: #sort file.txt.sorted file2.txt.sorted; the correct command is: #comm -12 file.txt.sorted file2.txt.sorted
As for the join command, I did not get that message on my Ubuntu 12.04. To suppress that output you can use --nocheck-order, or you can first sort both /etc/passwd and /etc/group.
Best regards
Ok cool, good to know it was a typo, it is clearer now.
This is the first time I have come across the comm command.
This is what is happening with the join command on Ubuntu 10.04
I tried this:
sort /etc/passwd > /tmp/passwd.sorted
sort /etc/group > /tmp/group.sorted
join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
But got the below results:
join: file 1 is not in sorted order
join: file 2 is not in sorted order
games:games:/bin/sh
gdm:gdm:/bin/false
gnats:gnats:/bin/sh
haldaemon:haldaemon:/bin/false
hplip:lp:/bin/false
mail:mail:/bin/sh
man:man:/bin/sh
messagebus:messagebus:/bin/false
news:news:/bin/sh
nobody:nogroup:/bin/sh
So I still had to do the below, like you suggested, to get rid of the messages above:
join --nocheck-order -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
games:games:/bin/sh
gdm:gdm:/bin/false
gnats:gnats:/bin/sh
haldaemon:haldaemon:/bin/false
hplip:lp:/bin/false
mail:mail:/bin/sh
man:man:/bin/sh
messagebus:messagebus:/bin/false
news:news:/bin/sh
nobody:nogroup:/bin/sh
rm /tmp/passwd.sorted
rm /tmp/group.sorted
Thanks for this good article by the way, I sure am doing some learning.
The input files need to be sorted using the same keys that will be used for the join:
sort -t : -k 4,4 /etc/passwd > /tmp/passwd.sorted
sort -t : -k 3,3 /etc/group > /tmp/group.sorted
join -t : -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
You will likely get a lot more output from this than you did with wrongly sorted input files. The point of the “not in sorted order” messages is to warn you that you are probably not getting the output you wanted. It’s a bad idea to suppress the messages unless you really know what you are doing and why.