Uniq, comm and join: 3 commands for the Linux CLI

Apr 29 2012

These small commands are not widely known, but I think they can work wonders when you are at a terminal working with text files and need to compare them, operate on the lines of a single file, or merge two files according to some criteria.

In this article I will show you the most common options for these commands and some practical examples of how to use them.

Uniq

This command filters adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output).
With no options, matching lines are merged into the first occurrence.

Useful options:

-c, --count
prefix lines with the number of occurrences

-d, --repeated
print only duplicate lines, one for each group

-D, --all-repeated[=delimit-method]
print all duplicate lines
delimit-method = {none (default), prepend, separate}; delimiting is done with blank lines

-f, --skip-fields=N
avoid comparing the first N fields

-i, --ignore-case
ignore differences in case when comparing

-s, --skip-chars=N
avoid comparing the first N characters

-w, --check-chars=N
compare no more than N characters per line
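
A minimal sketch of a few of these options on throwaway data. The file names and their contents are invented for illustration and do not come from the article:

```shell
tmp=$(mktemp -d)
# Made-up sample data
printf '%s\n' 'apple' 'Apple' 'banana' 'banana' 'cherry' > "$tmp/fruits.txt"

# -d: print only lines that have an adjacent duplicate (one copy per group)
dups=$(uniq -d "$tmp/fruits.txt")

# -i: ignore case, so "apple" and "Apple" collapse into the first one
nocase=$(uniq -i "$tmp/fruits.txt")

# -f 1: skip the first field before comparing
printf '%s\n' '1 login' '2 login' '3 logout' > "$tmp/events.txt"
byfield=$(uniq -f 1 "$tmp/events.txt")

printf '%s\n' "$dups" '--' "$nocase" '--' "$byfield"
rm -r "$tmp"
```

Each result is captured in a variable only to keep the three cases separate; in everyday use you would simply run uniq with the option you need.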

Let's see some examples applied to a file like this one (file.txt):

aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

Uniq compares only adjacent lines, so you first have to use the sort command to order the contents of the file, and then send its output to uniq through a pipe:

root@xubuntu-home:/tmp/article# sort file.txt | uniq
12 34
aa
aa bb
aa bb cc
aa bb fe
cc bb
fe fe fe

In this output uniq has removed 2 lines from the original file. If you want to see the number of occurrences of each line, you can use the -c option:

root@xubuntu-home:/tmp/article# sort file.txt | uniq -c
      1 12 34
      1 aa
      2 aa bb
      1 aa bb cc
      1 aa bb fe
      2 cc bb
      1 fe fe fe
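
To see why the sort step matters, here is a small sketch that rebuilds file.txt in a temporary directory (the temporary path is an assumption of this example) and counts the lines that survive with and without sorting:

```shell
tmp=$(mktemp -d)
# The same nine lines as file.txt in the article
printf '%s\n' 'aa bb cc' 'aa bb' 'aa' 'cc bb' 'fe fe fe' \
  'aa bb fe' 'aa bb' 'cc bb' '12 34' > "$tmp/file.txt"

# Without sort, no two identical lines are adjacent, so uniq removes nothing
unsorted_count=$(uniq "$tmp/file.txt" | wc -l | tr -d ' ')

# After sort, duplicates sit next to each other and uniq can collapse them
sorted_count=$(sort "$tmp/file.txt" | uniq | wc -l | tr -d ' ')

# sort -u is a shortcut that yields the same set of lines
sort_u_count=$(sort -u "$tmp/file.txt" | wc -l | tr -d ' ')

echo "uniq alone: $unsorted_count, sort | uniq: $sorted_count, sort -u: $sort_u_count"
rm -r "$tmp"
```

If you only need the distinct lines, sort -u is enough; the sort | uniq form is still needed when you want uniq options such as -c or -d.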

The -c option is really useful and can be used to count a number of things, such as:

– Processes per user: ps hax -o user | sort | uniq -c

      2 avahi
      1 colord
      1 daemon
     60 linuxaria
      1 messagebus
      2 postfix
     82 root
      1 rtkit
      1 syslog

– Number and type of active connections: netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c

     49 ESTABLISHED
      6 LISTEN
     25 TIME_WAIT
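
A common follow-up to these counts, sketched here on invented data, is to rank them by piping the output of uniq -c through sort -rn. The users.txt file and its contents are hypothetical:

```shell
tmp=$(mktemp -d)
# Hypothetical list of user names, one per line (not from the article)
printf '%s\n' root alice root bob alice root > "$tmp/users.txt"

# Count occurrences, then sort numerically in reverse so the largest count comes first.
# awk only normalizes the leading padding that uniq -c adds.
ranking=$(sort "$tmp/users.txt" | uniq -c | sort -rn | awk '{print $1, $2}')

printf '%s\n' "$ranking"
rm -r "$tmp"
```

The same sort | uniq -c | sort -rn tail can be appended to the ps and netstat pipelines above.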

But now let's go back to our file. If we want to compare only the first 2 characters we can use sort file.txt | uniq -w 2 and the output will be:

12 34
aa
cc bb
fe fe fe

Comm

The comm command is a utility that compares two files to find their common and distinct lines.
comm reads two files as input, treated as lines of text, and writes a single output with three columns. The first two columns contain the lines unique to the first and second file, respectively. The last column contains the lines common to both.

Like uniq, comm expects the lines to be sorted, so we will use the sort command with this one as well.

From the COMM(1) man page, the available options are:

-1 suppress the lines unique to FILE1
-2 suppress the lines unique to FILE2
-3 suppress the lines that appear in both files
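
The three options can be combined to pick out exactly one column. A minimal sketch, using two made-up files (a.txt and b.txt are assumptions of this example) that are already sorted:

```shell
tmp=$(mktemp -d)
# Two small, already sorted lists (invented data)
printf '%s\n' alpha beta gamma > "$tmp/a.txt"
printf '%s\n' beta delta gamma > "$tmp/b.txt"

only_a=$(comm -23 "$tmp/a.txt" "$tmp/b.txt")  # hide columns 2 and 3: lines only in a.txt
only_b=$(comm -13 "$tmp/a.txt" "$tmp/b.txt")  # hide columns 1 and 3: lines only in b.txt
both=$(comm -12 "$tmp/a.txt" "$tmp/b.txt")    # hide columns 1 and 2: lines in both files

printf '%s\n' "only in a: $only_a" "only in b: $only_b" "in both:" "$both"
rm -r "$tmp"
```

When a column is suppressed, comm also drops the tab that would have indented the columns to its right, so the selected lines come out flush left.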

So if we have these input files:

# cat file.txt
aa bb cc
aa bb
aa
cc bb
fe fe fe
aa bb fe
aa bb
cc bb
12 34

# cat file2.txt
aa bb cc
aa bb dd
aa 22
cc bb 33
fe fe fe fe
aa bb fe
aa bb
cc bb 11

To find only the lines that are common to both files, first we sort the files into temporary files:

# sort file.txt > file.txt.sorted
# sort file2.txt > file2.txt.sorted

Now we can compare them:

# comm -12 file.txt.sorted file2.txt.sorted
aa bb
aa bb cc
aa bb fe

With process substitution we can do all of this in a single line and get the same result:

# comm -12 <(sort file.txt) <(sort file2.txt)

Note that without the -12 option we would have had output like this:

# comm <(sort file.txt) <(sort file2.txt)
12 34
aa
	aa 22
		aa bb
aa bb
		aa bb cc
	aa bb dd
		aa bb fe
cc bb
cc bb
	cc bb 11
	cc bb 33
fe fe fe
	fe fe fe fe

Join

The join command takes two text files and a set of options as input. If no options are given on the command line, it looks for pairs of lines from the two files that have the same first field (a sequence of characters other than spaces), and outputs a line made up of the first field followed by the rest of the two lines.

Options can specify the character to use instead of whitespace to separate the fields of a line, which field to use when looking for matching lines, and whether to show lines that have no match. The output can be stored in another file, rather than printed, by using redirection.
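
The default behaviour described above can be sketched with two tiny files. names.txt and roles.txt are invented for this example, and both are sorted on their first field:

```shell
tmp=$(mktemp -d)
# Made-up data, both files sorted on their first field
printf '%s\n' '1 alice' '2 bob' '3 carol' > "$tmp/names.txt"
printf '%s\n' '1 admin' '3 guest' '4 ops' > "$tmp/roles.txt"

# Default: only keys present in both files are printed
inner=$(join "$tmp/names.txt" "$tmp/roles.txt")

# -a 1: also keep the lines of file 1 that have no match in file 2
with_unpaired=$(join -a 1 "$tmp/names.txt" "$tmp/roles.txt")

printf '%s\n' "$inner" '--' "$with_unpaired"
rm -r "$tmp"
```

Key 2 appears only in names.txt and key 4 only in roles.txt, so neither shows up in the default output; -a 1 brings back the unpaired line from the first file.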

A good example is taking the username and its default login shell as listed in the /etc/passwd file, and the group from /etc/group.
The numeric group ID is the fourth field in /etc/passwd and the third field in /etc/group, so we will use these fields to join the two files:

$ join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /etc/passwd /etc/group
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
lp:lp:/bin/sh
mail:mail:/bin/sh
news:news:/bin/sh
proxy:proxy:/bin/sh

Explanation of the command:

-t ":" use the : character as the field separator
-1 4 -2 3 join on field 4 of file 1 and field 3 of file 2
-o 1.1 2.1 1.7 output field 1 of file 1, field 1 of file 2 and field 7 of file 1
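
join expects both inputs to be sorted on the join field; if they are not, it may print "not in sorted order" warnings and miss matches. The sketch below sorts each file on its key first. It uses invented stand-ins for /etc/passwd and /etc/group so that the result is predictable, and it sets LC_ALL=C, which is an assumption of this example, so that sort and join use the same collation:

```shell
export LC_ALL=C   # assumption: force one collation so sort and join agree
tmp=$(mktemp -d)

# Invented stand-ins for /etc/passwd and /etc/group, deliberately unsorted
cat > "$tmp/passwd" <<'EOF'
alice:x:1000:100:Alice:/home/alice:/bin/bash
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
EOF
cat > "$tmp/group" <<'EOF'
users:x:100:alice
root:x:0:
daemon:x:1:
EOF

# Sort each file on the field that join will use as its key
sort -t : -k 4,4 "$tmp/passwd" > "$tmp/passwd.sorted"
sort -t : -k 3,3 "$tmp/group"  > "$tmp/group.sorted"

# Same join as in the article, with the output fields written as one comma-separated list
result=$(join -t : -1 4 -2 3 -o 1.1,2.1,1.7 "$tmp/passwd.sorted" "$tmp/group.sorted")
printf '%s\n' "$result"
rm -r "$tmp"
```

Note that the key is compared as text, not as a number, so the files must be sorted lexically (plain sort, not sort -n) for join to accept them.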

6 Responses to “Uniq, comm and join: 3 commands for the Linux CLI”

@Ste says:

30 April 2012 at 13:45

Hi, I'd like to point out that the line
#sort file.txt.sorted file2.txt.sorted

should be:
#comm file.txt.sorted file2.txt.sorted

- linuxari says:
  
  30 April 2012 at 17:42
  
  Thank you very much,
  
  An error in transcribing the commands 🙂
  
scribe6324 says:

30 April 2012 at 19:32

With #sort file.txt.sorted file2.txt.sorted

Am I supposed to get the result below with the above command?
aa bb
aa bb cc
aa bb fe

If so this is what i am getting
12 34
aa
aa 22
aa bb
aa bb
aa bb
aa bb cc
aa bb cc
aa bb dd
aa bb fe
aa bb fe
cc bb
cc bb
cc bb 11
cc bb 33
fe fe fe
fe fe fe fe

I had to do the below for it to work:
comm -12 < /usr/bin/sort "file1.sorted" < /usr/bin/sort "file2.sorted"
aa bb
aa bb cc
aa bb fe

I am getting this when i run the join command
Running join command
root:root:/bin/bash
daemon:daemon:/bin/sh
bin:bin:/bin/sh
sys:sys:/bin/sh
join: file 1 is not in sorted order
lp:lp:/bin/sh
mail:mail:/bin/sh
join: file 2 is not in sorted order
news:news:/bin/sh
uucp:uucp:/bin/sh
proxy:proxy:/bin/sh
www-data:www-data:/bin/sh
backup:backup:/bin/sh
list:list:/bin/sh
irc:irc:/bin/sh
gnats:gnats:/bin/sh
nobody:nogroup:/bin/sh
libuuid:libuuid:/bin/sh
syslog:syslog:/bin/false
messagebus:messagebus:/bin/false
haldaemon:haldaemon:/bin/false

- linuxari says:
  
  30 April 2012 at 20:41
  
  There was a typo in the article. The command is not: #sort file.txt.sorted file2.txt.sorted; the correct one is: #comm -12 file.txt.sorted file2.txt.sorted
  
  For the join command, I did not get that message on my Ubuntu 12.04. To suppress that output you can use --nocheck-order, or you can first sort both /etc/passwd and /etc/group.
  
  Best regards
  
  - scribe6324 says:
    
    30 April 2012 at 22:14
    
    Ok kool, good to know it was a typo, it is clearer now.
    
    First time coming across the comm command.
    
    This is what is happening with the join command on Ubuntu 10.04
    
    I tried this:
    
    sort /etc/passwd > /tmp/passwd.sorted
    sort /etc/group > /tmp/group.sorted
    
    join -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
    
    But got the below results:
    
    join: file 1 is not in sorted order
    join: file 2 is not in sorted order
    games:games:/bin/sh
    gdm:gdm:/bin/false
    gnats:gnats:/bin/sh
    haldaemon:haldaemon:/bin/false
    hplip:lp:/bin/false
    mail:mail:/bin/sh
    man:man:/bin/sh
    messagebus:messagebus:/bin/false
    news:news:/bin/sh
    nobody:nogroup:/bin/sh
    
    But still had to do the below, like you suggested, to not get the above result:
    
    join --nocheck-order -t ":" -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
    
    games:games:/bin/sh
    gdm:gdm:/bin/false
    gnats:gnats:/bin/sh
    haldaemon:haldaemon:/bin/false
    hplip:lp:/bin/false
    mail:mail:/bin/sh
    man:man:/bin/sh
    messagebus:messagebus:/bin/false
    news:news:/bin/sh
    nobody:nogroup:/bin/sh
    
    rm /tmp/passwd.sorted
    rm /tmp/group.sorted
    
    Thanks for this good article by the way, i sure am doing some learning.
    
    - Geoff says:
      
      2 May 2012 at 08:47
      
      The input files need to be sorted using the same keys that will be used for the join:
      
      sort -t : -k 4,4 /etc/passwd > /tmp/passwd.sorted
      sort -t : -k 3,3 /etc/group > /tmp/group.sorted
      join -t : -1 4 -2 3 -o 1.1 2.1 1.7 /tmp/passwd.sorted /tmp/group.sorted
      
      You will likely get a lot more output from this than you did with wrongly sorted input files. The point of the “not in sorted order” messages is to warn you that you are probably not getting the output you wanted. It’s a bad idea to suppress the messages unless you really know what you are doing and why.
      
