37 Intro to Unix Filters
Sooner or later you will need to manipulate files from the command line: it could be `Rmd` files, R script files, image files, data files, etc. In this chapter, we’ll see new ways to manipulate data table files with shell commands and pipelines.
37.1 Crash Example
Let’s get our feet wet with a working example. The first thing you’ll need to do is create a directory for this tutorial.
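For example (the directory name `unix-filters` below is just a placeholder, use any name you like):

# create a directory for this tutorial and move into it
mkdir unix-filters
cd unix-filters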
Download the file `nba2017-players.csv` from the course github repository, and then invoke `ls` to check that the data file was successfully downloaded:
curl -O https://raw.githubusercontent.com/ucb-stat133/stat133-spring-2018/master/data/nba2017-players.csv
ls
Now that you have a data file, you can apply the basic commands that we’ve covered in the previous chapter for inspecting the file’s contents. Here’s a table with some of those commands, as well as functions in R that have a similar use:
Command | Description | R alternative |
---|---|---|
`wc nba2017-players.csv` | count lines, words, and bytes | `object.size()`, `nrow()` |
`wc -l nba2017-players.csv` | count number of lines | `nrow()` |
`head nba2017-players.csv` | inspect first 10 rows | `head()` |
`tail nba2017-players.csv` | inspect last 10 rows | `tail()` |
`less nba2017-players.csv` | see contents with a paginator | `View()` |
In addition to these inspection-oriented commands, you can also use other unix tools to carry out common manipulations of data tables: selecting rows and columns, sorting information, determining frequencies, finding unique values, etc.
37.1.1 Redirecting output to a new file with >
For demo purposes, let’s subset the data file `nba2017-players.csv` by taking the first 11 rows (that is, the row of column names plus the first 10 rows of data).
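One way to do this is with `head`, using the `-n` flag to ask for 11 lines instead of the default 10:

head -n 11 nba2017-players.csv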
Instead of displaying output on the screen, we can use the output redirection operator `>` to send it to a new file:
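# redirect the first 11 rows to a new file data10.csv
head -n 11 nba2017-players.csv > data10.csv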
You can use `cat` to check that `data10.csv` contains the column names and the first 10 lines of data:
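cat data10.csv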
37.1.2 Selecting columns with cut
How do we select a specific column, say `position`, from `data10.csv`? We can use the `cut` command for this purpose, with the flag `-d ","` to specify the field delimiter, and the flag `-f 3` to indicate that we want to extract the third column:
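cut -d "," -f 3 data10.csv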
In the same way we created `data10.csv`, we can redirect the output of `cut` to a new file `positions10.txt`:
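cut -d "," -f 3 data10.csv > positions10.txt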
37.1.3 Sorting lines with sort
Another useful command is `sort`, which, as you may guess, allows us to sort the lines of a stream of data:
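sort positions10.txt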
Notice that the name of the column, `position`, is also part of the output. But what if we just want to play with the position values, excluding the column name? We can use `tail -n +2`, which outputs its input starting at the second line, to exclude the first value (i.e. the column name). To do this with the column of positions, we must use the pipe operator `|`, which enables us to take the output of one command and send it as the input of another command:
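cut -d "," -f 3 data10.csv | tail -n +2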
Now let’s mix `|` and `>` to rebuild `positions10.txt` without the column name:
# positions (second attempt)
cut -d "," -f 3 data10.csv | tail +2 > positions10.txt
cat positions10.txt
Let’s go back to the sorting operation:
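# sort the position values (no column name this time)
sort positions10.txt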
37.1.4 Listing unique occurrences with sort -u
What if we want to list only the unique values (i.e. the unique categories)? `sort` has the flag `-u` to display only the unique occurrences:
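sort -u positions10.txt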
37.1.5 Counting unique occurrences with sort and uniq
And what if we want the counts of those unique values (i.e. the frequencies)? To find the answer we pipe the output of `sort` as the input of the command `uniq` with the flag `-c`. Here’s the entire pipeline:
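sort positions10.txt | uniq -c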
Now, let’s apply it on the entire data file, step by step:
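# step 1: extract the column of positions
cut -d "," -f 3 nba2017-players.csv
# step 2: exclude the column name
cut -d "," -f 3 nba2017-players.csv | tail -n +2
# step 3: sort the position values
cut -d "," -f 3 nba2017-players.csv | tail -n +2 | sort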
37.1.6 All in one pipeline
Finally, let’s pipe all the commands in a single line, without creating the intermediate file `positions10.txt`:
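# count unique position values, i.e. number of players per position
cut -d "," -f 3 nba2017-players.csv | tail -n +2 | sort | uniq -c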
37.1.7 More examples
What if you want to do the same but now for the teams? In other words, how do you count the number of players on each team?
# count unique team values, i.e. number of players per team
cut -d "," -f 2 nba2017-players.csv | tail -n +2 | sort | uniq -c
Find the minimum age (6th column).
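One way to do it (sorting the ages numerically with `sort -n` and keeping the first line):

cut -d "," -f 6 nba2017-players.csv | tail -n +2 | sort -n | head -n 1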
Find the maximum age (6th column).
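Similarly, keeping the last line instead:

cut -d "," -f 6 nba2017-players.csv | tail -n +2 | sort -n | tail -n 1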
Frequencies of ages:
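# frequencies of age values
cut -d "," -f 6 nba2017-players.csv | tail -n +2 | sort -n | uniq -c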
37.2 Filters
In the above examples we used a set of commands that are formally known as filters:
- `sort`
- `cut`
- `uniq`
- etc. (there are more filters)
Filters are a particular type of unix program that expects to work either with file redirection or as part of a pipeline. These programs read input from standard input, write output to standard output, and often don’t require any file arguments.
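For example, a filter like `sort` behaves the same whether you give it a file argument, redirect a file into it, or pipe it the output of another command (a quick check, reusing the `positions10.txt` file from above):

# the three invocations below produce the same output
sort positions10.txt
sort < positions10.txt
cat positions10.txt | sort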
37.3 Extracting columns with cut
When working with files that have a tabular structure (e.g. csv, tsv, field delimited) it is very common to focus on one or more “columns”. To pull vertical columns from a file, you can use the `cut` command.

`cut` operates based either on character position within the line, when using the `-c` flag, or on delimited fields, when using the `-f` flag. By default, `cut` expects tabs as the delimiter. If a file separates fields with spaces, commas, or any other delimiter, you need to use the option `-d`, indicating the character used as field delimiter between quote marks.
Option | Description |
---|---|
`-f 1,5` | return columns 1 and 5, delimited by tabs. |
`-f 1-5` | return columns 1 through 5, delimited by tabs. |
`-d ","` | use commas as the delimiter. |
`-c 2-7` | return characters 2 through 7 from the file. |
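For instance, applying these options to the comma-delimited data file from the crash example:

# extract the team and position columns (2nd and 3rd fields)
cut -d "," -f 2,3 nba2017-players.csv
# extract the first 5 characters of every line
cut -c 1-5 nba2017-players.csv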
37.4 Sorting lines with sort
The output stream produced by many commands, as well as the lines of a file or of a series of files, can be sorted into alphabetical order with the `sort` command. In other words, `sort` reads information and sorts it alphabetically. You can customize the behavior of `sort` to ignore the case of words, and to reverse the order of a sort. This command also enables you to sort lists of numbers. The table below shows some of the common options for the `sort` command:
Option | Description |
---|---|
`-n` | sort in numerical order rather than alphabetically. |
`-r` | sort in reverse order, z to a or decreasing numbers. |
`-f` | fold uppercase into lowercase (i.e. ignore case). |
`-u` | return a unique representative of repeated items. |
`-k 3` | sort lines based on column 3 (tab or space delimiters). |
`-t ","` | use commas as delimiters. |
`-b` | ignore leading blanks. |
`-d` | sort in dictionary order. |
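A couple of possible examples with the `nba2017-players.csv` file (recall that column 2 holds the teams and column 6 the ages):

# sort the team names, ignoring case
cut -d "," -f 2 nba2017-players.csv | tail -n +2 | sort -f
# sort the data rows by age (column 6), in numerical order
tail -n +2 nba2017-players.csv | sort -t "," -k 6 -n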
37.5 Isolating unique lines with uniq
Another useful command for extracting a subset of values from a file, or summarizing a stream of text, is `uniq`. This command removes consecutive identical lines from a file, leaving one unique representative. More precisely, what `uniq` does is compare each line it reads with the previous line. If the lines are the same, `uniq` does not list the second line. You can use options with `uniq` to get more specific results:
Option | Description |
---|---|
`-c` | adds a count of how many times each line occurred. |
`-u` | lists only lines that are not repeated. |
`-d` | lists only lines that are duplicated. |
`-i` | ignore case when determining uniqueness. |
`-f 4` | ignore the first 4 fields (space delimiter). |
To get a single representative of each unique line from the entire file, in most cases you will need to first sort the lines with the `sort` command to group matching lines together. Interestingly, `uniq` can be used with the flag `-c` to count the number of occurrences of a line. This gives a quick way, for example, to assess the frequencies of values in a given column.
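To see why sorting matters, compare the two pipelines below on the positions column (without `sort`, `uniq` only collapses adjacent duplicates):

# adjacent duplicates only: not the true frequencies
cut -d "," -f 3 nba2017-players.csv | tail -n +2 | uniq -c
# sort first to group matching lines, then count: true frequencies
cut -d "," -f 3 nba2017-players.csv | tail -n +2 | sort | uniq -c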