DaleSchool

Text Processing

Beginner · 25 min

Learning Objectives

  • Process column-based text with awk
  • Chain multiple tools with pipes to analyze data
  • Practice log file analysis scenarios

Working Code

Example 1: Basic text analysis with grep and wc

Use pipes to process data step by step:

# Count items in notes.md
grep "item" Documents/notes.md | wc -l

Output:

       2

# Find lines matching a specific pattern
cat Documents/notes.md | grep "^-"

Output:

- item 1
- item 2
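grep's -c flag counts matching lines by itself, so the grep | wc -l pipe above can be collapsed into a single command:

```shell
# Equivalent to: grep "item" Documents/notes.md | wc -l
grep -c "item" Documents/notes.md
# Output: 2
```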

Example 2: Combining head and tail

# Check both the beginning and end of a file
head -2 Documents/notes.md
echo "---"
tail -1 Documents/notes.md

Output:

# Notes
- item 1
---
- item 2

Example 3: Saving search results to a file

# Save grep results to a file
grep "item" Documents/notes.md > found.txt
cat found.txt

Output:

- item 1
- item 2

# Compare statistics across files with wc
wc -l Documents/hello.txt Documents/notes.md

Output:

       1 Documents/hello.txt
       3 Documents/notes.md
       4 total
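wc counts more than lines: -w reports words and -c reports bytes. A small self-contained sketch:

```shell
# -l counts lines, -w counts words, -c counts bytes
echo "hello world" > sample.txt
wc -w sample.txt   # 2 words
wc -c sample.txt   # 12 bytes (11 characters plus the trailing newline)
```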

Try It Yourself

awk: Column-Based Text Processing

awk is a mini language designed for text processing. It excels at handling delimited data like CSV files.

Basic structure:

awk 'pattern { action }' file

Example: Basic column output

# Create a simple CSV
echo "name,score,grade" > scores.csv
echo "Alice,85,B" >> scores.csv
echo "Bob,92,A" >> scores.csv
echo "Carol,78,C" >> scores.csv
echo "Dave,65,D" >> scores.csv
# Print only the first column (-F',' sets comma as delimiter)
awk -F',' '{ print $1 }' scores.csv

Output:

name
Alice
Bob
Carol
Dave

# Print name and score only
awk -F',' '{ print $1, $2 }' scores.csv

Output:

name score
Alice 85
Bob 92
Carol 78
Dave 65

Conditional filtering

# Print rows where column 2 (score) is 80 or above
awk -F',' '$2 >= 80' scores.csv

Output:

name,score,grade
Alice,85,B
Bob,92,A

# Skip the header and filter by score (NR: line number)
awk -F',' 'NR > 1 && $2 >= 80' scores.csv

Output:

Alice,85,B
Bob,92,A

Aggregation: BEGIN and END

# Calculate the average
awk -F',' 'NR > 1 { sum += $2; count++ } END { print "Average:", sum/count }' scores.csv

Output:

Average: 80

# Pipe into awk
cat scores.csv | awk -F',' 'NR > 1 { print $1, $2 }'

Output:

Alice 85
Bob 92
Carol 78
Dave 65

"Why?" — Why You Need Text Processing Tools

Server logs, CSV data, config files — most data is text. By chaining these tools with pipes, you can analyze data quickly without spreadsheets or dedicated software.

Real-world scenario: Log analysis

Imagine you have an access log:

2024-01-15 10:30:01 INFO User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
2024-01-15 10:31:05 ERROR File save failed

# Filter ERROR logs only
grep "ERROR" app.log

# Count ERRORs
grep -c "ERROR" app.log

# Extract errors from a specific time window
grep "10:30" app.log | grep "ERROR"

# Save ERROR messages to a file
grep "ERROR" app.log > errors.txt
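The same filters can also be written in awk, treating the space-separated log as columns. A sketch, assuming the sample above is saved as app.log:

```shell
# $3 is the log level column; count ERROR lines without grep
awk '$3 == "ERROR" { count++ } END { print count }' app.log
# Output: 2

# Print only the date and time columns of ERROR lines
awk '$3 == "ERROR" { print $1, $2 }' app.log
```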

awk Key Concepts

| Concept | Description | Example |
| --------- | -------------------------- | ------------------------- |
| $0 | Entire line | print $0 |
| $1, $2... | Each column | print $1, $3 |
| NR | Line number | NR > 1 (skip header) |
| NF | Number of fields (columns) | print NF |
| -F | Field delimiter | -F',', -F'\t' |
| BEGIN | Runs before processing | BEGIN { print "Start" } |
| END | Runs after processing | END { print sum } |
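Two entries from the table, NF and BEGIN, have not appeared in an example yet. A minimal sketch, assuming the scores.csv file created earlier:

```shell
# NF is the field count; $NF refers to the last field of each line
awk -F',' '{ print NF, $NF }' scores.csv
# Output: "3 grade", then "3 B", "3 A", "3 C", "3 D"

# BEGIN runs once before any input is read; END once after the last line
awk -F',' 'BEGIN { print "-- report --" } NR > 1 { total++ } END { print total, "students" }' scores.csv
# Output: "-- report --" then "4 students"
```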

Common Mistakes

Mistake 1: awk column numbers start at 1

# Wrong: $0 is not the first column
awk -F',' '{ print $0 }' file.csv   # prints the entire line

# Correct
awk -F',' '{ print $1 }' file.csv   # prints the first column

Mistake 2: Forgetting to specify the delimiter

# Processing a comma-delimited CSV with default (space) delimiter
awk '{ print $1 }' scores.csv
# Treats "name,score,grade" as one field

# Correct
awk -F',' '{ print $1 }' scores.csv

Mistake 3: Not verifying intermediate pipe output

# Build pipes step by step
cat scores.csv | head -3            # check step 1
cat scores.csv | head -3 | grep "8" # check step 2

Build complex pipes incrementally, verifying the output at each step.
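For instance, to count the high scorers in the scores.csv file from earlier, the chain can grow one stage at a time:

```shell
# Step 1: drop the header row
grep -v "^name" scores.csv

# Step 2: keep scores of 80 or above
grep -v "^name" scores.csv | awk -F',' '$2 >= 80'

# Step 3: count what survives
grep -v "^name" scores.csv | awk -F',' '$2 >= 80' | wc -l
# Output: 2
```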

Deep Dive

awk: Pattern matching and field operations

# Process only rows matching a pattern
awk -F',' '/A/ { print $1, "excellent" }' scores.csv

# Field arithmetic
awk -F',' 'NR > 1 { print $1, $2 * 1.1, "adjusted" }' scores.csv

# Multiple conditions
awk -F',' 'NR > 1 && $2 >= 80 && $3 == "A"' scores.csv

# Formatted output
awk -F',' 'NR > 1 { printf "%-10s %3d pts\n", $1, $2 }' scores.csv

sed: The stream editor

sed is a tool for transforming text:

# Text substitution (s/original/replacement/)
echo "Hello World" | sed 's/World/Terminal/'

# Global substitution (g flag)
echo "aaa bbb aaa" | sed 's/aaa/xxx/g'

# Delete a specific line
cat file.txt | sed '2d'

# Add line numbers
cat file.txt | sed '='

sed doesn't modify the file itself — it outputs to stdout. To edit in place, use the -i option.
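A minimal in-place sketch; the .bak suffix tells sed to keep a backup copy, which also makes the command portable between GNU sed and BSD/macOS sed:

```shell
# Edit the file directly, keeping the original as greeting.txt.bak
printf 'Hello World\n' > greeting.txt
sed -i.bak 's/World/Terminal/' greeting.txt
cat greeting.txt       # Hello Terminal
cat greeting.txt.bak   # Hello World
```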

Error handling in pipe chains

If a command in the middle of a pipe fails, results can be unexpected:

# set -o pipefail: treat the whole pipe as failed if any part fails
set -o pipefail

# Save intermediate results to variables for debugging
result=$(cat file.txt | grep "pattern")
echo "Result: $result"
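A quick demonstration of what pipefail changes (this is a bash feature; plain POSIX sh may not support the option):

```shell
# By default a pipe's exit status is that of its LAST command
false | true
echo "default: $?"     # default: 0 — the failure of false is hidden

# With pipefail, any failing stage fails the whole pipe
set -o pipefail
false | true
echo "pipefail: $?"    # pipefail: 1
```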

Try these exercises to practice:

  1. Create a file: echo "a,1" > data.csv, echo "b,2" >> data.csv, echo "c,3" >> data.csv.
  2. Print the first column with awk -F',' '{ print $1 }' data.csv.
  3. Filter rows where column 2 is 2 or above: awk -F',' '$2 >= 2' data.csv.
  4. Sum the second column: awk -F',' '{ sum += $2 } END { print sum }' data.csv.
  5. Count items with cat Documents/notes.md | grep "item" | wc -l.

Q1. In awk -F',' '$2 >= 80' scores.csv, what does -F',' do?

  • A) Filters values greater than 80
  • B) Sets the field delimiter to a comma
  • C) Specifies the file format as CSV
  • D) Specifies a second file