DaleSchool

Advanced awk and sed

Beginner · 30 min

Learning Objectives

  • Use awk's pattern/action structure for data processing
  • Leverage awk's BEGIN, END, NR, and NF
  • Search and replace text with sed
  • Build real-world log analysis pipelines

Working Code

Example 1: awk pattern/action basics

# Create a sample log file
cat > app.log << 'EOF'
2024-01-15 10:30:01 INFO User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
2024-01-15 10:31:05 ERROR File save failed: permission denied
2024-01-15 10:32:00 INFO User logout: user001
EOF

Basic awk usage:

# Print only column 3 (log level)
awk '{ print $3 }' app.log

Output:

INFO
ERROR
INFO
WARN
ERROR
INFO

# Print only ERROR lines
awk '$3 == "ERROR"' app.log

Output:

2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:31:05 ERROR File save failed: permission denied
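A pattern and an action also compose: the pattern selects the lines, the action decides what to print from them. A minimal sketch (recreating the sample file so the snippet stands alone):

```shell
# Recreate the sample log from Example 1
cat > app.log << 'EOF'
2024-01-15 10:30:01 INFO User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
2024-01-15 10:31:05 ERROR File save failed: permission denied
2024-01-15 10:32:00 INFO User logout: user001
EOF

# Pattern AND action together: print only the time of each ERROR line
awk '$3 == "ERROR" { print $2 }' app.log
# 10:30:05
# 10:31:05
```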

Example 2: awk BEGIN and END

# Print header at start, summary at end
awk '
  BEGIN { print "=== Error List ===" }
  $3 == "ERROR" { print NR": "$0 }
  END { print "=== Search Complete ===" }
' app.log

Output:

=== Error List ===
2: 2024-01-15 10:30:05 ERROR Database connection failed
5: 2024-01-15 10:31:05 ERROR File save failed: permission denied
=== Search Complete ===
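The END block is where per-line work gets summarized. A common sketch is counting matches in the body and reporting once at the end (awk variables start out as 0/empty, so no initialization is needed):

```shell
# Recreate the sample log from Example 1
cat > app.log << 'EOF'
2024-01-15 10:30:01 INFO User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
2024-01-15 10:31:05 ERROR File save failed: permission denied
2024-01-15 10:32:00 INFO User logout: user001
EOF

# Count matches per line, report once after the last line
awk '$3 == "ERROR" { count++ } END { print "Errors:", count }' app.log
# Errors: 2
```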

Example 3: sed text substitution

echo "Hello World" | sed 's/World/Terminal/'

Output:

Hello Terminal

# Substitute in a file (output only, file unchanged)
sed 's/INFO/INFORMATION/g' app.log | head -3

Output:

2024-01-15 10:30:01 INFORMATION User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFORMATION File upload complete: report.pdf
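sed can also apply several substitutions in one pass, either with one `-e` per expression or with `;` between them. A small sketch:

```shell
# Several substitutions in one pass: -e per expression...
echo "Hello World" | sed -e 's/Hello/Goodbye/' -e 's/World/Terminal/'
# Goodbye Terminal

# ...or ';' between expressions
echo "Hello World" | sed 's/Hello/Goodbye/; s/World/Terminal/'
# Goodbye Terminal
```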

Try It Yourself

awk: Aggregation and Calculations

# Count occurrences by log level
awk '{ count[$3]++ } END { for (level in count) print level":", count[level] }' app.log

Output:

INFO: 3
ERROR: 2
WARN: 1

# CSV data processing
cat > sales.csv << 'EOF'
name,product,amount
Alice,laptop,1200000
Bob,mouse,45000
Carol,keyboard,89000
Dave,laptop,1150000
Eve,monitor,350000
EOF

# Calculate total sales
awk -F',' 'NR > 1 { sum += $3 } END { printf "Total sales: %d\n", sum }' sales.csv

Output:

Total sales: 2834000

# Show only products over 1,000,000
awk -F',' 'NR > 1 && $3 >= 1000000 { print $1, $2, $3 }' sales.csv

Output:

Alice laptop 1200000
Dave laptop 1150000
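The same `count[key]++` idiom from the log example works for sums: key an array by one field and accumulate another. A sketch of per-product subtotals (recreating sales.csv so the snippet stands alone; note that awk's `for (k in arr)` iteration order is unspecified, hence the trailing sort):

```shell
# Recreate sales.csv from above
cat > sales.csv << 'EOF'
name,product,amount
Alice,laptop,1200000
Bob,mouse,45000
Carol,keyboard,89000
Dave,laptop,1150000
Eve,monitor,350000
EOF

# Subtotal per product: one array keyed by field 2
awk -F',' 'NR > 1 { sum[$2] += $3 } END { for (p in sum) print p, sum[p] }' sales.csv | sort
# keyboard 89000
# laptop 2350000
# monitor 350000
# mouse 45000
```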

awk NF and NR in Action

# NF: number of fields in the current line
echo "a b c d" | awk '{ print "Fields:", NF }'
# Fields: 4

# Last field
echo "a b c d" | awk '{ print "Last:", $NF }'
# Last: d

# NR: line number
awk 'NR >= 2 && NR <= 4' app.log

Output (lines 2-4):

2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
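NF is also handy as a data-quality check: since it is recomputed per line, comparing it against the expected field count flags malformed records. A minimal sketch:

```shell
# NF as a sanity check: flag records without the expected 4 fields
printf 'a b c d\nbroken line\nx y z w\n' | awk 'NF != 4 { print "bad record on line " NR ": " $0 }'
# bad record on line 2: broken line
```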

Advanced sed

# Basic substitution: s/find/replace/flags
echo "apple apple apple" | sed 's/apple/banana/'      # first match only
echo "apple apple apple" | sed 's/apple/banana/g'     # all matches (global)
echo "APPLE" | sed 's/apple/banana/i'                 # case-insensitive (GNU extension, not POSIX)

# Substitute only on a specific line (line 2)
sed '2s/ERROR/error/' app.log

# Delete specific lines
sed '1d' app.log           # delete line 1
sed '/ERROR/d' app.log     # delete lines containing ERROR
sed '2,4d' app.log         # delete lines 2-4

# Print only specific lines (-n and p)
sed -n '2,4p' app.log      # print lines 2-4 only
sed -n '/ERROR/p' app.log  # print only ERROR lines

# Prepend text to every line
sed 's/^/[LOG] /' app.log

Editing files in place (-i option):

# Edit in place without backup (be careful!)
sed -i 's/ERROR/error/g' app.log

# Edit with backup (original saved as .bak)
sed -i.bak 's/INFO/info/g' app.log
ls
# app.log  app.log.bak

"Why?" — Real-World Log Analysis Pipelines

Scenario: Apache web server log analysis

Apache access log format:

192.168.1.1 - - [15/Jan/2024:10:30:01 +0000] "GET /index.html HTTP/1.1" 200 1234
192.168.1.2 - - [15/Jan/2024:10:30:05 +0000] "POST /api/login HTTP/1.1" 401 567
10.0.0.1 - - [15/Jan/2024:10:30:10 +0000] "GET /admin HTTP/1.1" 404 234

# Top 10 IPs by request count
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -10

# URLs returning 404
awk '$9 == "404" { print $7 }' access.log

# Requests per hour of day (cut keeps only the hour; drop it to count per date+hour)
awk '{ print substr($4, 2, 14) }' access.log | cut -d: -f2 | sort | uniq -c

# Find IPs to block (over 100 failed auth attempts)
awk '$9 == "401" { print $1 }' access.log | sort | uniq -c | sort -rn | awk '$1 > 100 { print $2 }'
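The same aggregation idiom gives a quick status-code breakdown. A self-contained sketch using a hypothetical three-line sample in standard Apache field positions (field 9 is the status code):

```shell
# Hypothetical sample access log
cat > access.log << 'EOF'
192.168.1.1 - - [15/Jan/2024:10:30:01 +0000] "GET /index.html HTTP/1.1" 200 1234
192.168.1.2 - - [15/Jan/2024:10:30:05 +0000] "POST /api/login HTTP/1.1" 401 567
10.0.0.1 - - [15/Jan/2024:10:30:10 +0000] "GET /admin HTTP/1.1" 404 234
EOF

# Status-code distribution: count[]++ keyed by field 9
awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' access.log | sort
# 200 1
# 401 1
# 404 1
```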

Scenario: CSV data transformation

# Convert CSV to TSV (GNU sed expands \t; with BSD/macOS sed use a literal tab or tr)
sed 's/,/\t/g' data.csv > data.tsv

# Remove header
sed '1d' data.csv > data-no-header.csv

# Blank out a specific column (column 2; note an empty slot between commas remains)
awk -F',' 'BEGIN{OFS=","} { $2=""; print }' data.csv

# Rearrange columns (3, 1, 2 order)
awk -F',' 'BEGIN{OFS=","} { print $3, $1, $2 }' data.csv
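One caveat worth knowing: `-F','` is naive CSV parsing. It splits on every comma, including commas inside quoted fields, as this sketch shows; for real-world CSV with quoting, a CSV-aware tool (or gawk's FPAT feature) is safer:

```shell
# -F',' splits inside quotes too: the quoted name becomes two fields
echo '"Smith, John",laptop,1200' | awk -F',' '{ print NF }'
# 4
```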

Common Mistakes

Mistake 1: Confusing awk and grep roles

# grep: find lines containing a pattern
grep "ERROR" app.log

# awk: find lines where a specific field matches
awk '$3 == "ERROR"' app.log   # only where field 3 is exactly "ERROR"

# The difference: grep searches the entire line, awk matches by field
grep "ERROR" app.log   # also matches "NOT_AN_ERROR"
awk '$3 == "ERROR"' app.log  # only exact match in field 3

Mistake 2: sed -i differences across platforms

# macOS requires '' after -i
sed -i '' 's/old/new/g' file.txt  # macOS
sed -i 's/old/new/g' file.txt     # Linux

# Safe approach: preview on stdout first
sed 's/old/new/g' file.txt | head -5
# If the result looks right, apply with -i
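A slightly stronger preview is diffing the proposed output against the original, so you see exactly which lines would change. A sketch with a hypothetical file.txt:

```shell
# Hypothetical input file
printf 'old value\nkeep this line\n' > file.txt

# Diff the proposed result against the original before using -i
# (diff exits nonzero when the files differ)
sed 's/old/new/g' file.txt | diff file.txt -
# 1c1
# < old value
# ---
# > new value
```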

Mistake 3: String vs. numeric comparison in awk

# String comparison (use ==)
awk '$3 == "ERROR"' app.log   # correct

# Numeric comparison (arithmetic)
awk '$2 > 100' numbers.csv    # correct

# Comparing string-stored numbers — watch out
awk '$2 > "100"' numbers.csv  # lexicographic! "9" > "100"
awk '$2 + 0 > 100' numbers.csv  # force numeric conversion
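The pitfall is easy to demonstrate with a hypothetical two-line file: under string comparison "9" sorts after "100", so a value of 9 wrongly passes the filter.

```shell
printf 'a 9\nb 250\n' > numbers.txt

# String comparison: "9" > "100" lexicographically, so 9 slips through
awk '$2 > "100"' numbers.txt
# a 9
# b 250

# Forced numeric comparison keeps only values actually over 100
awk '$2 + 0 > 100' numbers.txt
# b 250
```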

Deep Dive

awk: Multi-file processing and FILENAME

# Print current filename during multi-file processing
awk '{ print FILENAME, $0 }' *.log

# Per-file statistics
awk '{ count[FILENAME]++ } END { for (f in count) print f, count[f] }' *.log

# FNR: line number within current file (NR is global line number)
awk '{ print FILENAME, FNR, $0 }' file1.txt file2.txt
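NR and FNR combine into the classic two-file idiom: `NR == FNR` is true only while reading the first file, so you can load a lookup table there and apply it to the second file. A sketch with hypothetical users.txt/events.txt:

```shell
# Hypothetical lookup data: user IDs to names, plus an event log
printf 'user001 Alice\nuser002 Bob\n' > users.txt
printf 'user001 login\nuser003 login\n' > events.txt

# First file: build the table, then skip to the next line with 'next'
# Second file: annotate each event with the looked-up name
awk 'NR == FNR { name[$1] = $2; next }
     { print $0, ($1 in name ? name[$1] : "unknown") }' users.txt events.txt
# user001 login Alice
# user003 login unknown
```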

sed with regular expressions

# Convert date format (YYYY-MM-DD -> DD/MM/YYYY)
echo "2024-01-15" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# 15/01/2024
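The same conversion reads more cleanly with extended regular expressions (`-E`, available in GNU and BSD sed, though very old implementations may only offer basic regex):

```shell
# Same date conversion with -E: groups need no backslashes
echo "2024-01-15" | sed -E 's/([0-9]+)-([0-9]+)-([0-9]+)/\3\/\2\/\1/'
# 15/01/2024
```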

# Remove blank lines
sed '/^$/d' file.txt

# Remove comment lines (starting with #)
sed '/^#/d' config.txt

# Trim leading/trailing whitespace
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt

# Strip HTML tags
sed 's/<[^>]*>//g' page.html

Real-world: Config file auto-update script

#!/bin/bash
set -euo pipefail

CONFIG="app.conf"

# Update a config value
update_config() {
  local key="$1"
  local value="$2"
  # Replace if exists, append if not
  # (caution: the s/// breaks if the value contains '/' or '&';
  #  pick a different delimiter, e.g. s|...|...|, for such values)
  if grep -q "^${key}=" "$CONFIG"; then
    sed -i.bak "s/^${key}=.*/${key}=${value}/" "$CONFIG"
  else
    echo "${key}=${value}" >> "$CONFIG"
  fi
}

update_config "DB_HOST" "localhost"
update_config "DB_PORT" "5432"
update_config "DEBUG" "false"

echo "Config updated"

  1. Create a log file and extract only error lines with awk '$3 == "ERROR"' filename.
  2. Count occurrences by level with awk '{ count[$3]++ } END { for (k in count) print k, count[k] }' filename.
  3. Substitute text with sed 's/INFO/INFORMATION/g' filename.
  4. Print only ERROR lines with sed -n '/ERROR/p' filename.
  5. Create a CSV file and sum the second column with awk -F',' 'NR > 1 { sum += $2 } END { print "Total:", sum }' filename.

Q1. In awk 'BEGIN { print "Start" } $3 == "ERROR" { count++ } END { print "Errors:", count }' log.txt, when does each block execute?

  • A) BEGIN=every line, $3=="ERROR"=file start, END=file end
  • B) BEGIN=before processing, $3=="ERROR"=on each matching line, END=after processing
  • C) BEGIN=first condition, $3=="ERROR"=file start, END=file end
  • D) BEGIN=always, $3=="ERROR"=in ERROR files, END=always