Working Code
Example 1: awk pattern/action basics
# Create a sample log file
cat > app.log << 'EOF'
2024-01-15 10:30:01 INFO User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
2024-01-15 10:31:05 ERROR File save failed: permission denied
2024-01-15 10:32:00 INFO User logout: user001
EOF
Basic awk usage:
# Print only column 3 (log level)
awk '{ print $3 }' app.log
Output:
INFO
ERROR
INFO
WARN
ERROR
INFO
# Print only ERROR lines
awk '$3 == "ERROR"' app.log
Output:
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:31:05 ERROR File save failed: permission denied
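A sketch combining both ideas above — a field test plus an action that re-joins fields 4 through NF into the message (sample lines are piped in so the snippet stands alone):

```shell
# Field test + action: print the time ($2) and the message (fields 4..NF)
# of each ERROR line. Sample input is inlined for a self-contained demo.
printf '%s\n' \
  '2024-01-15 10:30:05 ERROR Database connection failed' \
  '2024-01-15 10:32:00 INFO User logout: user001' |
awk '$3 == "ERROR" { msg = $4; for (i = 5; i <= NF; i++) msg = msg " " $i; print $2, msg }'
# 10:30:05 Database connection failed
```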
Example 2: awk BEGIN and END
# Print header at start, summary at end
awk '
BEGIN { print "=== Error List ===" }
$3 == "ERROR" { print NR": "$0 }
END { print "=== Search Complete ===" }
' app.log
Output:
=== Error List ===
2: 2024-01-15 10:30:05 ERROR Database connection failed
5: 2024-01-15 10:31:05 ERROR File save failed: permission denied
=== Search Complete ===
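The same three-block structure can also carry state between blocks: increment a counter in the pattern's action and report it in END (a sketch with sample lines piped in so it runs on its own):

```shell
# Count matching lines in the action block, report the total in END.
printf '%s\n' \
  '2024-01-15 10:30:05 ERROR Database connection failed' \
  '2024-01-15 10:30:10 INFO File upload complete' \
  '2024-01-15 10:31:05 ERROR File save failed' |
awk '$3 == "ERROR" { count++ } END { print "Errors found:", count }'
# Errors found: 2
```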
Example 3: sed text substitution
echo "Hello World" | sed 's/World/Terminal/'
Output:
Hello Terminal
# Substitute in a file (output only, file unchanged)
sed 's/INFO/INFORMATION/g' app.log | head -3
Output:
2024-01-15 10:30:01 INFORMATION User login: user001
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFORMATION File upload complete: report.pdf
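Several substitutions can run in a single sed pass, either separated by `;` or given as repeated -e expressions — both forms are standard sed:

```shell
# Two substitutions applied in one pass over the input.
echo "Hello World" | sed 's/Hello/Hi/; s/World/Terminal/'
# Hi Terminal
echo "Hello World" | sed -e 's/Hello/Hi/' -e 's/World/Terminal/'
# Hi Terminal
```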
Try It Yourself
awk: Aggregation and Calculations
# Count occurrences by log level
awk '{ count[$3]++ } END { for (level in count) print level":", count[level] }' app.log
Output (the order of levels may vary — awk's for-in traversal order is unspecified):
INFO: 3
ERROR: 2
WARN: 1
# CSV data processing
cat > sales.csv << 'EOF'
name,product,amount
Alice,laptop,1200000
Bob,mouse,45000
Carol,keyboard,89000
Dave,laptop,1150000
Eve,monitor,350000
EOF
# Calculate total sales
awk -F',' 'NR > 1 { sum += $3 } END { printf "Total sales: %d\n", sum }' sales.csv
Output:
Total sales: 2834000
# Show only products over 1,000,000
awk -F',' 'NR > 1 && $3 >= 1000000 { print $1, $2, $3 }' sales.csv
Output:
Alice laptop 1200000
Dave laptop 1150000
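The running-sum idiom extends naturally to averages — track a count alongside the sum (two CSV rows are inlined here so the snippet runs on its own):

```shell
# Average of column 3, skipping the header row (NR > 1).
printf '%s\n' 'name,product,amount' 'Alice,laptop,1200000' 'Bob,mouse,45000' |
awk -F',' 'NR > 1 { sum += $3; n++ } END { printf "Average: %d\n", sum / n }'
# Average: 622500
```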
awk NF and NR in Action
# NF: number of fields in the current line
echo "a b c d" | awk '{ print "Fields:", NF }'
# Fields: 4
# Last field
echo "a b c d" | awk '{ print "Last:", $NF }'
# Last: d
# NR: line number
awk 'NR >= 2 && NR <= 4' app.log
Output (lines 2-4):
2024-01-15 10:30:05 ERROR Database connection failed
2024-01-15 10:30:10 INFO File upload complete: report.pdf
2024-01-15 10:31:00 WARN Memory usage exceeds 80%
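NF also doubles as a validator: flag any line whose field count is not what the format promises (four whitespace-separated columns assumed in this sketch):

```shell
# Report lines that do not have exactly 4 fields.
printf '%s\n' 'a b c d' 'a b' 'x y z w' |
awk 'NF != 4 { print "Line " NR " has " NF " fields: " $0 }'
# Line 2 has 2 fields: a b
```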
Advanced sed
# Basic substitution: s/find/replace/flags
echo "apple apple apple" | sed 's/apple/banana/' # first match only
echo "apple apple apple" | sed 's/apple/banana/g' # all matches (global)
echo "APPLE" | sed 's/apple/banana/i' # case-insensitive (GNU sed extension; may not work on BSD/macOS sed)
# Substitute only on a specific line (line 2)
sed '2s/ERROR/error/' app.log
# Delete specific lines
sed '1d' app.log # delete line 1
sed '/ERROR/d' app.log # delete lines containing ERROR
sed '2,4d' app.log # delete lines 2-4
# Print only specific lines (-n and p)
sed -n '2,4p' app.log # print lines 2-4 only
sed -n '/ERROR/p' app.log # print only ERROR lines
# Prepend text to every line
sed 's/^/[LOG] /' app.log
Editing files in place (-i option):
# Edit in place without backup (be careful!)
sed -i 's/ERROR/error/g' app.log
# Edit with backup (original saved as .bak)
sed -i.bak 's/INFO/info/g' app.log
ls
# app.log app.log.bak
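A self-contained sketch of the backup workflow — edit, inspect, and keep the original around until you are sure (demo.log is a throwaway file created for the demo):

```shell
# Edit in place with a backup, then verify before deleting the backup.
printf 'INFO start\nERROR boom\n' > demo.log
sed -i.bak 's/ERROR/error/' demo.log
cat demo.log
# INFO start
# error boom
# The untouched original survives in demo.log.bak; a bad edit is
# reversible with: mv demo.log.bak demo.log
rm -f demo.log demo.log.bak
```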
"Why?" — Real-World Log Analysis Pipelines
Scenario: Apache web server log analysis
Apache access log format:
192.168.1.1 - - [15/Jan/2024:10:30:01 +0000] "GET /index.html HTTP/1.1" 200 1234
192.168.1.2 - - [15/Jan/2024:10:30:05 +0000] "POST /api/login HTTP/1.1" 401 567
10.0.0.1 - - [15/Jan/2024:10:30:10 +0000] "GET /admin HTTP/1.1" 404 234
# Top 10 IPs by request count
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -10
# URLs returning 404
awk '$9 == "404" { print $7 }' access.log
# Requests per hour
awk '{ print substr($4, 2, 14) }' access.log | cut -d: -f2 | sort | uniq -c
# Find IPs to block (over 100 failed auth attempts)
awk '$9 == "401" { print $1 }' access.log | sort | uniq -c | sort -rn | awk '$1 > 100 { print $2 }'
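The top-IPs pipeline in miniature, with three fake requests piped in (only field 1, the client IP, matters to the pipeline):

```shell
# Count requests per IP and rank them; uniq -c needs sorted input.
printf '%s\n' '10.0.0.1 - - req' '10.0.0.2 - - req' '10.0.0.1 - - req' |
awk '{ print $1 }' | sort | uniq -c | sort -rn | head -10
```

uniq -c left-pads its counts, so the exact column widths in the output differ between platforms.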
Scenario: CSV data transformation
# Convert CSV to TSV (GNU sed understands \t; tr ',' '\t' is a portable alternative)
sed 's/,/\t/g' data.csv > data.tsv
# Remove header
sed '1d' data.csv > data-no-header.csv
# Blank out a specific column (column 2; an empty field remains between the commas)
awk -F',' 'BEGIN{OFS=","} { $2=""; print }' data.csv
# Rearrange columns (3, 1, 2 order)
awk -F',' 'BEGIN{OFS=","} { print $3, $1, $2 }' data.csv
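One caveat worth knowing: assigning $2="" rebuilds the record but keeps the empty slot, so a double comma appears; printing only the wanted fields drops the column cleanly:

```shell
# $2="" leaves an empty field between commas...
printf 'a,b,c\n' | awk -F',' 'BEGIN{OFS=","} { $2=""; print }'
# a,,c
# ...whereas printing only the kept fields removes the column entirely.
printf 'a,b,c\n' | awk -F',' 'BEGIN{OFS=","} { print $1, $3 }'
# a,c
```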
Common Mistakes
Mistake 1: Confusing awk and grep roles
# grep: find lines containing a pattern
grep "ERROR" app.log
# awk: find lines where a specific field matches
awk '$3 == "ERROR"' app.log # only where field 3 is exactly "ERROR"
# The difference: grep searches the entire line, awk matches by field
grep "ERROR" app.log # also matches "NOT_AN_ERROR"
awk '$3 == "ERROR"' app.log # only exact match in field 3
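There is also a middle ground between the two: awk's ~ operator applies a regex to a single field, so anchoring it gives an exact match that stays field-scoped:

```shell
# Regex match restricted to field 3; the ^ and $ anchors make it exact,
# so NOT_AN_ERROR in field 3 would not match.
printf '%s\n' 'a b ERROR x' 'a b NOT_AN_ERROR x' |
awk '$3 ~ /^ERROR$/'
# a b ERROR x
```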
Mistake 2: sed -i differences across platforms
# macOS requires '' after -i
sed -i '' 's/old/new/g' file.txt # macOS
sed -i 's/old/new/g' file.txt # Linux
# Safe approach: preview on stdout first
sed 's/old/new/g' file.txt | head -5
# If the result looks right, apply with -i
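A portable way to sidestep the -i difference entirely is the temp-file-and-rename pattern, which behaves identically everywhere (f.txt is a throwaway file for the demo):

```shell
# Works the same with GNU and BSD sed: write to a temp file, then rename.
printf 'old value\n' > f.txt
sed 's/old/new/g' f.txt > f.txt.tmp && mv f.txt.tmp f.txt
cat f.txt
# new value
rm -f f.txt
```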
Mistake 3: String vs. numeric comparison in awk
# String comparison (use ==)
awk '$3 == "ERROR"' app.log # correct
# Numeric comparison (arithmetic)
awk '$2 > 100' numbers.csv # correct
# Comparing string-stored numbers — watch out
awk '$2 > "100"' numbers.csv # lexicographic! "9" > "100"
awk '$2 + 0 > 100' numbers.csv # force numeric conversion
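A two-line demonstration of the trap — the same input compared against a string constant versus a numeric one gives opposite answers:

```shell
# String constant forces a string compare; a numeric constant forces numeric.
printf '9\n100\n' | awk '$1 > "50"'   # lexicographic: "9" sorts after "50"
# 9
printf '9\n100\n' | awk '$1 > 50'     # numeric: only 100 exceeds 50
# 100
```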
Deep Dive
awk: Multi-file processing and FILENAME
# Print current filename during multi-file processing
awk '{ print FILENAME, $0 }' *.log
# Per-file statistics
awk '{ count[FILENAME]++ } END { for (f in count) print f, count[f] }' *.log
# FNR: line number within current file (NR is global line number)
awk '{ print FILENAME, FNR, $0 }' file1.txt file2.txt
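FNR == NR is the classic two-file idiom built on this distinction: it holds only while the first file is being read, so you can load a lookup table from it and join against the second file (names.txt and scores.txt are hypothetical files created for the sketch):

```shell
# Pass 1 (FNR == NR): remember id -> name. Pass 2: join scores on the id.
printf '1 alice\n2 bob\n' > names.txt
printf '2 500\n1 300\n' > scores.txt
awk 'FNR == NR { name[$1] = $2; next } { print name[$1], $2 }' names.txt scores.txt
# bob 500
# alice 300
rm -f names.txt scores.txt
```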
sed with regular expressions
# Convert date format (YYYY-MM-DD -> DD/MM/YYYY)
echo "2024-01-15" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# 15/01/2024
# Remove blank lines
sed '/^$/d' file.txt
# Remove comment lines (starting with #)
sed '/^#/d' config.txt
# Trim leading/trailing whitespace
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt
# Strip HTML tags
sed 's/<[^>]*>//g' page.html
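With -E (extended regular expressions, supported by both GNU and BSD sed) the date conversion above needs no backslash-escaped groups:

```shell
# Same YYYY-MM-DD -> DD/MM/YYYY conversion, with unescaped groups and +.
echo "2024-01-15" | sed -E 's/([0-9]+)-([0-9]+)-([0-9]+)/\3\/\2\/\1/'
# 15/01/2024
```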
Real-world: Config file auto-update script
#!/bin/bash
set -euo pipefail
CONFIG="app.conf"
# Update a config value
update_config() {
    local key="$1"
    local value="$2"
    # Replace if exists, append if not
    if grep -q "^${key}=" "$CONFIG"; then
        sed -i.bak "s/^${key}=.*/${key}=${value}/" "$CONFIG"
    else
        echo "${key}=${value}" >> "$CONFIG"
    fi
}
update_config "DB_HOST" "localhost"
update_config "DB_PORT" "5432"
update_config "DEBUG" "false"
echo "Config updated"
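A usage sketch of the replace-or-append logic, recreating the function inline so the snippet runs standalone: starting from a one-line app.conf, the first call replaces DB_HOST and the second appends a missing key.

```shell
# Replace-or-append in action (function body copied from the script above).
printf 'DB_HOST=127.0.0.1\n' > app.conf
CONFIG="app.conf"
update_config() {
  local key="$1" value="$2"
  if grep -q "^${key}=" "$CONFIG"; then
    sed -i.bak "s/^${key}=.*/${key}=${value}/" "$CONFIG"
  else
    echo "${key}=${value}" >> "$CONFIG"
  fi
}
update_config "DB_HOST" "localhost"   # key exists -> replaced
update_config "DB_PORT" "5432"        # key missing -> appended
cat app.conf
# DB_HOST=localhost
# DB_PORT=5432
rm -f app.conf app.conf.bak
```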
- Create a log file and extract only error lines with awk '$3 == "ERROR"' filename.
- Count occurrences by level with awk '{ count[$3]++ } END { for (k in count) print k, count[k] }' filename.
- Substitute text with sed 's/INFO/INFORMATION/g' filename.
- Print only ERROR lines with sed -n '/ERROR/p' filename.
- Create a CSV file and sum the second column with awk -F',' 'NR > 1 { sum += $2 } END { print "Total:", sum }' filename.
Q1. In awk 'BEGIN { print "Start" } $3 == "ERROR" { count++ } END { print "Errors:", count }' log.txt, when does each block execute?
- A) BEGIN=every line, $3=="ERROR"=file start, END=file end
- B) BEGIN=before processing, $3=="ERROR"=on each matching line, END=after processing
- C) BEGIN=first condition, $3=="ERROR"=file start, END=file end
- D) BEGIN=always, $3=="ERROR"=in ERROR files, END=always