Extracting Href Values from HTML and Saving to CSV

This document combines a Python script (using BeautifulSoup) and a Bash script (using curl) for extracting href values from HTML and saving them to a CSV file.


Using Python and BeautifulSoup

Install Required Libraries

Make sure to install the necessary Python libraries:

pip install requests beautifulsoup4

Python Script

import requests
from bs4 import BeautifulSoup
import csv

# Replace the URL with the desired web page
url = "https://example.com"

# Fetch HTML content from the URL
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed
html_data = response.text

# Parse the HTML data
soup = BeautifulSoup(html_data, 'html.parser')

# Find all 'a' tags and extract href values
href_values = [a['href'] for a in soup.find_all('a', href=True)]

# Write the values to a CSV file
with open('output.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Href'])

    for href in href_values:
        csv_writer.writerow([href])

Save this Python script in a file, for example, extract_href.py, and run it using:

python extract_href.py

This script will create a CSV file named 'output.csv' containing the href values.
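Pages often use relative links (e.g. /about), which are not directly usable outside the page. If you need absolute URLs, the extracted hrefs can be resolved against the page URL with urljoin from the standard library. A minimal sketch, using hypothetical example values in place of the list produced by the script above:

from urllib.parse import urljoin

# Hypothetical values; in the script above, base_url is `url` and
# hrefs is `href_values`
base_url = "https://example.com"
hrefs = ["/about", "docs/index.html", "https://other.example/page"]

# Resolve each href against the page URL; absolute URLs pass through unchanged
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)

Applying this before the CSV-writing step makes the output usable by downstream tools that expect full URLs.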

Using curl and Command-line Tools (Bash)

Bash Script

url="https://example.com"  # Replace this with your URL

# Fetch HTML content from the URL using curl
html_data=$(curl -s "$url")

# Extract href values using grep and sed (handles single- and double-quoted attributes)
href_values=$(echo "$html_data" | grep -o '<a [^>]*href=["'\''][^"'\'']*["'\'']' | sed -e 's/.*href=.//' -e 's/.$//')

# Print href values
echo "$href_values"

# Save href values to a CSV file
echo "$href_values" | awk 'BEGIN {print "Href"} {print}' > output.csv

Save this Bash script in a file, for example, extract_href.sh, and run it using:

bash extract_href.sh

This script will also create a CSV file named 'output.csv' with the extracted href values.
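Note that the grep pattern above only matches <a> tags whose href appears in the matched form; real-world HTML with unusual spacing or unquoted attributes can slip through. If you cannot install BeautifulSoup but want something sturdier than a regex, Python's standard-library html.parser can collect hrefs with no third-party dependencies. A sketch, shown here against an inline HTML snippet rather than a fetched page:

from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already stripped
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<p><a href="/one">one</a> <a href=\'/two\'>two</a></p>')
print(collector.hrefs)  # ['/one', '/two']

The parser normalizes both quote styles, so the same collector works on pages where the Bash pipeline would need extra cases.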