This document combines a Python script using BeautifulSoup and a Bash script using curl for extracting href values from an HTML page and saving them to a CSV file.
Using Python and BeautifulSoup
Install Required Libraries
Make sure to install the necessary Python libraries:
pip install requests
pip install beautifulsoup4
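To confirm both libraries installed correctly, you can optionally run a quick import check (bs4 is the import name for beautifulsoup4):

python -c "import requests, bs4"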
Python Script
import requests
from bs4 import BeautifulSoup
import csv
# Replace the URL with the desired web page
url = "https://example.com"
# Fetch HTML content from the URL
response = requests.get(url)
response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
html_data = response.text
# Parse the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find all 'a' tags and extract href values
href_values = [a['href'] for a in soup.find_all('a', href=True)]
# Write the values to a CSV file
with open('output.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Href'])
    for href in href_values:
        csv_writer.writerow([href])
Save this Python script in a file, for example, extract_href.py, and run it using:
python extract_href.py
This script will create a CSV file named 'output.csv' containing the href values.
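On many pages the extracted hrefs are site-relative (for example, /about). If you need absolute URLs, a small variation of the script above can resolve them with urllib.parse.urljoin from the standard library. The snippet below is a sketch that reuses the same url and output filename as the script above:

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com"  # Replace with the desired web page

response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# urljoin leaves absolute hrefs untouched and resolves relative ones
# against the page URL
href_values = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]

with open('output.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Href'])
    for href in href_values:
        csv_writer.writerow([href])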
Using curl and Command-line Tools (Bash)
Bash Script
url="https://example.com" # Replace this with your URL
# Fetch HTML content from the URL using curl
html_data=$(curl -s "$url")
# Extract href values: grep isolates each anchor tag's href attribute, and
# sed strips the surrounding markup (handles both double- and single-quoted values)
href_values=$(echo "$html_data" | grep -o '<a [^>]*href=["'\''][^"'\'']*["'\'']' | sed -E 's/.*href=["'\'']//; s/["'\'']$//')
# Print href values
echo "$href_values"
# Save href values to a CSV file (header row, then one URL per line)
echo "$href_values" | awk 'BEGIN {print "Href"} {print $0}' > output.csv
Save this Bash script in a file, for example, extract_href.sh, and run it using:
bash extract_href.sh
This script will also create a CSV file named 'output.csv' with the extracted href values.
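Note that the grep/sed pipeline only matches attributes written in the form the pattern expects, so the Python/BeautifulSoup version is the more robust choice for messy HTML. If you stay with the Bash approach, a common follow-up step is turning site-relative paths into absolute URLs and dropping duplicates. The sketch below assumes the output.csv produced above; the base variable and the output_absolute.csv filename are illustrative placeholders:

#!/usr/bin/env bash
# Post-processing sketch: assumes output.csv from the script above exists.
base="https://example.com"  # Replace with the site you fetched

{
  echo "Href"
  # Skip the header row, prefix site-relative paths with the base URL,
  # and remove duplicate rows
  tail -n +2 output.csv | sed -E "s|^/|$base/|" | sort -u
} > output_absolute.csv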