Python for SEO: Automate the Boring Stuff (Even If You're Not a Developer)
You spend 4 hours every week manually checking rankings, pulling reports, and copying data between spreadsheets. That's 200+ hours per year doing tasks a simple Python script could handle in minutes.
Most SEO professionals know they should automate. They've heard about Python. They've seen the impressive case studies. But they hit the same wall: "I'm not a developer. Where do I even start?"
Here's the truth: you don't need to become a programmer. You need to copy, paste, and modify scripts that solve specific problems. With AI assistants like ChatGPT, you can generate working code by describing what you want in plain English. The barrier has never been lower.
In this guide, you'll learn practical Python automation for common SEO tasks, even if you've never written a line of code. We'll cover automating rank tracking, finding broken links at scale, pulling Search Console data, keyword clustering, and competitor analysis. Each section includes copy-paste scripts you can use today.
The SEOs who automate win. The ones who don't spend their hours on work a machine should be doing.
Why Python Is the SEO Automation Language of Choice
Every programming language can automate tasks. But Python dominates SEO for specific reasons.
Low Learning Curve
Python reads almost like English. Compare this Python code to pull a webpage:
import requests
response = requests.get("https://example.com")
print(response.status_code)
Even if you've never coded, you can guess what this does. Request a URL, get the response, print the status code. That readability matters when you're learning.
SEO-Specific Libraries
Python has pre-built tools for everything SEOs need:
| Library | What It Does |
|---|---|
| requests | Fetch webpages and API data |
| beautifulsoup4 | Parse HTML and extract elements |
| pandas | Manipulate data like Excel on steroids |
| advertools | SEO-specific functions (sitemap parsing, robots.txt, crawling) |
| google-api-python-client | Connect to the Google Search Console API |
| scrapy | Build powerful web crawlers |
You don't build these from scratch. You install them and use them.
AI Can Write It For You
This is the game-changer. Describe what you want to ChatGPT or Claude:
"Write a Python script that checks all URLs in a CSV file and flags any returning 404 errors"
You'll get working code in seconds. The AI handles syntax. You handle strategy.
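To make that concrete, here's the kind of script such a prompt might come back with — a minimal sketch, assuming the CSV has a column named "url":

```python
import csv
import requests

def flag_404s(csv_path, url_column="url"):
    """Return the URLs in the CSV that respond with HTTP 404."""
    broken = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row[url_column].strip()
            try:
                status = requests.head(url, timeout=10, allow_redirects=True).status_code
            except requests.RequestException:
                status = None  # connection failures; handle separately if you like
            if status == 404:
                broken.append(url)
    return broken
```

Run it with `flag_404s("urls.csv")` and you get a list of dead URLs back. Not production-grade, but a working starting point you got without writing syntax yourself.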
The Community Factor
Stuck on something? Search "python [your problem] seo" and you'll find solutions. Reddit's r/SEO, GitHub repositories, and SEO blogs have shared thousands of scripts. You rarely need to invent solutions from scratch.
Getting Started: Your Python Setup (15 Minutes)
Before you can run scripts, you need Python installed and a place to write code. This takes 15 minutes.
Option 1: Google Colab (Easiest, No Installation)
If you want to skip installation entirely, use Google Colab:
- Go to colab.research.google.com
- Sign in with your Google account
- Click "New Notebook"
- Start writing code
Pros: Nothing to install. Free. Runs in your browser.
Cons: Requires an internet connection. Files don't persist unless you save them.
For beginners, Colab removes every installation headache. I recommend starting here.
Option 2: Local Installation (More Power)
For serious automation, install Python locally:
Windows:
1. Download Python from python.org
2. Run the installer
3. IMPORTANT: Check "Add Python to PATH" during installation
4. Open Command Prompt and type python --version to verify
Mac:
1. Open Terminal
2. Install Homebrew if you haven't: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
3. Run brew install python
4. Verify with python3 --version
Installing Libraries
Most scripts need additional libraries. Install them with pip:
pip install requests beautifulsoup4 pandas advertools
In Google Colab, add an exclamation mark:
!pip install requests beautifulsoup4 pandas advertools
Your First Script Test
Verify everything works with this simple script:
import requests
response = requests.get("https://httpbin.org/status/200")
print(f"Status code: {response.status_code}")
if response.status_code == 200:
print("Python is working. You're ready to automate.")
If you see "Python is working," you're set.
Automating Rank Tracking with Python
Checking rankings manually is tedious. Checking them programmatically through Google Search Console gives you accurate, scalable data.
Why Use GSC Data for Rank Tracking
Third-party rank trackers estimate positions by scraping results. Google Search Console reports the average position Google actually recorded for each query. It's first-party data, which makes it accurate, though keep in mind each number is an average across all impressions.
The catch? The GSC interface limits exports and makes bulk analysis painful. Python fixes that.
Connecting to the Search Console API
First, enable the API and get credentials:
- Go to Google Cloud Console
- Create a new project (or select existing)
- Enable "Google Search Console API"
- Create credentials (OAuth 2.0 Client ID, type "Desktop app")
- Download the JSON credentials file
Then install the Google client library:
pip install google-api-python-client google-auth-oauthlib
Pulling Rank Data at Scale
This script pulls all queries where you appeared in search results:
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
import pandas as pd
# Authentication
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)
# Connect to API
service = build('searchconsole', 'v1', credentials=creds)
# Your website property
site_url = 'https://yourwebsite.com/'
# Query the API
request = {
'startDate': '2025-01-01',
'endDate': '2025-01-28',
'dimensions': ['query', 'page'],
'rowLimit': 25000
}
response = service.searchanalytics().query(siteUrl=site_url, body=request).execute()
# Convert to DataFrame
rows = response.get('rows', [])
data = []
for row in rows:
data.append({
'query': row['keys'][0],
'page': row['keys'][1],
'clicks': row['clicks'],
'impressions': row['impressions'],
'ctr': row['ctr'],
'position': row['position']
})
df = pd.DataFrame(data)
df.to_csv('gsc_rankings.csv', index=False)
print(f"Exported {len(df)} rows to gsc_rankings.csv")
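One caveat: the API returns at most 25,000 rows per request. For larger properties you can page through results with the startRow parameter — a sketch of a helper you could wrap around the request above:

```python
def fetch_all_rows(service, site_url, body, page_size=25000):
    """Page through Search Console results using startRow until a short page arrives."""
    all_rows = []
    start_row = 0
    while True:
        page_body = dict(body, rowLimit=page_size, startRow=start_row)
        response = service.searchanalytics().query(
            siteUrl=site_url, body=page_body).execute()
        rows = response.get('rows', [])
        all_rows.extend(rows)
        if len(rows) < page_size:
            break  # a short (or empty) page means we've seen everything
        start_row += page_size
    return all_rows
```

Swap the single `query(...).execute()` call for `fetch_all_rows(service, site_url, request)` and the rest of the script stays the same.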
Finding Quick-Win Keywords
Once you have the data, find keywords where you're close to page one:
# Load your GSC data
df = pd.read_csv('gsc_rankings.csv')
# Quick wins: position 8-20 with decent impressions
quick_wins = df[
(df['position'] >= 8) &
(df['position'] <= 20) &
(df['impressions'] >= 100)
].sort_values('impressions', ascending=False)
print("Keywords close to page 1 (worth optimizing):")
print(quick_wins[['query', 'position', 'impressions', 'clicks']].head(20))
These keywords need a small push to reach page one. Our guide on mining Google Search Console for keyword opportunities covers how to prioritize and act on these discoveries.
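To rough-rank those quick wins, one illustrative approach is to compare current clicks against the clicks you'd get at an assumed page-one CTR. The 5% default below is a placeholder assumption, not measured data — replace it with your own CTR benchmarks:

```python
import pandas as pd

def estimate_upside(df, target_ctr=0.05):
    """Add a click_upside column: assumed page-one clicks minus current clicks."""
    out = df.copy()
    out['potential_clicks'] = (out['impressions'] * target_ctr).round()
    out['click_upside'] = out['potential_clicks'] - out['clicks']
    return out.sort_values('click_upside', ascending=False)
```

Running `estimate_upside(quick_wins)` puts the keywords with the most to gain at the top of the list.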
Building a Bulk Broken Link Checker
Broken links hurt user experience and waste crawl budget. Checking them manually across hundreds of pages? Nobody has time for that.
The Simple Approach
This script checks a list of URLs and flags any that return errors:
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
def check_url(url):
"""Check if a URL is accessible and return status."""
try:
response = requests.head(url, timeout=10, allow_redirects=True)
return {
'url': url,
'status_code': response.status_code,
'final_url': response.url,
'is_redirect': url != response.url,
'is_broken': response.status_code >= 400
}
except requests.RequestException as e:
return {
'url': url,
'status_code': 'Error',
'final_url': None,
'is_redirect': False,
'is_broken': True,
'error': str(e)
}
# Load URLs from a file (one URL per line)
with open('urls_to_check.txt', 'r') as f:
urls = [line.strip() for line in f if line.strip()]
# Check URLs in parallel (much faster)
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(check_url, urls))
# Convert to DataFrame and save
df = pd.DataFrame(results)
broken = df[df['is_broken'] == True]
print(f"Checked {len(df)} URLs")
print(f"Found {len(broken)} broken links")
df.to_csv('link_audit_results.csv', index=False)
broken.to_csv('broken_links.csv', index=False)
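One refinement worth knowing: some servers answer HEAD requests with 403 or 405 even though the page is fine. A common fix is to retry those with GET before flagging the URL — a sketch you could fold into check_url:

```python
import requests

def resilient_status(url, timeout=10):
    """HEAD first (cheap); fall back to GET when a server rejects HEAD."""
    response = requests.head(url, timeout=timeout, allow_redirects=True)
    if response.status_code in (403, 405, 501):
        response = requests.get(url, timeout=timeout, allow_redirects=True, stream=True)
        response.close()  # stream=True plus close(): skip downloading the body
    return response.status_code
```

This keeps the audit fast (HEAD for the common case) while cutting false positives from HEAD-hostile servers.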
Extracting Links from Your Site
Don't have a URL list? Crawl your site to find all links first:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse
def extract_links(url, base_domain):
"""Extract all links from a page."""
try:
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
links = []
for a_tag in soup.find_all('a', href=True):
href = a_tag['href']
full_url = urljoin(url, href)
            # Record every link: source page, destination, anchor text, internal vs. external
links.append({
'source_page': url,
'link_url': full_url,
'anchor_text': a_tag.get_text(strip=True)[:100],
'is_external': urlparse(full_url).netloc != base_domain
})
return links
except Exception as e:
print(f"Error processing {url}: {e}")
return []
# Extract links from your homepage
base_url = "https://yourwebsite.com"
base_domain = urlparse(base_url).netloc
all_links = extract_links(base_url, base_domain)
df = pd.DataFrame(all_links)
print(f"Found {len(df)} links")
df.to_csv('extracted_links.csv', index=False)
Scaling to Full Site Audits
For comprehensive audits, use the advertools library which handles crawling efficiently:
import advertools as adv
import pandas as pd
# Crawl your entire site
adv.crawl(
'https://yourwebsite.com',
'site_crawl.jl',
follow_links=True
)
# Load and analyze results
df = pd.read_json('site_crawl.jl', lines=True)
# Find crawled URLs that returned 4xx/5xx errors
broken_outbound = df[df['status'] >= 400]
print(f"Pages returning errors: {len(broken_outbound)}")
Automating Keyword Clustering with Python
Grouping hundreds of keywords by topic manually takes hours. Python does it in seconds using semantic similarity.
Why Automate Clustering
Manual clustering works for 50 keywords. But what about 500? Or 5,000? At scale, you need automation.
Automated keyword clustering groups semantically similar terms so you can:
- Plan content that targets multiple related keywords
- Avoid keyword cannibalization
- Identify content gaps in your strategy
Simple Clustering with TF-IDF
This approach groups keywords based on word overlap:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# Your keyword list
keywords = [
"python seo automation",
"automate seo tasks python",
"python script for seo",
"rank tracking automation",
"automated rank checker",
"track rankings python",
"broken link checker",
"find broken links seo",
"link audit automation",
"competitor analysis python",
"automate competitor research",
"seo competitor tracking"
]
# Convert keywords to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keywords)
# Cluster into groups (adjust n_clusters based on your data)
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
clusters = kmeans.fit_predict(X)
# Create results DataFrame
df = pd.DataFrame({
'keyword': keywords,
'cluster': clusters
})
# Display clusters
for i in range(n_clusters):
print(f"\n--- Cluster {i} ---")
cluster_keywords = df[df['cluster'] == i]['keyword'].tolist()
for kw in cluster_keywords:
print(f" {kw}")
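KMeans only gives you numbered groups. To name them, you can pull the heaviest TF-IDF terms from each centroid — a sketch that takes the fitted vectorizer and kmeans objects from the script above (requires numpy):

```python
import numpy as np

def label_clusters(vectorizer, kmeans, top_n=3):
    """Map each cluster id to its top centroid terms, e.g. 'python seo automation'."""
    terms = vectorizer.get_feature_names_out()
    labels = {}
    for i, center in enumerate(kmeans.cluster_centers_):
        top = np.argsort(center)[::-1][:top_n]
        labels[i] = " ".join(terms[j] for j in top)
    return labels
```

Call `label_clusters(vectorizer, kmeans)` after fitting and you get human-readable names to use as working titles for content briefs.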
Advanced Clustering with Embeddings
For better semantic understanding, use sentence embeddings:
# First install: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
# Load embedding model (runs locally, no API needed)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Your keywords
keywords = pd.read_csv('keywords.csv')['keyword'].tolist()
# Generate embeddings
embeddings = model.encode(keywords)
# Cluster based on semantic similarity
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=1.5, # Adjust for tighter/looser clusters
metric='cosine',
linkage='average'
)
clusters = clustering.fit_predict(embeddings)
# Results
df = pd.DataFrame({
'keyword': keywords,
'cluster': clusters
})
df = df.sort_values('cluster')
df.to_csv('clustered_keywords.csv', index=False)
print(f"Created {df['cluster'].nunique()} clusters from {len(keywords)} keywords")
When to Use Automated vs. Tool-Based Clustering
Python clustering gives you control and handles large datasets. But dedicated tools like BrightKeyword add search intent classification and volume data that pure semantic clustering misses.
For most workflows, I recommend:
1. Use BrightKeyword for initial research (gets you intent + metrics)
2. Use Python clustering for massive datasets (10,000+ keywords)
3. Always manually review cluster assignments before acting
See our AI keyword research workflow guide for combining automated discovery with human verification.
Automating Competitor Analysis
Knowing what competitors rank for reveals gaps in your strategy. Python helps you gather and analyze this data systematically.
Scraping Competitor Sitemaps
Every site's sitemap reveals their content structure:
import advertools as adv
import pandas as pd
# Parse competitor sitemaps
competitors = [
'https://competitor1.com/sitemap.xml',
'https://competitor2.com/sitemap.xml',
'https://competitor3.com/sitemap.xml'
]
all_urls = []
for sitemap in competitors:
try:
df = adv.sitemap_to_df(sitemap)
df['competitor'] = sitemap.split('/')[2] # Extract domain
all_urls.append(df)
print(f"Found {len(df)} URLs in {sitemap}")
except Exception as e:
print(f"Error with {sitemap}: {e}")
# Combine all sitemaps
combined = pd.concat(all_urls, ignore_index=True)
combined.to_csv('competitor_urls.csv', index=False)
# Analyze URL patterns
print("\nURL patterns by competitor:")
for domain in combined['competitor'].unique():
domain_urls = combined[combined['competitor'] == domain]
print(f"\n{domain}:")
# Count URLs containing common patterns
blog_count = domain_urls['loc'].str.contains('/blog/', case=False).sum()
product_count = domain_urls['loc'].str.contains('/product', case=False).sum()
print(f" Blog posts: {blog_count}")
print(f" Product pages: {product_count}")
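Hardcoded patterns like /blog/ miss sections you didn't anticipate. A more general sketch tallies the first path segment of every URL, so each competitor's content mix surfaces on its own:

```python
from collections import Counter
from urllib.parse import urlparse

def section_counts(urls, depth=1):
    """Tally the path segment at the given depth, e.g. 'blog' in /blog/post-1."""
    counts = Counter()
    for url in urls:
        parts = [p for p in urlparse(url).path.split('/') if p]
        if len(parts) >= depth:
            counts[parts[depth - 1]] += 1
    return counts
```

Feed it a competitor's URL list and `section_counts(urls).most_common(10)` shows where they invest their content effort.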
Extracting Title Tags at Scale
Title tags reveal keyword targeting strategy:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
def get_title(url):
"""Extract title tag from a URL."""
try:
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title')
return {
'url': url,
'title': title.get_text(strip=True) if title else None
}
except Exception as e:
return {'url': url, 'title': None, 'error': str(e)}
# Load competitor URLs
urls = pd.read_csv('competitor_urls.csv')['loc'].tolist()[:500] # Limit for demo
# Extract titles in parallel
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(get_title, urls))
df = pd.DataFrame(results)
df.to_csv('competitor_titles.csv', index=False)
# Find common title patterns
print("Sample titles extracted:")
print(df[df['title'].notna()].head(20))
Identifying Content Gaps
Compare your content against competitors:
import advertools as adv
import pandas as pd
# Load your sitemap and competitor sitemap
your_urls = adv.sitemap_to_df('https://yoursite.com/sitemap.xml')
competitor_urls = pd.read_csv('competitor_urls.csv')
# Extract topics from URLs (simple approach using URL segments)
def extract_topic(url):
"""Extract likely topic from URL path."""
path = url.split('/')
# Get meaningful path segments (skip empty and short ones)
segments = [s for s in path if len(s) > 3 and s not in ['www', 'com', 'https:', 'http:']]
return ' '.join(segments[-2:]) if segments else None
your_urls['topic'] = your_urls['loc'].apply(extract_topic)
competitor_urls['topic'] = competitor_urls['loc'].apply(extract_topic)
# Find topics competitors cover that you don't
your_topics = set(your_urls['topic'].dropna().str.lower())
competitor_topics = set(competitor_urls['topic'].dropna().str.lower())
gaps = competitor_topics - your_topics
print(f"\nTopics competitors cover that you don't ({len(gaps)} found):")
for topic in list(gaps)[:20]:
print(f" - {topic}")
Log File Analysis for Technical SEO
Server logs reveal how search engines actually crawl your site. This data is invisible in Google Search Console but critical for technical SEO.
What Log Analysis Reveals
Your server logs show:
- Which pages Googlebot crawls most frequently
- Crawl budget waste on low-value pages
- Pages that get crawled but aren't indexed
- Crawl errors you didn't know existed
- Bot behavior patterns over time
Parsing Log Files with Python
Server logs are typically in Combined Log Format. Here's how to parse them:
import pandas as pd
import re
from datetime import datetime
# Log format regex (Combined Log Format)
log_pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d+) (\d+|-)'
def parse_log_line(line):
"""Parse a single log line into components."""
match = re.match(log_pattern, line)
if match:
return {
'ip': match.group(1),
'timestamp': match.group(2),
'method': match.group(3),
'url': match.group(4),
'status': int(match.group(5)),
'size': match.group(6)
}
return None
# Parse log file
log_entries = []
with open('access.log', 'r') as f:
for line in f:
parsed = parse_log_line(line)
if parsed:
log_entries.append(parsed)
df = pd.DataFrame(log_entries)
print(f"Parsed {len(df)} log entries")
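The timestamp column is still a string like "10/Oct/2025:13:55:36 +0000". Converting it lets you slice crawl activity by day or hour — a sketch using pandas:

```python
import pandas as pd

def add_datetime(df, column='timestamp'):
    """Parse Apache-style timestamps so crawls can be grouped by date or hour."""
    out = df.copy()
    out['datetime'] = pd.to_datetime(out[column], format='%d/%b/%Y:%H:%M:%S %z')
    return out
```

After `df = add_datetime(df)`, something like `df.groupby(df['datetime'].dt.date).size()` charts crawl volume over time.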
Identifying Googlebot Crawls
Filter for search engine bots:
# Quick heuristic: most Googlebot requests come from IPs starting with 66.249.
# For production, verify crawlers properly (reverse DNS against Google's
# published process, or check the User-Agent header if your log format captures it)
googlebot_ips = df[df['ip'].str.startswith('66.249.', na=False)]
print(f"Googlebot requests: {len(googlebot_ips)}")
# Most crawled pages
top_crawled = googlebot_ips['url'].value_counts().head(20)
print("\nMost crawled pages by Googlebot:")
print(top_crawled)
# Crawl budget waste: bot requests to non-indexable pages
wasted_crawls = googlebot_ips[
googlebot_ips['url'].str.contains('/tag/|/page/|facet=|sort=', case=False)
]
print(f"\nPotentially wasted crawls: {len(wasted_crawls)}")
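IP prefixes can be spoofed in logs, so Google's documented verification is a reverse DNS lookup followed by a forward confirmation. A sketch (the lookups are slow, so cache the result per IP in real use):

```python
import socket

def looks_like_google(hostname):
    """Genuine crawler hostnames end in googlebot.com or google.com."""
    return hostname.endswith('.googlebot.com') or hostname.endswith('.google.com')

def verify_googlebot(ip):
    """Reverse DNS the IP, then confirm the hostname resolves back to it."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        if not looks_like_google(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:  # covers herror/gaierror lookup failures
        return False
```

Run `verify_googlebot` only on the unique IPs that pass the prefix filter; that keeps the expensive DNS work to a handful of addresses.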
Finding Crawl Anomalies
Spot pages that get heavy bot traffic but shouldn't:
# Group by URL and calculate stats
url_stats = googlebot_ips.groupby('url').agg({
'ip': 'count', # Request count
'status': lambda x: (x >= 400).sum() # Error count
}).rename(columns={'ip': 'crawl_count', 'status': 'error_count'})
# High crawl + high error = problem
problems = url_stats[
(url_stats['crawl_count'] > 10) &
(url_stats['error_count'] > 5)
].sort_values('crawl_count', ascending=False)
print("URLs with high crawl rate AND errors:")
print(problems.head(20))
Scheduling and Running Scripts Automatically
Running scripts manually defeats the purpose of automation. Schedule them to run on their own.
Option 1: Cron Jobs (Mac/Linux)
Cron runs scripts on a schedule:
# Open crontab editor
crontab -e
# Run rank tracker every Monday at 6 AM
0 6 * * 1 /usr/bin/python3 /path/to/rank_tracker.py
# Run broken link checker weekly on Sunday
0 3 * * 0 /usr/bin/python3 /path/to/link_checker.py
Option 2: Task Scheduler (Windows)
- Open Task Scheduler
- Create Basic Task
- Set trigger (daily, weekly, etc.)
- Action: Start a program
- Program: python.exe
- Arguments: C:\path\to\your\script.py
Option 3: GitHub Actions (Cloud, Free)
Run scripts in the cloud without your computer:
# .github/workflows/seo-automation.yml
name: Weekly SEO Report
on:
schedule:
- cron: '0 6 * * 1' # Every Monday at 6 AM UTC
jobs:
run-report:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run script
run: python rank_tracker.py
Sending Results via Email
Have scripts email you the results:
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders
def send_report(recipient, subject, body, attachment_path=None):
"""Send an email with optional attachment."""
sender = "your-email@gmail.com"
password = "your-app-password" # Use app password, not account password
msg = MIMEMultipart()
msg['From'] = sender
msg['To'] = recipient
msg['Subject'] = subject
msg.attach(MIMEText(body, 'plain'))
if attachment_path:
with open(attachment_path, 'rb') as f:
part = MIMEBase('application', 'octet-stream')
part.set_payload(f.read())
            encoders.encode_base64(part)
part.add_header('Content-Disposition', f'attachment; filename={attachment_path}')
msg.attach(part)
with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
server.login(sender, password)
server.send_message(msg)
# Example usage after running your script:
send_report(
recipient="you@yourcompany.com",
subject="Weekly Rank Tracking Report",
body="Attached is this week's ranking data.",
attachment_path="gsc_rankings.csv"
)
Using ChatGPT to Write Custom Scripts
You don't need to memorize Python syntax. AI assistants generate working code from plain English descriptions.
Effective Prompts for SEO Scripts
Be specific about:
- Input format (CSV, API, URL list)
- Desired output (CSV, email, database)
- Edge cases (what if a URL fails? what if data is missing?)
Good prompt:
Write a Python script that:
1. Reads a CSV file called "keywords.csv" with a column named "keyword"
2. For each keyword, searches Google and extracts the titles of the top 10 results
3. Saves results to a new CSV with columns: keyword, position, title, url
4. Handles errors gracefully (if a request fails, log it and continue)
5. Adds a 2-second delay between requests to avoid rate limiting
Bad prompt:
Write a Python script for SEO
Iterating on AI Output
AI rarely produces perfect code on the first try. Common iterations:
- "This throws an error on line X. Fix it."
- "Make this run faster using parallel processing."
- "Add error handling so it doesn't crash on bad URLs."
- "Export the results to both CSV and JSON."
Each iteration refines the script. Save working versions so you can revert if an iteration breaks something.
Building a Personal Script Library
Over time, collect scripts that solve your recurring problems:
/seo-scripts/
/rank-tracking/
gsc_rank_export.py
quick_wins_finder.py
/technical/
broken_link_checker.py
sitemap_validator.py
redirect_chain_checker.py
/content/
keyword_clusterer.py
title_extractor.py
/competitor/
sitemap_scraper.py
content_gap_analyzer.py
Document each script with a comment header explaining what it does, required inputs, and expected outputs.
Frequently Asked Questions
How to automate keyword research with Python?
Python can automate keyword data collection by connecting to APIs (Google Search Console, DataForSEO, SEMrush API) and processing the results. You can cluster keywords using machine learning libraries like scikit-learn. However, for complete keyword research including search volume and competition metrics, you'll need API access to data providers. See our AI keyword research workflow for combining automation with validation.
Do I need to learn Python to automate SEO tasks?
Not deeply. With AI assistants like ChatGPT, you can describe what you want in English and get working code. You need enough Python knowledge to run scripts, install libraries, and debug basic errors. Most SEO professionals can be productive within a few hours of learning.
What are the best Python libraries for SEO?
The essential libraries are: requests for fetching URLs, beautifulsoup4 for parsing HTML, pandas for data manipulation, advertools for SEO-specific functions, and google-api-python-client for Search Console integration. For advanced work, add scrapy for crawling and sentence-transformers for semantic analysis.
How do I connect Python to Google Search Console?
Enable the Search Console API in Google Cloud Console, create OAuth credentials, download the credentials JSON file, and use the google-api-python-client library to authenticate and query. The script in the rank tracking section of this guide shows the complete process.
Can Python replace paid SEO tools?
Python can replicate many functions of paid tools: rank tracking, broken link checking, sitemap analysis, and basic crawling. However, paid tools offer advantages: maintained databases of backlinks and keywords, user-friendly interfaces, and support. Python works best for custom automation that tools don't offer or for processing data from multiple sources.
How often should I run automated SEO scripts?
It depends on the task. Rank tracking weekly is sufficient for most sites. Broken link checks monthly. Competitor analysis quarterly. Log file analysis weekly if you're actively troubleshooting crawl issues. More frequent runs waste resources without providing actionable insights.
Stop Doing Manually What Machines Do Better
Every hour you spend on repetitive SEO tasks is an hour you're not spending on strategy, content, or analysis. Python automation isn't about becoming a developer. It's about removing busywork so you can focus on work that actually moves rankings.
Here's what to do next:
- Pick one task you do repeatedly (checking rankings, finding broken links, pulling reports)
- Set up Google Colab to avoid installation headaches
- Copy a script from this guide and run it with your data
- Ask ChatGPT to modify it for your specific needs
- Schedule it so it runs without you
Start small. Automate one thing. Then another. Within a month, you'll wonder how you ever operated without it.
The SEOs who embrace automation spend their time on strategy. The ones who resist it spend their time on spreadsheets. Choose wisely.