Aditya Sundar - Waseda University
Table of Contents
- Table of Contents
- 1. Introduction
- 2. Methodology
- 2.1 Overview
- 2.2 Technology Stack
- 2.3 Crawler Workflow
- 2.4 Performance Considerations
- 2.5 Data Storage and Management
- 2.6 Data Sources and Target Websites
- 3. Results and Discussion
- 3.1 Overview of Crawling Results
- 3.2 Crawling Speed and Efficiency
- 3.3 Error Analysis
- 3.4 Limitations
- 3.5 Considerations for Optimizing Web Crawling
- 4. Next Steps and Future Directions
1. Introduction
Web crawling is a fundamental process on the internet: it enables automated systems to systematically browse the web and gather data from websites. This technology has a wide range of applications, from search engine indexing to data mining and content aggregation. By following links from a starting point, web crawlers traverse web pages, collecting information that can be stored and analyzed. This process is vital for search engines like Google, which rely on crawlers to index content and make it easily accessible to users through search queries.
Nowadays, most websites have a sitemap, a crucial tool for web crawlers as it provides a structured list of URLs within a domain. It serves as a guide, ensuring all the important pages of a domain are indexed efficiently. Sitemaps are valuable, especially in large websites where certain pages may not be easily discoverable through standard navigation. However, not all websites contain sitemaps, which can leave crawlers with the challenging task of exhaustively exploring the site to capture all URLs.
This project focuses on developing a manual web crawler designed to systematically explore websites and generate comprehensive lists of URLs within a single domain, regardless of whether a sitemap is provided. The crawler not only works effectively on sites without sitemaps but was also employed on sites with existing sitemaps to create external sitemaps—detailed records of URLs that reference another domain. This extended capability is valuable for ensuring complete coverage of a site’s content, which is crucial for advanced SEO strategies and data analysis.
2. Methodology
2.1 Overview
The primary goal of this project was to implement a method for retrieving all URLs within a specific domain, whether or not it has a sitemap. The web crawler was designed to run on a low-spec laptop, even without proxies, while still handling large-scale websites with extensive internal and external linking. The crawler generates a comprehensive list of internal and external URLs that can be used for further analysis. For websites that render content dynamically, a dynamic web crawler built on Playwright was used.
2.2 Technology Stack
To achieve these objectives, the web crawler was built with the following technologies:
- Python: The primary programming language used, due to the number of useful libraries available for web scraping and asynchronous operations.
- requests: Used for handling synchronous HTTP requests; initially utilized to obtain crawling rules from robots.txt.
- aiohttp: Used for handling asynchronous HTTP requests for improved concurrency.
- BeautifulSoup: Employed for parsing HTML content and extracting links from web pages.
- asyncio: Python’s standard library for writing concurrent code with the async/await syntax.
- SQLite: A lightweight database used to store URLs, external domain references, and errors encountered while crawling, reducing memory usage and providing data retrieval and persistence during the crawl.
- fake_useragent: Utilized to randomly generate HTTP headers, simulating different user agents to avoid detection as a bot (a brief sketch follows this list).
- cchardet: Used for detecting and handling different character encodings in web content.
- playwright: A headless browser automation library used to crawl websites that render content dynamically. It handles dynamic content loading, simulates user interactions, and fetches content rendered via JavaScript, which is often missed by traditional crawlers.
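The code listings below reference a global headers dictionary. A minimal sketch of how such headers might be generated with fake_useragent is shown here; the exact fields are assumptions, since only the header randomization itself is described in this report.

# Minimal sketch (assumed): generating randomized request headers with fake_useragent.
from fake_useragent import UserAgent

ua = UserAgent()

def build_headers():
    # A fresh User-Agent string per call makes the crawler look like different browsers.
    return {
        'User-Agent': ua.random,
        'Accept-Language': 'ja,en;q=0.8',  # assumed value, since target sites were mostly Japanese
    }

headers = build_headers()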
2.3 Crawler Workflow
- Initialization: The crawler starts by normalizing the provided start URL, ensuring a consistent format and removing unnecessary fragments. Before crawling begins, the existence of a robots.txt file is checked so that the crawling rules specified by the domain are respected. The tables for storing results during crawling are also initialized in an SQLite database.
def check_robots_txt(url, timeout=5, retry_count=3, retry_delay=2):
    """
    Check the robots.txt file for the given URL. If the robots.txt file contains
    only comments or no useful rules, default to allowing all crawling.
    """
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    rp.set_url(robots_url)
    for attempt in range(retry_count):
        try:
            response = requests.get(robots_url, timeout=timeout)
            if response.status_code < 400:
                rp.parse(response.text.splitlines())
                print(f"Found robots.txt: Parsed entries from {robots_url}: {response.text.splitlines()}")
                return
            else:
                print(f"No robots.txt found at {robots_url}, allowing crawling.")
                rp.parse(['User-agent: *', 'Disallow:'])
                return
        except requests.Timeout:
            print(f"Timeout accessing robots.txt at {robots_url}. Retrying...")
        except requests.RequestException as e:
            print(f"Error accessing robots.txt at {robots_url}: {e}, retrying in {retry_delay} seconds...")
        time.sleep(retry_delay)
    # If all attempts fail, allow crawling by default
    print(f"Failed to access robots.txt at {robots_url} after {retry_count} attempts, allowing crawling.")
    rp.parse(['User-agent: *', 'Disallow:'])
Using urllib.robotparser, a robots parser is initialized at the start. The function locates and fetches the crawling rules set by the domain: it constructs the robots.txt URL from the start URL and sends a request, attempting three times by default with a delay between retries if required. If robots.txt is located, the rules it provides are followed; otherwise, crawling is assumed to be allowed by default. This approach balances the need to respect website restrictions with the practicalities of robust web crawling in the face of unreliable network conditions.
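For completeness, the global parser rp referenced in the listing can be created with Python’s standard library. A minimal sketch follows; the variable names mirror the listings, but the exact initialization is an assumption.

# Minimal sketch (assumed): module-level robots.txt parser shared by the crawler.
import time
import requests
from urllib import robotparser
from urllib.parse import urlparse

rp = robotparser.RobotFileParser()

# Usage: populate the parser before crawling, then consult it per URL.
# check_robots_txt("https://example.com/")
# allowed = rp.can_fetch("*", "https://example.com/some/page")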
- URL Fetching and Parsing: The crawler sends HTTP requests to the target URLs asynchronously, in batches. Responses are cached to avoid redundant requests. The HTML content is parsed with BeautifulSoup, extracting only anchor (<a>) tags that contain internal or external references. An appropriate delay is added to each request so that the server is not overloaded.
async def async_crawl(session, urls, parent_urls, base_domain, batch_size=1000):
    global skipped_urls, total_crawled  # Add total_crawled to track across batches
    total_urls = len(urls)
    # Clear the temp table
    cursor.execute('DELETE FROM temp_content')
    conn.commit()
    for i in range(0, total_urls, batch_size):
        if total_crawled >= MAX_CRAWLED_URLS:
            print(f"Reached the maximum limit of {MAX_CRAWLED_URLS} crawled URLs. Stopping.")
            break
        start_time = time.time()
        good_urls = 0
        batch = urls[i:i + batch_size]
        parent_batch = parent_urls[i:i + batch_size]
        tasks = [fetch_page(session, url, parent_url) for url, parent_url in zip(batch, parent_batch)]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        print(f"------------------\nBatch {i//batch_size + 1} / {total_urls//batch_size + 1} complete...")
        if len(errors) > 1000:
            flush_errors_to_db()
        for url, (content, is_file, fetched_url, parent_url) in zip(batch, results):
            if content is not None or is_file:
                url = normalize_url(url)
                new_links, parent_url = extract_links(url, content, base_domain, parent_url) if content else (set(), parent_url)
                new_links_json = json.dumps(list(new_links))  # Convert the set to a JSON string
                cursor.execute('''
                    INSERT INTO temp_content (fetched_url, new_links, parent_url)
                    VALUES (?, ?, ?)
                ''', (fetched_url, new_links_json, parent_url))
                good_urls += 1
                total_crawled += 1  # Increment the global counter for crawled URLs
        conn.commit()
        print(f"Successfully fetched URLs: {good_urls}, Skipped URLs: {skipped_urls}")
        skipped_urls = 0  # Reset for the next batch
        print(f"Flushed to db...\nRemaining: {max(0, total_urls - i - batch_size)}")
        end_time = time.time()
        print(f"Time taken: {end_time - start_time:.2f} seconds")
        print(f"------------------\nTotal crawled: {total_crawled} / {MAX_CRAWLED_URLS}\n------------------")
        await asyncio.sleep(5)
    # Retrieve all crawled data from the temp table
    cursor.execute('SELECT fetched_url, new_links, parent_url FROM temp_content')
    crawled_data = cursor.fetchall()
    # Convert the new_links back to a set
    processed_crawled_data = [
        (fetched_url, set(json.loads(new_links)) if new_links else set(), parent_url)
        for fetched_url, new_links, parent_url in crawled_data
    ]
    return processed_crawled_data
This function is responsible for efficiently fetching and processing batches of URLs using asyncio and an HTTP session initialized by aiohttp. The function processes URLs in batches to manage memory usage; the batch size is set by the batch_size parameter, which defaults to 1000. For each batch, the function fetches the content of the URLs asynchronously, which speeds up the crawling process.
To store intermediate results, such as retrieved links, before returning them, a temporary table, temp_content, is used. A timer tracks the crawler’s performance for each batch, and the results of each batch are printed accordingly.
If the global limit on the maximum number of crawled URLs is reached, the crawl is stopped.
async def fetch_page(session, url, parent_url, retry_count=3, retry_delay=7, timeout_sec=20):
    """
    Fetch the content of the given URL with retry on server errors.
    """
    global skipped_urls
    url = normalize_url(url)
    if is_url_in_db(url):
        return None, False, url, parent_url
    if not rp.can_fetch("*", url):
        skipped_urls += 1
        errors.append({'url': url, 'error': 'robots.txt restriction'})
        return None, False, url, parent_url
    timeout = aiohttp.ClientTimeout(total=timeout_sec)
    for attempt in range(retry_count):
        try:
            async with session.get(url, headers=headers, timeout=timeout) as response:
                default_delay = random.uniform(1, 5)
                if response.status in [404, 403, 400, 500, 429, 502]:
                    skipped_urls += 1
                    if response.status == 429 and 'Retry-After' in response.headers:
                        retry_after = int(response.headers['Retry-After'])
                        print(f"HTTP 429: Retrying after {retry_after} seconds")
                        await asyncio.sleep(retry_after)
                    errors.append({'url': url, 'error': f'HTTP {response.status}'})
                    if response.status == 400 or response.status == 403 or response.status == 404:
                        return None, False, url, parent_url  # No point retrying these
                    await asyncio.sleep(retry_delay)
                    continue  # Retry the request
                if response.status != 200:
                    skipped_urls += 1
                    errors.append({'url': url, 'error': f'HTTP {response.status}'})
                    return None, False, url, parent_url
                if is_file_url(url):
                    return None, True, url, parent_url
                try:
                    raw_content = await response.read()
                    if not raw_content:
                        raise ValueError("Empty response content")
                    detected_encoding = cchardet.detect(raw_content)['encoding']
                    encoding = detected_encoding if detected_encoding else 'utf-8'
                    content = raw_content.decode(encoding, errors='replace')
                except (UnicodeDecodeError, ValueError, Exception) as e:
                    skipped_urls += 1
                    errors.append({'url': url, 'error': f'Decoding error: {str(e)}'})
                    return None, False, url, parent_url
                await asyncio.sleep(default_delay)
                return content, False, url, parent_url
        except (aiohttp.ClientError, asyncio.TimeoutError, Exception) as e:
            errors.append({'url': url, 'error': f'Other error: {str(e)}'})
            await asyncio.sleep(retry_delay)
            continue  # Retry the request
    skipped_urls += 1
    errors.append({'url': url, 'error': 'Max retry attempts reached'})
    return None, False, url, parent_url
This function is responsible for fetching a page’s content and returning it. If an error occurs during a request, it is handled by logging the error and, if appropriate, retrying after a delay. If the response is successful, the encoding is detected using cchardet; if it cannot be detected, the crawler falls back to utf-8.
URLs that already exist in the database are not fetched, and URLs that link to a file rather than a webpage are also ignored. Before fetching, the function checks the crawling rules defined in robots.txt and proceeds accordingly.
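The is_url_in_db helper used above is not reproduced in this report. A minimal sketch of how it could query the indexed parsed_url column is given below; the actual implementation may differ.

# Minimal sketch (assumed): check whether a URL has already been stored.
def is_url_in_db(url):
    # The idx_parsed_url index keeps this lookup fast even with hundreds of
    # thousands of stored rows.
    cursor.execute('SELECT 1 FROM urls WHERE parsed_url = ? LIMIT 1', (url,))
    return cursor.fetchone() is not None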
- Link Extraction and Normalization: All extracted links are normalized to ensure consistency, and those that match the base domain are considered internal links and are added to the list of URLs that are to be crawled. External links are recorded and stored separately in a dedicated database table.
def extract_links(url, content, base_domain, parent_url):
    global external_links_set
    url = normalize_url(url)
    if is_url_in_db(url):
        return set(), parent_url
    internal_links = set()
    soup = BeautifulSoup(content, 'lxml', parse_only=SoupStrainer('a'))
    for a in soup.find_all('a', href=True):
        try:
            href = a['href']
            # Handle the base URL for relative paths
            parsed_url = urlparse(url)
            # If the path ends in a file-like structure (e.g., .html, .php, etc.), strip the file name
            if not parsed_url.path.endswith('/') and '.' in parsed_url.path.split('/')[-1]:
                base_url = url.rsplit('/', 1)[0] + '/'
            else:
                base_url = url
            link = urljoin(base_url, href)
            link = normalize_url(link)
            if urlparse(link).netloc == base_domain:
                if not is_url_in_db(link):
                    internal_links.add(link)
            else:
                domain = urlparse(link).netloc
                if domain:
                    external_links_set.add(domain)
        except ValueError as ve:
            error_message = f"Skipping malformed URL in href '{a['href']}': {ve}"
            print(error_message)
            errors.append({'url': a['href'], 'error': error_message})
        except Exception as e:
            error_message = f"General error processing link '{a['href']}': {e}"
            print(error_message)
            errors.append({'url': a['href'], 'error': error_message})
    # Flush external links to the database if the set size exceeds 1000
    if len(external_links_set) >= 1000:
        flush_external_links()
    return internal_links, parent_url
This function is responsible for extracting all links found on the page. External references are identified by checking the netloc and are stored separately. To resolve relative URLs, a trailing slash is appended to the base URL before it is joined with the href. The global set of external links is flushed to the database every 1,000 entries to avoid a drop in performance.
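The flush_external_links helper called here is not reproduced in this report. A plausible sketch, assuming the external_domains table defined later and an assumed counting scheme, is:

# Minimal sketch (assumed): persist the in-memory set of external domains.
def flush_external_links():
    global external_links_set
    for domain in external_links_set:
        # Insert the domain if unseen, otherwise bump its reference count
        # (the exact counting scheme is an assumption).
        cursor.execute('''
            INSERT INTO external_domains (domain, reference_count)
            VALUES (?, 1)
            ON CONFLICT(domain) DO UPDATE SET reference_count = reference_count + 1
        ''', (domain,))
    conn.commit()
    external_links_set = set()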
- Data Storage: To manage memory constraints, links were fetched and parsed in batches, then stored in SQLite databases rather than being kept in memory while crawling. Moreover, an indexed column was used to ensure efficient querying and to avoid duplicate URLs.
# Connect to SQLite database and create tables if they don't exist
conn = sqlite3.connect('crawled_urls.db')
cursor = conn.cursor()

# Drop the tables if they already exist
cursor.execute('DROP TABLE IF EXISTS urls')
cursor.execute('DROP TABLE IF EXISTS crawl_errors')
cursor.execute('DROP TABLE IF EXISTS external_domains')

# Create the urls table with an additional parent_url column
cursor.execute('''
    CREATE TABLE IF NOT EXISTS urls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        site_url_domain TEXT,
        parsed_url TEXT UNIQUE,
        parent_url TEXT,
        depth INTEGER
    )
''')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS crawl_errors (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        error TEXT
    )
''')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS external_domains (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        domain TEXT UNIQUE,
        reference_count INTEGER
    )
''')
conn.commit()

# Create an index to speed up searches
cursor.execute('CREATE INDEX IF NOT EXISTS idx_parsed_url ON urls(parsed_url)')
conn.commit()
A database and the appropriate tables are initialized before crawling. The code sets up an SQLite database with three tables, which is essential for tracking internal URLs, external domain references, and any errors that arise while crawling.
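The save_to_db and flush_errors_to_db helpers referenced throughout the listings are omitted from this report. Minimal sketches consistent with the schema above might look like this; they are assumptions, not the exact implementation.

# Minimal sketches (assumed): batch inserts into the tables created above.
def save_to_db(rows):
    # rows: list of (site_url_domain, parsed_url, parent_url, depth) tuples.
    cursor.executemany('''
        INSERT OR IGNORE INTO urls (site_url_domain, parsed_url, parent_url, depth)
        VALUES (?, ?, ?, ?)
    ''', rows)
    conn.commit()

def flush_errors_to_db():
    global errors
    cursor.executemany(
        'INSERT INTO crawl_errors (url, error) VALUES (?, ?)',
        [(e['url'], e['error']) for e in errors]
    )
    conn.commit()
    errors = []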
- Concurrency Management: The crawler is designed to handle multiple URLs simultaneously using asynchronous requests (aiohttp). This approach significantly improves performance, especially for large-scale crawls, although it is constrained by the available system resources (RAM and CPU). The number of concurrent requests can be adjusted manually based on the web server’s restrictions.
async def crawl_website(start_url):
    global total_crawled
    start_url = normalize_url(start_url)
    base_domain = urlparse(start_url).netloc
    to_crawl = {start_url}
    crawled = set()
    next_to_crawl = set()
    depth = 0
    parent_urls = {start_url: None}  # Map of URLs to their parent URLs
    connector = aiohttp.TCPConnector(limit=100)  # Number of concurrent requests
    async with aiohttp.ClientSession(connector=connector) as session:
        while to_crawl and total_crawled < MAX_CRAWLED_URLS:
            print(f"\n\n===== Total Crawled: {total_crawled} =====\nLeft to crawl for now: {len(to_crawl)}")
            parent_batch = [parent_urls[url] for url in to_crawl]
            crawled_data = await async_crawl(session, list(to_crawl), parent_batch, base_domain)
            print(f"\n\n----- Fetched data from {len(to_crawl)} sites -----")
            urls_to_save = []
            for current_url, new_links, parent_url in crawled_data:
                normalized_url = normalize_url(current_url)
                # Check if the URL is already in the database or the current crawled set
                if normalized_url not in crawled and not is_url_in_db(normalized_url):
                    for link in new_links:
                        parent_urls[link] = current_url  # Set the parent_url for the new links
                    next_to_crawl.update(new_links - crawled)
                    crawled.add(normalized_url)
                    parsed_url = urlparse(normalized_url)
                    site_url_domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
                    urls_to_save.append((site_url_domain, normalized_url, parent_url, depth))
                    # Save to database every 1000 entries
                    if len(urls_to_save) >= 1000:
                        save_to_db(urls_to_save)
                        urls_to_save.clear()
                        crawled.clear()
            # Save any remaining URLs in the batch
            if urls_to_save:
                save_to_db(urls_to_save)
            # Clear crawled data to save memory
            crawled_data.clear()
            crawled.clear()
            flush_external_links()
            flush_errors_to_db()
            to_crawl, next_to_crawl = next_to_crawl, set()
            depth += 1
            # Stop if the limit is reached
            if total_crawled >= MAX_CRAWLED_URLS:
                print(f"Reached the maximum limit of {MAX_CRAWLED_URLS} crawled URLs. Stopping.")
                break
    print(f"Total URLs crawled: {total_crawled}")
    return crawled
This function creates an asynchronous HTTP session and manages the database writes for the retrieved URLs. Each iteration of the loop corresponds to a single level of depth.
The algorithm used is BFS (Breadth-First Search): it explores all URLs at the current level before moving on to the next, so the crawler systematically processes URLs in layers and expands outward. BFS was chosen because it captures the website level by level, leaving a meaningful partial result even if the crawler is interrupted.
The function also periodically saves the crawled URLs to the database and, at each depth, flushes the logged errors and external references. The crawl stops once the maximum limit of crawled URLs has been reached.
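As a usage sketch, the whole pipeline can be driven from a small entry point along these lines; the start URL, the crawl limit, and the global counter initialization are placeholders and assumptions rather than the exact script used.

# Minimal usage sketch (assumed entry point).
MAX_CRAWLED_URLS = 50000          # placeholder global cap used by the crawler
total_crawled = 0                 # global counter shared by the coroutines

if __name__ == '__main__':
    start = 'https://example.com/'            # placeholder start URL
    check_robots_txt(start)                   # fetch crawling rules first
    asyncio.run(crawl_website(start))         # BFS crawl until the cap or exhaustion
    conn.close()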
- Error Handling: The crawler includes robust error-handling mechanisms: failed requests are retried after a delay (or after the interval given by the Retry-After header for HTTP 429), and all errors, such as HTTP status codes and decoding issues, are logged to the database for further analysis.
- Completion and Data Export: For testing purposes, the crawler currently enforces a maximum limit on the number of crawled URLs, but it naturally comes to a stop once there are no new URLs left to crawl within the given domain.
2.4 Performance Considerations
Given the limitations of the available hardware, the crawler was designed to balance performance and resource consumption. Asynchronous operations were employed to maximize concurrency without overwhelming system resources. The use of caching further optimized performance by reducing redundant HTTP requests. SQLite databases were used to store data during execution rather than keeping them in memory, hence improving efficiency and reducing memory usage for large websites.
2.5 Data Storage and Management
The crawler uses SQLite to store and manage data throughout the crawl. URLs, errors, and external domains are recorded in separate tables, with indexes on key columns to speed up queries. Data is flushed to the database promptly to reduce memory pressure and to ensure persistence in case of a runtime error.
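Because everything lives in SQLite, results can be inspected or exported with ordinary SQL after a crawl. A sketch, assuming the schema shown earlier:

# Sketch: exporting crawled URLs to CSV after a run.
import csv
import sqlite3

with sqlite3.connect('crawled_urls.db') as conn_ro:
    cur = conn_ro.cursor()
    cur.execute('SELECT parsed_url, parent_url, depth FROM urls ORDER BY depth')
    with open('urls_export.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'parent_url', 'depth'])
        writer.writerows(cur.fetchall())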
2.6 Data Sources and Target Websites
Initially, the data sources were various government websites spanning different faculties and departments, since few of them had sitemaps; the aim was to capture a complete list of URLs, and these sites were rarely too large to crawl. When the additional objective of retrieving external references was introduced, the focus shifted to websites that reference a large number of external domains, such as media sites, blogs, and news sites, regardless of whether they had a sitemap. All targeted sites were primarily in Japanese.
3. Results and Discussion
3.1 Overview of Crawling Results
The web crawlers were deployed across various types of websites, including government portals, media sites, blogs, and news platforms. Across these diverse domains, the crawlers retrieved a significant portion of each site's URLs, and in many cases all of them. Most sites yielded around 50,000 URLs before the crawling process was intentionally halted due to time and hardware limitations.
Website                  Internal URLs    External References
==============================================================
Entertainment Site 1             47870                7010999
Gaming Site 2                   143058                3027460
Press Release Site 1             99243                2136887
News Site 6                     117164                1895515
Entertainment Site 3             62825                1794906
News Site 3                     102180                1787551
Entertainment Site 2             82511                1060470
Government Site 1                28841                  55700
News Site 4                      24278                  50871
Education Site 1                 50458                  28328
News Site 2                      35514                  26885
Tech News Site 2                 21380                  22289
News Site 5                      50250                  19009
Tech News Site 1                 50108                  14004
Tech News Site 3                 50588                  13121
Health Site 1                     2478                   6266
Gaming Site 1                    50232                   3066
Science Site 1                   50798                   2533
News Site 1                      50031                    970
Government Site 2                 1600                    210
==============================================================
Total                          1121407               18957040
The following graph (Fig. 2) visualizes the ratio of external references to internal links crawled for each domain. Each bar represents a different domain, and its height indicates the ratio.
From the graph, it is clear that entertainment, gaming, and news sites tend to have far more external references than government sites, which suggests that government sites focus more on internal content and reference few external domains.
In terms of SEO, a higher ratio of external references can be beneficial for media sites such as entertainment, gaming, and news sites, as it may increase perceived credibility and attract backlinks from reputable sites. For government sites, however, a high ratio is less important, as their credibility is largely inherent; such sites should focus on internal linking rather than risk linking to low-quality external sites. A balanced approach is often considered best, ensuring that external links are meaningful and contribute positively to SEO goals and user experience.
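The ratios behind the graph can be reproduced directly from the result table. The sketch below illustrates the calculation with two example rows taken from the table above; the full dictionary would contain every site.

# Sketch: computing the external-to-internal ratio per domain from the results table.
results = {
    # site: (internal_urls, external_references) -- values from the table above
    'Entertainment Site 1': (47870, 7010999),
    'Government Site 1': (28841, 55700),
}

for site, (internal, external) in sorted(results.items(),
                                         key=lambda kv: kv[1][1] / kv[1][0],
                                         reverse=True):
    print(f"{site}: {external / internal:.1f} external references per internal URL")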
3.2 Crawling Speed and Efficiency
The speed of the crawlers varied significantly depending on the website’s structure and the server’s response time. Websites with simpler structures and fast response times allowed the crawlers to operate efficiently, retrieving the URLs quickly with minimal errors. On the other hand, websites with an extra layer of protection or slow response times resulted in slower crawl rates.
3.3 Error Analysis
Errors were an inevitable part of the crawling process, arising from various factors:
- Dead Links: Many errors resulted from attempts to access dead or broken links, which returned 404 pages or timed out during the request.
- Rate Limiting: Some websites imposed a strict rate limit on the number of requests the crawler could make within a certain period, which would normally lead to 429 (Too Many Requests) errors. However, due to careful rate management, these errors were relatively rare.
- Decoding Issues: In some cases, the crawler encountered links pointing to file formats that could not be parsed, such as videos or other documents, or pages with inconsistent or incorrect character encodings. These errors were logged, while the successfully retrieved URLs were saved, allowing the crawler to continue its operation.
- Robots.txt Restrictions: Certain pages were skipped due to restrictions defined in the robots.txt files of the website. These files are used by domains to guide the behavior of crawlers, specifying restricted portions of the domain. Respecting these restrictions was crucial to ensuring ethical and responsible crawling.
- Server-Side Errors: There were a few other errors, including status codes like 500 (Internal Server Error), 502 (Bad Gateway), and 503 (Service Unavailable). These errors occurred when the server was unable to process the request.
===== Errors Summary =====
Decoding Errors: 3271
Robots.txt Restrictions: 2696
Status Code 404 Errors: 43611
Status Code 502 Errors: 72
Status Code 500 Errors: 48
Status Code 403 Errors: 478
Status Code 429 Errors: 1
Status Code 400 Errors: 583
Status Code 503 Errors: 29555
Other Errors: 124261
Retry Max Errors: 11994
Total Actual Skipped Links: 62050
The high number of “Other Errors” indicates that many issues were not easily categorized or could be related to generic connectivity or network problems that did not provide detailed error messages.
If an ambiguous error occurred (categorized as "Other Errors") or if the error was associated with status codes 500, 502, or 503, the request was retried after a set timeout delay. For 429 errors, the request was retried after the duration specified by the 'Retry-After' header.
Hence, the actual number of links that could not be fetched, out of all the status code errors and "Other Errors", is only the sum of the 403 errors, 404 errors, and "Retry Max Errors".
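A summary like the one above can be produced by grouping the crawl_errors table. The query below is a sketch, assuming the error strings follow the formats used in the listings.

# Sketch: aggregating logged errors into a summary after a crawl.
cursor.execute('''
    SELECT
        CASE
            WHEN error LIKE 'HTTP %'                    THEN error
            WHEN error LIKE 'Decoding error%'           THEN 'Decoding Errors'
            WHEN error = 'robots.txt restriction'       THEN 'Robots.txt Restrictions'
            WHEN error = 'Max retry attempts reached'   THEN 'Retry Max Errors'
            ELSE 'Other Errors'
        END AS category,
        COUNT(*) AS count
    FROM crawl_errors
    GROUP BY category
    ORDER BY count DESC
''')
for category, count in cursor.fetchall():
    print(f"{category}: {count}")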
3.4 Limitations
Most of the project's limitations arose from hardware and network constraints. To avoid IP blocks, the crawler's speed often had to be drastically reduced, increasing the time taken for each crawl. Due to time constraints and limited memory, the crawler was often manually halted after retrieving approximately 50,000 URLs to prevent system overload.
3.5 Considerations for Optimizing Web Crawling
Rate Limit Adjusting:
connector = aiohttp.TCPConnector(limit=100)
The rate limit used for most domains was at most 100 concurrent requests. Some domains required adjusting this limit: if too many HTTP 429 errors were encountered, the crawler was restarted with a lower limit.
Using a lower limit in general proved to be crucial for maintaining access to sites over long periods, even if it meant slowing the crawling speed. By balancing the number of concurrent requests, the crawler was able to maintain efficiency without violating server limits.
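aiohttp also exposes a per-host limit, which can be a gentler way to throttle a single domain than lowering the global connection limit. A sketch of this option (the values shown are illustrative):

# Sketch: capping connections per host in addition to the global limit.
import aiohttp

async def crawl_with_host_limit(urls):
    connector = aiohttp.TCPConnector(
        limit=100,          # total concurrent connections across all hosts
        limit_per_host=10,  # at most 10 simultaneous connections to any one domain
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        for url in urls:
            async with session.get(url) as response:
                print(url, response.status)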
Single-Page Applications (SPAs):
A significant challenge with modern websites is the prevalence of single-page applications (SPAs), which dynamically render content. Traditional crawlers may miss such dynamically generated content. To address this, Playwright was integrated to handle these types of websites, as it allows the crawler to simulate user interactions and load JavaScript-rendered pages, capturing links that would otherwise be missed.
async def scrape_all_links(start_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
Playwright allows the crawler to simulate a real browser, meaning it can interact with websites just like a human user would. While this reduces crawling speed due to the overhead of rendering pages and waiting for JavaScript content to load, it also significantly decreases the chances of being blocked by the server, as the crawler is harder to distinguish from a legitimate user.
While SPAs can present challenges for SEO if not implemented with proper considerations, they are not inherently bad practice. For this project, although a dynamic web crawler was implemented to handle some SPAs, many were excluded from the crawl due to their slow rendering of large amounts of dynamic content, which would have significantly impacted crawling efficiency.
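A fuller sketch of how the Playwright-based crawler can collect anchor links from a rendered page follows. Only the browser setup is reproduced above, so the selectors and waiting strategy here are assumptions.

# Sketch (assumed): collecting links from a JavaScript-rendered page with Playwright.
import asyncio
from playwright.async_api import async_playwright

async def scrape_links(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')  # wait for dynamic content to settle
        # Pull href attributes from all anchor tags after rendering.
        hrefs = await page.eval_on_selector_all(
            'a[href]', 'elements => elements.map(a => a.href)'
        )
        await browser.close()
        return set(hrefs)

# links = asyncio.run(scrape_links('https://example.com/'))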
Link Normalization and Duplication:
Some websites contained links to the same resource under different URLs (e.g., http://example.com and https://example.com/), which leads to duplicate crawling and storage.
To avoid such duplication, all URLs were normalized in a basic manner: fragments were removed, URL schemes were unified (only HTTPS was used), and trailing slashes were removed. Storing all unique crawled URLs in a database helped prevent re-crawling previously visited links.
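The normalize_url helper used throughout the listings is not shown in this report. A minimal sketch implementing the normalization described above (fragment removal, scheme unification, trailing-slash stripping) could look like this, though the real implementation may differ:

# Minimal sketch (assumed): URL normalization as described above.
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    parsed = urlparse(url)
    scheme = 'https'                          # unify schemes on HTTPS
    netloc = parsed.netloc.lower()
    path = parsed.path.rstrip('/') or '/'     # drop trailing slashes, keep the root path
    # Rebuild without the fragment; the query string is preserved.
    return urlunparse((scheme, netloc, path, parsed.params, parsed.query, ''))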
Error Handling and Retries:
When crawling, one of the most important aspects to watch for is errors. The crawlers often encountered a range of them: HTTP 404 (Not Found), 403 (Forbidden), server-side errors (500, 502), and connection timeouts.
It was important to implement robust error handling to deal with these issues. Depending on the error, the request was either retried after a delay or skipped. All errors were logged for analysis, and skipping unrecoverable errors ensured that the crawler remained operational.
4. Next Steps and Future Directions
The crawler’s efficiency could be increased with proxies and improved hardware. With proxies, the chances of being blocked drop drastically, allowing more concurrent requests and greatly reducing the time taken. Improved hardware would provide more memory, resulting in consistent performance even during large crawls.
Exploring the use of advanced crawling frameworks such as Scrapy could be beneficial as they offer robust features that could improve the efficiency and reliability of the crawler, although a thorough comparison of its performance and customizability is required.
With the collected data, a promising future direction is to analyze the relationships between different websites by constructing a graph network. Using text embedding and cosine similarity calculations, it is possible to quantify the similarity between websites based on their content. This could be valuable for understanding how different sites are interconnected, and uncovering possible hidden patterns in content sharing.
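As a rough illustration of that direction, similarities between sites could be computed from text embeddings of their pages. The sketch below uses TF-IDF vectors as a stand-in for embeddings and scikit-learn's cosine_similarity; the documents are placeholders.

# Sketch: quantifying content similarity between sites with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

site_texts = {
    'site_a': 'concatenated page text for site A ...',   # placeholder content
    'site_b': 'concatenated page text for site B ...',
    'site_c': 'concatenated page text for site C ...',
}

vectors = TfidfVectorizer().fit_transform(site_texts.values())
similarity = cosine_similarity(vectors)   # pairwise similarity matrix

for i, name in enumerate(site_texts):
    print(name, similarity[i].round(2))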