Saturday, October 25, 2014

Tools for checking broken web links - part 2

Part 1 of this 2-part series on Linux link checking tools reviewed the tool linkchecker. This post concludes the series by presenting another tool, klinkstatus.

Unlike linkchecker, which has a command-line interface, klinkstatus is only available as a GUI tool. Installing klinkstatus on Debian/Ubuntu systems is as easy as:

$ sudo apt-get install klinkstatus

After installation, I could not locate klinkstatus in the GNOME menu system. No problem. To run the program, simply execute the klinkstatus command in a terminal window.
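
For example, one way is to launch it in the background so the terminal stays free for other work:

$ klinkstatus &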

For an initial exploratory test run, simply specify the starting URL for link checking in the top part of the screen (e.g., http://linuxcommando.blogspot.ca), and click the Start Search button.

You can pause link checking by clicking the Pause Search button, and review the latest results. To resume, click Pause Search again; to stop, click Stop Search.

Now that you have reviewed the initial results, you can customize subsequent checks in order to constrain the amount of output that you need to manually analyze and address afterward.

The program's user interface is very well designed. You can specify the common parameters right on the main screen. For instance, after the exploratory run, I want to exclude certain domains from link checking. To do that, enter the domain names in the Do not check regular expression field. Use the OR operator (the vertical bar '|') to separate multiple domains, e.g., google.com|blogger.com|digg.com.
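
If you want to sanity-check such a pattern before pasting it into the field, you can try it against a sample URL with grep's extended regex mode (grep is just a convenient stand-in here; klinkstatus applies the expression with its own regex handling). Any URL that prints back would be skipped by the checker:

$ echo "http://www.google.com/search?q=test" | grep -E 'google.com|blogger.com|digg.com'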

To customize a parameter that is not exposed on the main screen, click Settings, and then Configure KLinkStatus. There, you will find more parameters such as the number of simultaneous connections (threads) and the timeout threshold.

The link checking output is by default arranged in a tree view with the broken links highlighted in red. The tree structure allows you to quickly determine the location of the broken link with respect to your website.

You may choose to recheck a broken link to determine whether the problem is temporary. Right-click the link in the result pane and select Recheck.

Note that right-clicking a link brings up other options such as Open URL and Open Referrer URL. With these options, you can quickly view the context of the broken link. This feature would be very useful if it worked. Unfortunately, clicking either option fails with the error message: Unable to run the command specified. The file or folder http://.... does not exist. This turns out to be an unresolved klinkstatus bug. A workaround is to first click Copy URL (or Copy Referrer URL) in the right-click menu, and then paste the address into a web browser to open it manually.

The link checking output can be exported to an HTML file. Click File, then Export to HTML, and select whether to include All or just the Broken links.

Below is a final note to my fellow non-US bloggers (I'm blogging from Canada).

If I enter linuxcommando.blogspot.com as the starting URL, the search is immediately redirected to linuxcommando.blogspot.ca and stops there. To klinkstatus, blogspot.com and blogspot.ca are two different domains, and when the search reaches an "external" domain (blogspot.ca), it is programmed not to follow links from there. To correct the problem, I specify linuxcommando.blogspot.ca as the starting URL instead.
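
You can confirm the redirect from a terminal before starting a search. A quick header check with curl (not something klinkstatus needs, just a convenience) should show a Location: header pointing at the country-specific domain:

$ curl -sI http://linuxcommando.blogspot.com | grep -iE '^(HTTP|Location)'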

1 comment:

Anonymous said...

Nice site.

wget, curl, lynx are always nice alternatives too.
This bash script checks domains, but the regex could be changed to capture the entire web link.

declare -- url="${1:-http://linuxcommando.blogspot.com}"
declare -r RE='(https?://[^/]+/)'
declare -- line='' u='' response=''
declare -A urls=()

# Capture the domains referenced in the page.
while read -r line; do
    if [[ "${line}" =~ $RE ]]; then
        ((urls[${BASH_REMATCH[0]}]++))
    fi
done < <(lynx --dump "${url}")

# For each captured domain, dump its header and keep the first
# HTTP status or lynx Alert line as the response.
for u in "${!urls[@]}"; do
    while read -r line; do
        if [[ "${line}" == HTTP* ]]; then
            response="${line}"
            break
        elif [[ "${line}" == Alert* ]]; then
            response="${line}"
            break
        fi
    done < <(lynx -dump -head "${u}" 2>&1)

    # Print hit count, domain, and response, padded so the columns line up.
    printf -- "%03d %-40s %s\n" "${urls[${u}]}" "${u}" "${response}"
    response=''
done | sort --key=2.41
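
Assuming the script is saved as, say, check_domains.sh (the name is arbitrary) and made executable, it can be pointed at any starting page:

$ chmod +x check_domains.sh
$ ./check_domains.sh http://linuxcommando.blogspot.ca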