simple tag ignore hook for urlwatch

I recently set up urlwatch to alert me when some web pages I'm interested in change. It has a nice Pushbullet integration and is pretty easy to set up. Too easy, in fact. Pro tip: after configuring your preferred notification service and setting enabled: true, you're done. I spent a while faffing about thinking there had to be more to it. There isn't.

What I found, however, is that one of the pages I was monitoring had a dynamically generated <script> tag in it, which was triggering spurious notifications I wanted to suppress. There didn't seem to be an obvious way to ignore particular tags, so I created a simple hook to do this.

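from urlwatch import filters
from bs4 import BeautifulSoup


class IgnoreFilter(filters.FilterBase):
    """Remove every element matching a CSS selector before comparison."""

    __kind__ = 'ignore'

    def filter(self, data, subfilter=None):
        # Without a selector there is nothing to strip.
        if subfilter is None:
            return data

        soup = BeautifulSoup(data, 'html.parser')
        for element in soup.select(subfilter):
            # Detach the matching element from the parsed tree.
            element.extract()
        # Serialise back to text rather than returning the BeautifulSoup object.
        return str(soup)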

This adds a new filter type called ignore which accepts a CSS selector as its parameter. It then uses the magical BeautifulSoup HTML parser to find all the elements which match the selector and removes them before returning the remaining HTML.

Urlwatch then does its normal comparison against the previous run to see if anything has changed and carries on as usual.
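
To see why stripping the tag silences the spurious alerts, here is a rough standalone sketch. It is not urlwatch's own comparison code: strip_matching and changed_lines are hypothetical helpers, the two snapshots are invented, and difflib only stands in for whatever comparison urlwatch performs.

```python
import difflib

from bs4 import BeautifulSoup


def strip_matching(html, selector):
    """Hypothetical helper: drop every element matching the CSS selector."""
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup.select(selector):
        element.extract()
    return str(soup)


def changed_lines(old, new):
    """Hypothetical helper: unified diff of two snapshots as a list of lines."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=''))


# Two invented fetches that differ only in a generated token inside <script>.
previous = '<body><p>news</p><script>var t="abc123";</script></body>'
current = '<body><p>news</p><script>var t="xyz789";</script></body>'

raw_diff = changed_lines(previous, current)
filtered_diff = changed_lines(
    strip_matching(previous, 'script'),
    strip_matching(current, 'script'),
)

print(bool(raw_diff))   # the unfiltered snapshots differ
print(filtered_diff)    # the filtered snapshots are identical, so the diff is empty
```

With the script removed from both snapshots there is nothing left to report, which is the behaviour the filter is after.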

To use the filter, update your config like so, altering the CSS selector to suit your needs.

$ urlwatch --edit

---
name: "some site"
url: "https://something.com/"
filter: "ignore:body > script:nth-of-type(2)"
---

This ignores the second <script> tag beneath the <body>.
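
If you want to check what a selector matches before putting it in the config, something like the following sketch works outside urlwatch. The sample HTML is made up for illustration; only BeautifulSoup's select call is used.

```python
from bs4 import BeautifulSoup

# Invented page: two <script> tags directly under <body>, one nested in a <div>.
html = (
    '<body>'
    '<script>first()</script>'
    '<div><script>nested()</script></div>'
    '<script>second()</script>'
    '</body>'
)

soup = BeautifulSoup(html, 'html.parser')

# "body > script" only considers direct children of <body>, and
# nth-of-type(2) picks the second <script> among those siblings.
matches = soup.select('body > script:nth-of-type(2)')
matched_markup = [str(m) for m in matches]

for markup in matched_markup:
    print(markup)
```

The nested script is left alone because it is not a direct child of <body>, so it is worth running a quick check like this when the page structure is more complicated than the selector suggests.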