simple tag ignore hook for urlwatch
I recently set up urlwatch to alert me when some web pages I'm interested in change. It has a nice pushbullet integration and is pretty easy to set up. Too easy, in fact. Pro tip: after configuring your preferred notification service and setting enabled: true, you're done. I spent a while faffing about thinking there had to be more to it. There isn't.
What I found, however, is that one of the pages I was monitoring had a dynamically generated <script> tag in it, which was triggering spurious notifications I wanted to suppress. There didn't seem to be an obvious way to ignore particular tags, so I created a simple hook to do this.
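from urlwatch import filters
from bs4 import BeautifulSoup


class IgnoreFilter(filters.FilterBase):
    """Remove every element matching a CSS selector before the page is compared."""

    __kind__ = 'ignore'

    def filter(self, data, subfilter=None):
        # No selector given: pass the page through untouched.
        if subfilter is None:
            return data

        soup = BeautifulSoup(data, 'html.parser')
        # Drop each element that matches the selector.
        for element in soup.select(subfilter):
            element.extract()

        # Return the remaining HTML as a string rather than a BeautifulSoup object.
        return str(soup)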
This adds a new filter type called ignore, which accepts a CSS selector as its parameter. It then uses the magical BeautifulSoup HTML parser to find all the elements that match the selector and removes them before returning the remaining HTML.
Urlwatch then does its normal comparison against the previous run to see if anything has changed and carries on as usual.
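To see why stripping the tag silences the alerts, here's a rough sketch of that kind of comparison using Python's difflib. The two page snapshots and the strip_scripts helper are made up for illustration. The helper is only a crude regex stand-in for the BeautifulSoup filter above, and none of this is meant to show how urlwatch does its comparison internally, just the general idea.

```python
import difflib
import re

# Two hypothetical snapshots of the same page; only the generated script differs.
previous = '<body>\n<script>var token = "abc123";</script>\n<p>Opening hours: 9-5</p>\n</body>'
current = '<body>\n<script>var token = "xyz789";</script>\n<p>Opening hours: 9-5</p>\n</body>'


def strip_scripts(html):
    """Crude regex stand-in for the ignore filter: drop every <script> element."""
    return re.sub(r'<script\b.*?</script>', '', html, flags=re.DOTALL)


def changed_lines(old, new):
    """Return the unified diff between two snapshots as a list of lines."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=''))


# Unfiltered: the token change shows up in the diff, so a notification would fire.
raw_diff = changed_lines(previous, current)

# Filtered: both snapshots are identical once the script is gone, so the diff is empty.
filtered_diff = changed_lines(strip_scripts(previous), strip_scripts(current))

print('\n'.join(raw_diff))
print(filtered_diff)
```

The regex is only there to keep the sketch dependency-free. For real pages the CSS selector approach is the better bet, since it lets you target one specific tag instead of nuking every script on the page.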
To use the filter, update your config like so, altering the CSS selector to suit your needs.
$ urlwatch --edit

---
name: "some site"
url: "https://something.com/"
filter: "ignore:body > script:nth-of-type(2)"
---
This ignores the second
<script> tag beneath the