A Python Script for Recursive Search-And-Replace

Posted by fulvio at Jul 18, 2012 10:55 AM |
Filed under:

I wrote a little Python script to solve a find and replace problem:
The problem was that I had a directory tree with several thousand files, about 2000 of which were static HTML (yeah, don't ask....I blame my predecessor for this), with the typical Google Analytics tracking code, e.g.:

<script type="text/javascript">

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-11111111-1']);
_gaq.push(['_trackPageview']);

(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();

</script>

 

Now my client started worrying about the privacy of their users, and asked me to remove all these snippets.  By the way, In Europe it will soon become illegal to use Google Analytics without asking visitors whether they allow tracking them.

I wrote a regex that would capture this multi-line snippet easily.
In addition, several HTML files also had event handlers calling the GA tracking script, e.g:

<a href="crime_and_punishment_jp.pdf" onMouseDown="javascript: _gaq.push(['_trackEvent', 'Trans
lations', 'Downloaded PDF', 'Crime and Punishment vol 2 (Japanese)']); ">

The following script works fine, and takes only a few seconds to traverse several thousand files.
Here is my script:  I ran it in python 2.5, hence I didn't have the "with" statement available.

from os import walk
from os.path import join
from re import compile

def handler(error):
print error

script_pattern=compile(r'<script[^<]*_gaq[^<]*</script>')
event_pattern=compile(r'onMouseDown\s*=\s*"\s*javascript:\s*_gaq.push[^"]+"')

modified = []
event = []

for root, dirs, files in walk('myfolder', onerror=handler):
for name in files:
path = join(root, name)
file = open(path, 'r+b')
all = file.read()
file.seek(0)
if script_pattern.search(all):
modified.append(path)
fixed = script_pattern.sub('', all)
if event_pattern.search(fixed):
event.append(path)
fixed = event_pattern.sub('', fixed)
file.write(fixed)
file.truncate()
file.close()

for i in modified: print i
print len(modified)

for i in event: print i
print len(event)
Filed under: