    # -*- coding: utf-8 -*-
    import re

    htmlString = '</dd><dt> Fine, thank you. </dt><dd> Molt bé, gràcies. (<i>mohl behh, GRAH-syuhs</i>)'
    SearchStr = '(\<\/dd\>\<dt\>)+ ([\w+\,\.\s]+)([\&\#\d\;]+)(\<\/dt\>\<dd\>)+ ([\w\,\s\w\s\w\?\!\.]+) (\(\<i\>)([\w\s\,\-]+)(\<\/i\>\))'

    # Decode both byte strings to Unicode and match with re.U so that \w also covers é and à
    Result = re.search(SearchStr.decode('utf-8'), htmlString.decode('utf-8'), re.I | re.U)
    print Result.groups()

It works that way. The text contains non-ASCII characters (é, à), so matching against the raw byte strings usually fails. You have to decode the strings into Unicode and pass the re.U (Unicode) flag.
I’m a beginner too and I faced that issue a couple of times myself.
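
For anyone on Python 3, here is a rough sketch of the same idea, not a definitive version. There, str is already Unicode and \w is Unicode-aware by default, so no decode() or re.U is needed. The names html_string and pattern are mine, and the &#160; entity in the input is an assumption: the third group of the pattern needs something like it before </dt>, and I am guessing the original page had one.

```python
import re

# Assumption: the source page had a numeric entity (here &#160;) before </dt>.
html_string = ('</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies. '
               '(<i>mohl behh, GRAH-syuhs</i>)')

# Same groups as the pattern above, with the unnecessary escapes dropped.
pattern = re.compile(
    r'(</dd><dt>)+ ([\w+,.\s]+)([&#\d;]+)(</dt><dd>)+ '
    r'([\w,\s?!.]+) (\(<i>)([\w\s,\-]+)(</i>\))',
    re.I,
)

match = pattern.search(html_string)
english = match.group(2).strip()   # phrase in English
catalan = match.group(5)           # phrase in Catalan, accents included
pronunciation = match.group(7)     # pronunciation hint

print(english, '|', catalan, '|', pronunciation)

# Why decoding matters: on raw UTF-8 bytes, \w only matches ASCII,
# so the match stops at the first accented letter.
bytes_match = re.match(rb'\w+', 'bé'.encode('utf-8')).group()
text_match = re.match(r'\w+', 'bé').group()
print(bytes_match, text_match)
```

The last two lines show the actual problem: on bytes the match stops right before é, while on Unicode text the whole word is matched.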
