I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but I'm being left with a lot of \xa0 characters (Unicode non-breaking spaces) in the output. Is there an efficient way in Python 2.7 to replace all of them with regular spaces? More generally, is there a way to remove Unicode formatting?
I tried line = line.replace(u'\xa0',' '), as suggested in another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead.
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() causes it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
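A minimal sketch of what is likely happening, assuming get_text() returns a unicode string: \xa0 is the code point U+00A0 (non-breaking space), and its UTF-8 encoding is the two bytes \xc2\xa0, which would explain the \xc2 that appears when encoding without replacing first. The sample string and the normalize_spaces helper below are hypothetical, made up for illustration; unicodedata.normalize is a standard-library call, used here as one possible way to handle the more general case.

```python
import unicodedata

# Hypothetical stand-in for what get_text() might return.
text = u'Hello\xa0world'

# Replace the non-breaking space while the value is still a unicode string.
cleaned = text.replace(u'\xa0', u' ')

# Encoding without replacing: U+00A0 becomes the two bytes \xc2\xa0 in UTF-8.
encoded_raw = text.encode('utf-8')

# Encoding after replacing: only plain ASCII bytes remain.
encoded_clean = cleaned.encode('utf-8')


def normalize_spaces(value):
    """Hypothetical helper: NFKC compatibility normalization maps
    U+00A0 to an ordinary space. It also rewrites other compatibility
    characters (ligatures, full-width forms), which may not be wanted."""
    return unicodedata.normalize('NFKC', value)


normalized = normalize_spaces(text)

print(repr(encoded_raw))
print(repr(encoded_clean))
print(repr(normalized))
```

Doing the replacement on the unicode string before encoding keeps the steps in a predictable order; a plain replace() touches only \xa0, while NFKC normalization is broader and may change other characters too.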