Installing textract for Python 3

I came across textract, a library for extracting text from various file formats. I wanted to use it to extract text from HTML files. Here is what I did to get it installed on my machine.

The installation documentation outlines the steps you need to perform. It can be found at the following URL:

http://textract.readthedocs.org/en/latest/installation.html

I wanted to install it for Python 3.4 on Ubuntu 14.04, but it seemed to support only Python 2.x. Here is what I did to get it to install for Python 3.4.

Get your virtualenv set up first:

virtualenv -p /usr/bin/python3.4 /usr/local

Install the required libraries for Linux as outlined on the installation page.
Download the source file for textract from:

https://pypi.python.org/pypi/textract

Untar the downloaded file.
Change into the extracted directory and look for occurrences of:

except ShellError, e:

and change each one to:

except ShellError as e:
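If there are many files to touch, the search-and-replace above can be scripted. What follows is only a rough sketch, not part of textract: the names fix_except_syntax and fix_file are my own, and the pattern assumes handlers with a single exception name, like the ShellError one shown. Anything more complicated is left alone and should be checked by hand.

```python
import re

# Matches Python 2 handlers with a single exception name,
# e.g. "except ShellError, e:". Tuple forms such as
# "except (A, B), e:" are deliberately left untouched.
OLD_EXCEPT = re.compile(
    r"^([ \t]*except[ \t]+[\w.]+)[ \t]*,[ \t]*(\w+)[ \t]*:",
    re.MULTILINE,
)


def fix_except_syntax(source):
    """Rewrite 'except Name, var:' as 'except Name as var:'."""
    return OLD_EXCEPT.sub(r"\1 as \2:", source)


def fix_file(path):
    """Rewrite one file in place; return True if it was changed."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    updated = fix_except_syntax(original)
    if updated != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(updated)
    return updated != original
```

Calling fix_file on each .py file in the extracted source directory should cover the cases described here. The 2to3 tool that ships with Python also has an "except" fixer that does the same job more thoroughly.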

Edit the requirements/python file and comment out:

pdfminer==20140328
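The same edit can be done from a script if you prefer. This is a minimal sketch under my own assumptions: comment_out_requirement is a hypothetical helper, and it expects the file to pin packages one per line in the name==version form shown above.

```python
def comment_out_requirement(text, package):
    """Prefix with '# ' every line that pins the given package."""
    lines = []
    for line in text.splitlines(keepends=True):
        # The package name is whatever precedes '==' on the line.
        name = line.split("==")[0].strip()
        if name == package:
            line = "# " + line
        lines.append(line)
    return "".join(lines)
```

Read requirements/python, pass its contents through comment_out_requirement(text, "pdfminer"), and write the result back to the same file.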

Install the Python 3 equivalent:

pip install pdfminer3k

Finally, run:

python3.4 setup.py install

Everything should install at this point.