Installing textract for Python 3

I came across textract, a library for extracting text from various file formats. I wanted to use it to extract text from HTML files.

The installation page outlines the steps you need to perform. They can be found at the following URL:

I wanted to install it for Python 3.4 on Ubuntu 14.04, but it seemed to support only Python 2.x. Here is what I did to get it to install for Python 3.4.

Get your virtualenv set up first:

virtualenv -p /usr/bin/python3.4 /usr/local

1. Install the required libraries for Linux as outlined on the installation page.

2. Download the source file for textract from

3. Untar the downloaded file.

4. cd into the directory and look for cases of:

except ShellError, e:

and change each one to:

except ShellError as e:

5. Edit the requirements/python file and comment out the pdfminer line.


6. Install pdfminer3k, the Python 3 port of pdfminer:

pip install pdfminer3k

7. Finally, run:

python3.4 setup.py install

Everything should install at this point.
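
Step 4 is tedious to do by hand if there are many matches. Below is a minimal sketch of a script that could do the rewrite. The names fix_except_clauses and fix_tree are my own and not part of textract, and the pattern only covers the simple form "except SomeError, e:". Tuple forms such as "except (A, B), e:" are left alone and would still need a manual edit.

```python
import re
from pathlib import Path

# Matches Python 2 handlers such as "except ShellError, e:" and captures
# the leading "except", the exception name, and the bound variable.
# Assumption: only single-exception handlers are handled; tuples are skipped.
OLD_EXCEPT = re.compile(
    r"^([ \t]*except[ \t]+)([\w.]+)[ \t]*,[ \t]*(\w+)[ \t]*:",
    re.MULTILINE,
)


def fix_except_clauses(source):
    """Return source with 'except X, e:' rewritten as 'except X as e:'."""
    return OLD_EXCEPT.sub(r"\1\2 as \3:", source)


def fix_tree(root):
    """Rewrite every .py file under root in place and return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        with path.open(encoding="utf-8") as handle:
            original = handle.read()
        updated = fix_except_clauses(original)
        if updated != original:
            with path.open("w", encoding="utf-8") as handle:
                handle.write(updated)
            changed.append(path)
    return changed
```

Calling fix_tree(".") from inside the untarred directory would rewrite the files in place and return the list of files it touched, so it is worth keeping a copy of the original source until the install succeeds.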