

Goose3 is a Python 3 fork of the python-goose library, so everything must be run with Python 3. All python commands in this guide assume that the correct Python version is in use.

Using pip

The easiest way to install goose3 is to use pip:

$ pip install goose3

From source

To install from source, clone the repository from GitHub, then run the following command from the cloned folder:

$ python setup.py install
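
After either install method, it can be useful to confirm that the interpreter is Python 3 and that the package is visible to it. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not something goose3 provides.

```python
import importlib.util
import sys


def check_environment(package="goose3"):
    """Report whether we run on Python 3 and whether `package` can be found.

    Hypothetical helper for illustration; it does not import the package,
    it only asks the import system whether a top-level module exists.
    """
    return {
        "python3": sys.version_info.major >= 3,
        "installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```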


Setting up Goose3 using the standard configuration is fairly straightforward:

from goose3 import Goose

g = Goose()
article = g.extract(url='http://this-url.html')

For extracting lots of HTML files or URLs, one can also use it as a context manager:

from goose3 import Goose

urls = [...]
with Goose() as g:
    for tmp in urls:
        article = g.extract(url=tmp)
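
When looping over many URLs, one failing page should usually not abort the whole batch. The sketch below is one possible way to wrap the loop. `extract_all` and `extract_with_goose` are hypothetical helpers, not part of goose3; the first assumes only that the extractor object has an `extract(url=...)` method as shown above. Catching a broad `Exception` is a simplification for illustration.

```python
def extract_all(extractor, urls):
    """Call extractor.extract(url=...) for each URL.

    Hypothetical helper (not part of goose3). Returns two dicts:
    results keyed by URL, and error messages keyed by URL.
    """
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = extractor.extract(url=url)
        except Exception as exc:  # broad on purpose: keep the batch going
            errors[url] = str(exc)
    return results, errors


def extract_with_goose(urls):
    """Use one Goose instance, as a context manager, for the whole batch."""
    # Imported lazily so extract_all can be used and tested without goose3.
    from goose3 import Goose

    with Goose() as g:
        return extract_all(g, urls)
```

Keeping the loop separate from the Goose object makes it possible to exercise the error handling with a stand-in extractor before pointing it at real pages.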

Setting Config Options

One can also alter how goose3 performs the extraction and what items are extracted by passing a configuration to Goose. There are several ways to set the configuration options.

For more details on available configuration settings, see Configuration

Use Configuration object

from goose3 import Goose
from goose3.configuration import Configuration

config = Configuration()
config.strict = False  # turn off strict exception handling
config.browser_user_agent = 'Mozilla 5.0'  # set the browser agent string
config.http_timeout = 5.05  # set http timeout in seconds

with Goose(config) as g:
    pass  # call g.extract(...) here

Use Dictionary

One can pass in a dictionary with keys that match the configuration properties one would like to change:

from goose3 import Goose

config = {}
config['strict'] = False  # turn off strict exception handling
config['browser_user_agent'] = 'Mozilla 5.0'  # set the browser agent string
config['http_timeout'] = 5.05  # set http timeout in seconds

with Goose(config) as g:
    pass  # call g.extract(...) here

Or if there are only a few changes:

from goose3 import Goose

with Goose({'http_timeout': 5.0}) as g:
    pass  # call g.extract(...) here
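
If configuration values come from several places (defaults, a settings file, command-line flags), it can help to merge them into one dictionary before handing it to Goose. This is a minimal sketch, assuming only the three option names used in this section. `DEFAULTS`, `build_config`, and `make_goose` are hypothetical names, not part of goose3, and the default values are illustrative. The helper knows only these three options, so extend `DEFAULTS` for any others.

```python
# Illustrative defaults; the option names are the ones used in this section.
DEFAULTS = {
    "strict": False,
    "browser_user_agent": "Mozilla 5.0",
    "http_timeout": 5.05,
}


def build_config(**overrides):
    """Return a new config dict: DEFAULTS updated with overrides.

    Hypothetical helper (not part of goose3). Option names missing from
    DEFAULTS raise KeyError so that typos are caught before the dict
    reaches Goose.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def make_goose(**overrides):
    """Create a Goose instance from the merged dictionary."""
    # Lazy import: build_config itself needs only the standard library.
    from goose3 import Goose

    return Goose(build_config(**overrides))
```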

After Object Creation

One can also change configuration options after the Goose object has been created:

from goose3 import Goose

g = Goose()
g.config.browser_user_agent = 'Mozilla 5.0'

Configuration Helper Classes

For some of the more complex configuration options, helper classes are available to ensure that the correct values are provided. Using them is optional, but it makes things a bit simpler.

from goose3 import Goose
from goose3.configuration import Configuration, ArticleContextPattern, PublishDatePattern, AuthorPattern

config = Configuration()

# we know of a particular article location in the site we are pulling from
config.known_context_patterns = ArticleContextPattern(attr="id", value="my-site-article")

# publish date
config.known_publish_date_tags = PublishDatePattern(attr="id", value="pubdate", content="content")

# author
config.known_author_patterns = AuthorPattern(attr="id", value="writer", content="content")

Reading Results

Results from the extraction are returned as an Article object. Reading a result is as simple as reading the corresponding property. The most commonly requested property is cleaned_text, which holds the text of the extracted article with the HTML formatting removed.

For more details and for all available properties, see Article

from goose3 import Goose

urls = [...]
with Goose() as g:
    for tmp in urls:
        article = g.extract(url=tmp)
        print(article.cleaned_text)  # plain text of the extracted article
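
Because results are plain properties on the returned object, they are easy to collect into a smaller structure for logging or storage. The sketch below is one way to do that. `summarize` is a hypothetical helper, not part of goose3. It reads `cleaned_text` as described above, and it assumes that the Article object also exposes a `title` property.

```python
def summarize(article, max_chars=200):
    """Return a small dict built from an Article-like object.

    Hypothetical helper (not part of goose3). Reads `cleaned_text` and
    `title` (the latter is an assumption), and falls back to an empty
    string when an attribute is missing or None.
    """
    title = getattr(article, "title", None) or ""
    text = getattr(article, "cleaned_text", None) or ""
    return {
        "title": title,
        "preview": text[:max_chars],  # first max_chars characters only
        "length": len(text),          # length of the full cleaned text
    }
```

Inside the loop above, this would be called as `summarize(article)` right after `g.extract(url=tmp)`.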

Using with PyInstaller

It should be possible to bundle goose3 into your executable program with tools such as PyInstaller. To do so, you will need to add the resource files that goose3 requires to the executable.

You will need to add the files to a folder in your executable called goose3/resources/, which is where goose3 looks for them. Note that the `;` separator in the --add-data arguments below is the Windows form; on Linux and macOS, PyInstaller uses `:` between source and destination.

pyinstaller \
    --add-data="goose3/resources/images/*;goose3/resources/images/" \
    --add-data="goose3/resources/text/*;goose3/resources/text/" \