Goose3 API

Goose3

class goose3.Goose(config: Configuration | dict | None = None)[source]

Extract the most likely article content and additional metadata from a URL or previously fetched HTML document

Parameters:

config (Configuration, dict) – A Configuration object or a dictionary representation of the configuration

Returns:

An instance of the goose extraction object

Return type:

Goose

close()[source]

Close the network connection and perform any other required cleanup

Note

Auto closed when using goose as a context manager or when garbage collected

extract(url: str | None = None, raw_html: str | None = None) Article[source]

Extract the most likely article content from the HTML page

Parameters:
  • url (str) – URL to pull and parse

  • raw_html (str) – String representation of the HTML page

Returns:

Representation of the article contents including other parsed and extracted metadata

Return type:

Article

shutdown_network()[source]

Close the network connection

Note

Auto closed when using goose as a context manager or when garbage collected

Configuration

Configuration options to change how and what goose3 extracts and parses.

class goose3.Configuration[source]
property available_parsers: List[str]

A list of all possible parser values for the parser_class

Note

Not settable

Type:

list(str)

property browser_user_agent: str

Browser user agent string to use when making URL requests

Note

Defaults to Goose/{goose3.__version__}

Examples

Using a non-default browser user agent string is advised when pulling pages frequently

>>> config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)'
>>> config.browser_user_agent = 'AppleWebKit/534.52.7 (KHTML, like Gecko)'
>>> config.browser_user_agent = 'Version/5.1.2 Safari/534.52.7'
property enable_image_fetching: bool

Turn on or off image extraction

Note

Defaults to False

Type:

bool

get_parser() Parser | ParserSoup | Any[source]

Retrieve the current parser class to use for extraction

Returns:

The parser to use

Return type:

Parser

property http_auth: tuple

Authentication class and information to pass to the requests library

Type:

tuple

property http_headers: dict

Custom headers to pass directly to the supporting requests object

Type:

dict

property http_proxies: dict

Proxy information to pass directly to the supporting requests object

Type:

dict

property http_timeout

The time delay to pass to requests to wait for the response in seconds

Note

Defaults to 30.0

Type:

float

property imagemagick_convert_path: str

Path to the convert program that is part of imagemagick

Note

Defaults to “/opt/local/bin/convert”

Warning

Currently not used / implemented

Type:

str

property imagemagick_identify_path: str

Path to the identify program that is part of imagemagick

Note

Defaults to “/opt/local/bin/identify”

Warning

Currently not used / implemented

Type:

str

property images_min_bytes: int

Minimum number of bytes for an image to be evaluated to be the main image of the site

Note

Defaults to 4500 bytes

Type:

int

property keep_footnotes: bool

Specify if footnotes should be kept or not in the cleaned_text output

Note

Defaults to True

Type:

bool

property known_author_patterns: list

The patterns to search to find the likely author(s)

Note

Each entry must be a dictionary with the following keys: attribute, value, and content.

Type:

list

property known_context_patterns: list

The context patterns to search to find the likely article content

Note

Each entry must be a dictionary with the following keys: attr and value or just tag

Type:

list

property known_publish_date_tags

The tags to search to find the likely published date

Note

Each entry must be a dictionary with the following keys: attribute, value, and content.

Type:

list

property local_storage_path

The local path to store temporary files

Note

Defaults to the value of os.path.join(tempfile.gettempdir(), ‘goose’)

Type:

str

property parse_headers: bool

Specify if headers should be pulled or not in the cleaned_text output

Note

Defaults to True

Type:

bool

property parser_class: str

The key of the parser to use

Note

Defaults to lxml

Type:

str

property pretty_lists: bool

Specify if lists should be pretty printed in the cleaned_text output

Note

Defaults to True

Type:

bool

property stopwords_class: Type[StopWords]

The StopWords class to use when analyzing article content

Note

Defaults to the english stop words

Note

Current stop words available in goose3.text include:

StopWords, StopWordsChinese, StopWordsArabic, and StopWordsKorean

Type:

StopWords

property strict: bool

Enable strict mode and throw exceptions instead of swallowing them.

Note

Defaults to True

Type:

bool

property target_language: str

The default target language if the language is not extractable or if use_meta_language is set to False

Note

Default language is ‘en’

Type:

str

property use_meta_language: bool

Determine if language should be extracted from the meta tags or not. If this is set to False then the target_language will be used. Also, if extraction fails then the target_language will be utilized.

Note

Defaults to True

Type:

bool

Configuration Helper Classes

class goose3.configuration.ArticleContextPattern(*, attr=None, value=None, tag=None, domain=None)[source]

Help ensure correctly generated article context patterns

Parameters:
  • attr (str) – The attribute type: class, id, etc

  • value (str) – The value of the attribute

  • tag (str) – The type of tag, such as article that contains the main article body

  • domain (str) – The domain to which this pattern pertains (optional)

Note

Must provide, at a minimum, (attr and value) or (tag)

class goose3.configuration.AuthorPattern(*, attr=None, value=None, content=None, tag=None, subpattern=None)[source]

Ensures that the author patterns are correctly formed for use with the known_author_patterns of configuration

Parameters:
  • attr (str) – The attribute type: class, id, etc

  • value (str) – The value of the attribute

  • content (str) – The name of another attribute (of the element) that contains the value

  • tag (str) – The type of tag, such as author that contains the author information

  • subpattern (str) – A subpattern for elements within the main attribute

class goose3.configuration.PublishDatePattern(*, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None)[source]

Ensure correctly formed publish date patterns; to be used in conjunction with the configuration known_publish_date_tags property

Parameters:
  • attr (str) – The attribute type: class, id, etc

  • value (str) – The value of the attribute

  • content (str) – The name of another attribute (of the element) that contains the value

  • subcontent (str) – The name of a json object key (optional)

  • tag (str) – The type of tag, such as time that contains the publish date

  • domain (str) – The domain to which this pattern pertains (optional)

Note

Must provide, at a minimum, (attr and value) or (tag)

Article

A goose3 extraction returns an Article object that contains the results of the parsing process.

class goose3.Article[source]
property additional_data

A property bucket for consumers of goose3 to store custom data extractions

Note

Read only

Type:

dict

property authors

A listing of authors as parsed from the meta tags

Note

Read only

Type:

list(str)

property canonical_link

The canonical link of the article if found in the meta data

Note

Read only

Type:

str

property cleaned_text

Cleaned text of the article without HTML tags; most commonly desired property

Note

Read only

Type:

str

property doc

lxml document that is being processed

Note

Read only

Type:

etree

property domain

Domain of the article parsed

Note

Read only

Type:

str

property final_url

The URL that was pulled and parsed; None if raw_html was used and no url element was found.

Note

Read only

Type:

str

property infos

The summation of all data available about the extracted article

Note

Read only

Type:

dict

property link_hash

The hash of the final url to be used for various identification tasks

Note

Read only

Type:

str

property links

A listing of URL links within the article

Note

Read only

Type:

list(str)

property meta_description

Contents of the meta-description field from the HTML source

Note

Read only

Type:

str

property meta_encoding

Contents of the encoding/charset field from the HTML source

Note

Read only

Type:

str

property meta_favicon

Contents of the meta-favicon field from the HTML source

Note

Read only

Type:

str

property meta_keywords

Contents of the meta-keywords field from the HTML source

Note

Read only

Type:

str

property meta_lang

Contents of the meta-lang field from the HTML source

Note

Read only

Type:

str

property movies

A listing of all videos within the article such as YouTube or Vimeo

Returns:

See more information on the goose3.Video class

Return type:

list(Video)

Note

Read only

Type:

list(Video)

property opengraph

All opengraph tag data

Note

Read only

Type:

dict

property publish_date

The date the article was published based on meta tag extraction

Note

Read only

Type:

str

property publish_datetime_utc

The date time version of the published date based on meta tag extraction in the UTC timezone, if timezone information is known

Note

Read only

Type:

datetime.datetime
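The two publish date properties can be combined in a small helper such as the sketch below. The function name published_label is hypothetical, and it assumes both properties may be empty or None when no date could be extracted.

```python
from typing import Optional


def published_label(article) -> Optional[str]:
    """Return a display string for the article's publish date, if any."""
    dt = article.publish_datetime_utc
    if dt is not None:
        # Prefer the parsed datetime when it is available.
        return dt.strftime("%Y-%m-%d %H:%M")
    # Otherwise fall back to the raw string taken from the page.
    return article.publish_date or None
```

The helper only reads the two attributes, so it accepts any Article returned by extract().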

property raw_doc

Original, uncleaned, and untouched lxml document to be processed

Note

Read only

Type:

etree

property raw_html

The HTML represented as a string

Note

Read only

Type:

str

property schema

All schema tag data

Note

Read only

Type:

dict

property tags

List of article tags (non-metadata tags)

Note

Read only

Type:

list(str)

property title

Title extracted from the HTML source

Note

Read only

Type:

str

property top_image

The top image object that likely represents the article

Returns:

See more information on the goose3.Image class

Return type:

Image

Note

Read only

Type:

Image

property top_node

The top Element that is a candidate for the main body of the article

Note

Read only

Type:

etree

property top_node_raw_html

The uncleaned HTML of the top Element that is a candidate for the main body of the article

Note

Read only

Type:

etree

property tweets

A listing of embedded tweets in the article

Note

Read only

Type:

list(str)

Image

class goose3.Image[source]
property bytes

The size of the image in bytes

Note

Read only

Type:

int

property confidence_score

The confidence score that this is the main image

Note

Read only

Type:

float

property extraction_type

The extraction type used

Note

Read only

Type:

str

property height

The image height in pixels

Note

Read only

Type:

int

property src

Source URL for the image

Note

Read only

Type:

str

property top_image_node

The most likely top image element node

Note

Read only

Type:

etree

property width

The image width in pixels

Note

Read only

Type:

int
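As an illustration of reading the Image properties above, the following sketch formats a one-line summary of an article's top image. The function name describe_top_image is hypothetical, and it assumes top_image is None when no image was extracted (for example when enable_image_fetching is left at its default).

```python
def describe_top_image(article) -> str:
    """Summarize the article's top image using its documented properties."""
    image = article.top_image
    if image is None:
        return "no top image"
    return f"{image.src} ({image.width}x{image.height}, {image.bytes} bytes)"
```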

Video

class goose3.Video[source]

Video object

property embed_code

The embed code of the video

Note

Read only

Type:

str

property embed_type

The type of embedding such as embed, object, or iframe

Note

Read only

Type:

str

property height

The video height in pixels

Note

Read only

Type:

int

property provider

The video provider

Note

Read only

Type:

str

property src

The URL source of the video

Note

Read only

Type:

str

property width

The video width in pixels

Note

Read only

Type:

int
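The Video objects in an article's movies list can be grouped by their provider property, as in this sketch. The function name videos_by_provider is hypothetical.

```python
from collections import Counter


def videos_by_provider(article) -> Counter:
    """Count the videos found in an article, keyed by provider."""
    return Counter(video.provider for video in article.movies)
```

For an article with no embedded videos this returns an empty Counter.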