Goose3 API
Goose3
- class goose3.Goose(config: Configuration | dict | None = None)
Extract the most likely article content and additional metadata from a URL or a previously fetched HTML document
- Parameters:
config (Configuration, dict) – A configuration file or dictionary representation of the configuration file
- Returns:
An instance of the goose extraction object
- Return type:
Goose
- close()
Close the network connection and perform any other required cleanup
Note
Closed automatically when Goose is used as a context manager or when it is garbage collected
- extract(url: str | None = None, raw_html: str | None = None) -> Article
Extract the most likely article content from the html page
- Parameters:
url (str) – URL to pull and parse
raw_html (str) – String representation of the HTML page
- Returns:
Representation of the article contents including other parsed and extracted metadata
- Return type:
Article
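As a rough illustration of the class above, the sketch below wraps Goose in a small helper that parses an already-fetched HTML string. The helper name extract_text is hypothetical; only Goose, extract, and cleaned_text come from this reference. goose3 is imported inside the function so that the sketch can be loaded even where the package is not installed.

```python
from typing import Optional


def extract_text(raw_html: str, config: Optional[dict] = None) -> str:
    """Return the cleaned article text for an already-fetched HTML string.

    `extract_text` is a hypothetical helper, not part of goose3.
    """
    # Imported lazily so this sketch can be loaded without goose3 installed.
    from goose3 import Goose

    # Used as a context manager, Goose closes its network connection on exit.
    with Goose(config) as goose:
        article = goose.extract(raw_html=raw_html)
        return article.cleaned_text
```

Passing url=... instead of raw_html=... to extract would pull the page before parsing it. Outside a with block, call close() once extraction is finished.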
Configuration
Configuration options to change how and what goose3 extracts and parses.
- class goose3.Configuration
- property available_parsers: List[str]
A list of all possible parser values for the parser_class
Note
Not settable
- Type:
list(str)
- property browser_user_agent: str
Browser user agent string to use when making URL requests
Note
Defaults to Goose/{goose3.__version__}
Examples
Using a non-default browser user agent string is advised when pulling frequently
>>> config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)'
>>> config.browser_user_agent = 'AppleWebKit/534.52.7 (KHTML, like Gecko)'
>>> config.browser_user_agent = 'Version/5.1.2 Safari/534.52.7'
- property enable_image_fetching: bool
Turn on or off image extraction
Note
Defaults to False
- Type:
bool
- get_parser() -> Parser | ParserSoup | Any
Retrieve the current parser class to use for extraction
- Returns:
The parser to use
- Return type:
Parser
- property http_auth: tuple
Authentication class and information to pass to the requests library
See also
- Type:
tuple
- property http_headers: dict
Custom headers to pass directly to the supporting requests object
See also
- Type:
dict
- property http_proxies: dict
Proxy information to pass directly to the supporting requests object
See also
- Type:
dict
- property http_timeout
The time delay to pass to requests to wait for the response in seconds
Note
Defaults to 30.0
- Type:
float
- property imagemagick_convert_path: str
Path to the convert program that is part of imagemagick
Note
Defaults to “/opt/local/bin/convert”
Warning
Currently not used / implemented
- Type:
str
- property imagemagick_identify_path: str
Path to the identify program that is part of imagemagick
Note
Defaults to “/opt/local/bin/identify”
Warning
Currently not used / implemented
- Type:
str
- property images_min_bytes: int
Minimum number of bytes an image must have to be considered as the main image of the site
Note
Defaults to 4500 bytes
- Type:
int
- property keep_footnotes: bool
Specify if footnotes should be kept or not in the cleaned_text output
Note
Defaults to True
- Type:
bool
- property known_author_patterns: list
The tags to search to find the likely authors
Note
Each entry must be a dictionary with the following keys: attribute, value, and content.
- Type:
list
- property known_context_patterns: list
The context patterns to search to find the likely article content
Note
Each entry must be a dictionary with the following keys: attr and value or just tag
- Type:
list
- property known_publish_date_tags
The tags to search to find the likely published date
Note
Each entry must be a dictionary with the following keys: attribute, value, and content.
- Type:
list
- property local_storage_path
The local path to store temporary files
Note
Defaults to the value of os.path.join(tempfile.gettempdir(), 'goose')
- Type:
str
- property parse_headers: bool
Specify if headers should be pulled or not in the cleaned_text output
Note
Defaults to True
- Type:
bool
- property parser_class: str
The key of the parser to use
Note
Defaults to lxml
- Type:
str
- property pretty_lists: bool
Specify if lists should be pretty printed in the cleaned_text output
Note
Defaults to True
- Type:
bool
- property stopwords_class: Type[StopWords]
The StopWords class to use when analyzing article content
Note
Defaults to the english stop words
Note
Current stop words available in goose3.text include:
StopWords, StopWordsChinese, StopWordsArabic, and StopWordsKorean
- Type:
StopWords
- property strict: bool
Enable strict mode and raise exceptions instead of swallowing them.
Note
Defaults to True
- Type:
bool
- property target_language: str
The default target language if the language is not extractable or if use_meta_language is set to False
Note
Default language is ‘en’
- Type:
str
- property use_meta_language: bool
Determine if language should be extracted from the meta tags or not. If this is set to False then the target_language will be used. Also, if extraction fails then the target_language will be utilized.
Note
Defaults to True
- Type:
bool
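Because Goose accepts a dictionary representation of the configuration, a misspelled key is easy to miss. The sketch below builds such a dictionary and rejects names that do not appear in the property list above. The helper build_config and the constant SETTABLE_OPTIONS are hypothetical, and the header value is only an example; the imagemagick paths are left out because they are documented as unused.

```python
# Property names taken from the Configuration reference above.
SETTABLE_OPTIONS = frozenset({
    "browser_user_agent", "enable_image_fetching", "http_auth", "http_headers",
    "http_proxies", "http_timeout", "images_min_bytes", "keep_footnotes",
    "known_author_patterns", "known_context_patterns", "known_publish_date_tags",
    "local_storage_path", "parse_headers", "parser_class", "pretty_lists",
    "stopwords_class", "strict", "target_language", "use_meta_language",
})


def build_config(**options) -> dict:
    """Return a config dict, rejecting names that are not documented above.

    `build_config` is a hypothetical helper, not part of goose3.
    """
    unknown = sorted(set(options) - SETTABLE_OPTIONS)
    if unknown:
        raise ValueError(f"unknown configuration option(s): {', '.join(unknown)}")
    return dict(options)


config = build_config(
    browser_user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)",
    http_timeout=10.0,                       # seconds; documented default is 30.0
    http_headers={"Accept-Language": "en"},  # example header passed to requests
    use_meta_language=False,                 # always use target_language
    target_language="en",
)
```

The resulting dictionary can then be handed to Goose(config) in place of a Configuration instance.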
Configuration Helper Classes
- class goose3.configuration.ArticleContextPattern(*, attr=None, value=None, tag=None, domain=None)
Help ensure correctly generated article context patterns
- Parameters:
attr (str) – The attribute type: class, id, etc
value (str) – The value of the attribute
tag (str) – The type of tag, such as article that contains the main article body
domain (str) – The domain to which this pattern pertains (optional)
Note
Must provide, at a minimum, (attr and value) or (tag)
- class goose3.configuration.AuthorPattern(*, attr=None, value=None, content=None, tag=None, subpattern=None)
Ensures that the author patterns are correctly formed for use with the known_author_patterns of configuration
- Parameters:
attr (str) – The attribute type: class, id, etc
value (str) – The value of the attribute
content (str) – The name of another attribute (of the element) that contains the value
tag (str) – The type of tag, such as author that contains the author information
subpattern (str) – A subpattern for elements within the main attribute
- class goose3.configuration.PublishDatePattern(*, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None)
Ensure correctly formed publish date patterns; to be used in conjunction with the configuration known_publish_date_tags property
- Parameters:
attr (str) – The attribute type: class, id, etc
value (str) – The value of the attribute
content (str) – The name of another attribute (of the element) that contains the value
subcontent (str) – The name of a json object key (optional)
tag (str) – The type of tag, such as time that contains the publish date
domain (str) – The domain to which this pattern pertains (optional)
Note
Must provide, at a minimum, (attr and value) or (tag)
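The "(attr and value) or (tag)" rule noted for ArticleContextPattern and PublishDatePattern can be checked before any pattern object is built. The sketch below is one way to express that rule in plain Python; the function has_minimum_fields and the example attribute values are hypothetical and are not part of goose3.

```python
from typing import Optional


def has_minimum_fields(attr: Optional[str] = None,
                       value: Optional[str] = None,
                       tag: Optional[str] = None) -> bool:
    """Check the '(attr and value) or (tag)' rule from the notes above."""
    return bool(attr and value) or bool(tag)


candidates = [
    {"attr": "class", "value": "post-body"},  # attr + value: acceptable
    {"tag": "article"},                       # tag alone: acceptable
    {"attr": "id"},                           # attr without value: rejected
]
valid = [c for c in candidates if has_minimum_fields(**c)]
```

Entries that pass the check could then be given as keyword arguments to one of the helper classes, for example ArticleContextPattern(attr="class", value="post-body").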
Article
A goose3 extraction returns an Article object that contains the results of the parsing process.
- class goose3.Article
- property additional_data
A property bucket for consumers of goose3 to store custom data extractions
Note
Read only
- Type:
dict
- property authors
A listing of authors as parsed from the meta tags
Note
Read only
- Type:
list(str)
- property canonical_link
The canonical link of the article if found in the meta data
Note
Read only
- Type:
str
- property cleaned_text
Cleaned text of the article without HTML tags; most commonly desired property
Note
Read only
- Type:
str
- property doc
lxml document that is being processed
Note
Read only
- Type:
etree
- property domain
Domain of the article parsed
Note
Read only
- Type:
str
- property final_url
The URL that was pulled and parsed; None if raw_html was used and no url element was found.
Note
Read only
- Type:
str
- property infos
The summation of all data available about the extracted article
Note
Read only
- Type:
dict
- property link_hash
The hash of the final url to be used for various identification tasks
Note
Read only
- Type:
str
- property links
A listing of URL links within the article
Note
Read only
- Type:
list(str)
- property meta_description
Contents of the meta-description field from the HTML source
Note
Read only
- Type:
str
- property meta_encoding
Contents of the encoding/charset field from the HTML source
Note
Read only
- Type:
str
- property meta_favicon
Contents of the meta-favicon field from the HTML source
Note
Read only
- Type:
str
- property meta_keywords
Contents of the meta-keywords field from the HTML source
Note
Read only
- Type:
str
- property meta_lang
Contents of the meta-lang field from the HTML source
Note
Read only
- Type:
str
- property movies
A listing of all videos within the article such as YouTube or Vimeo
- Returns:
See more information on the goose3.Video class
- Return type:
list(Video)
Note
Read only
- Type:
list(Video)
- property opengraph
All opengraph tag data
Note
Read only
- Type:
dict
- property publish_date
The date the article was published based on meta tag extraction
Note
Read only
- Type:
str
- property publish_datetime_utc
The date time version of the published date based on meta tag extraction in the UTC timezone, if timezone information is known
Note
Read only
- Type:
datetime.datetime
- property raw_doc
Original, uncleaned, and untouched lxml document to be processed
Note
Read only
- Type:
etree
- property raw_html
The HTML represented as a string
Note
Read only
- Type:
str
- property schema
All schema tag data
Note
Read only
- Type:
dict
- property tags
List of article tags (non-metadata tags)
Note
Read only
- Type:
list(str)
- property title
Title extracted from the HTML source
Note
Read only
- Type:
str
- property top_image
The top image object that likely represents the article
- Returns:
See more information on the goose3.Image class
- Return type:
Image
Note
Read only
- Type:
Image
- property top_node
The top Element that is a candidate for the main body of the article
Note
Read only
- Type:
etree
- property top_node_raw_html
The uncleaned HTML of the top Element that is a candidate for the main body of the article
Note
Read only
- Type:
etree
- property tweets
A listing of embedded tweets in the article
Note
Read only
- Type:
list(str)
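To show how several of the read-only properties above might be consumed together, the sketch below gathers a few of them into a plain dictionary. The function summarize is hypothetical and works on any object exposing these attributes; a SimpleNamespace stands in for a real Article so the example runs without goose3. It assumes top_image may be None when no image was selected.

```python
from types import SimpleNamespace


def summarize(article) -> dict:
    """Collect a few documented Article properties into a plain dict.

    `summarize` is a hypothetical helper, not part of goose3.
    """
    top_image = article.top_image
    return {
        "title": article.title,
        "authors": list(article.authors),
        "published": article.publish_date,
        # Assumption: top_image is None when no main image was found.
        "image_src": top_image.src if top_image is not None else None,
        "preview": article.cleaned_text[:80],
    }


# Stand-in object for demonstration; a real Article comes from Goose.extract().
fake = SimpleNamespace(
    title="Example title",
    authors=["A. Writer"],
    publish_date="2020-01-01",
    top_image=None,
    cleaned_text="Body text of the article.",
)
summary = summarize(fake)
```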
Image
- class goose3.Image
- property bytes
The size of the image in bytes
Note
Read only
- Type:
int
- property confidence_score
The confidence score that this is the main image
Note
Read only
- Type:
float
- property extraction_type
The extraction type used
Note
Read only
- Type:
str
- property height
The image height in pixels
Note
Read only
- Type:
int
- property src
Source URL for the image
Note
Read only
- Type:
str
- property top_image_node
The most likely top image element node
Note
Read only
- Type:
etree
- property width
The image width in pixels
Note
Read only
- Type:
int
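The bytes property can be compared against the images_min_bytes configuration value, whose documented default is 4500. The sketch below shows one such check on an Image-like object; meets_min_bytes is a hypothetical helper, and the exact comparison goose3 applies internally is not specified in this reference.

```python
from types import SimpleNamespace


def meets_min_bytes(image, min_bytes: int = 4500) -> bool:
    """Return True when the image is at least `min_bytes` in size.

    Illustrative only; not goose3's own image selection logic.
    """
    return image.bytes >= min_bytes


# Stand-in objects exposing the documented Image properties used here.
large = SimpleNamespace(src="https://example.com/a.jpg", bytes=20000)
small = SimpleNamespace(src="https://example.com/b.gif", bytes=900)
kept = [img.src for img in (large, small) if meets_min_bytes(img)]
```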
Video
- class goose3.Video
Video object
- property embed_code
The embed code of the video
Note
Read only
- Type:
str
- property embed_type
The type of embedding such as embed, object, or iframe
Note
Read only
- Type:
str
- property height
The video height in pixels
Note
Read only
- Type:
int
- property provider
The video provider
Note
Read only
- Type:
str
- property src
The URL source of the video
Note
Read only
- Type:
str
- property width
The video width in pixels
Note
Read only
- Type:
int
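As a small illustration of the Video properties above, the sketch below turns a list of Video-like objects, such as the one returned by the Article movies property, into short text descriptions. The function describe_videos is hypothetical, and the sample values are invented for demonstration.

```python
from types import SimpleNamespace


def describe_videos(videos) -> list:
    """Return one short description per Video-like object."""
    lines = []
    for video in videos:
        lines.append(
            f"{video.provider}: {video.src} "
            f"({video.width}x{video.height}, {video.embed_type})"
        )
    return lines


# Stand-in for an entry of article.movies; values are made up.
sample = SimpleNamespace(
    provider="youtube",
    src="https://www.youtube.com/embed/abc",
    width=560,
    height=315,
    embed_type="iframe",
)
descriptions = describe_videos([sample])
```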