regparser.layer package

Submodules

regparser.layer.def_finders module

Parsers for finding a term that’s being defined within a node

class regparser.layer.def_finders.DefinitionKeyterm(parent)[source]

Bases: object

Matches definitions identified by being a first-level paragraph in a section with a specific title

find(node)[source]
class regparser.layer.def_finders.ExplicitIncludes[source]

Bases: regparser.layer.def_finders.FinderBase

Definitions can be explicitly included in the settings. For example, say that a paragraph doesn’t indicate that a certain phrase is a definition; we can define INCLUDE_DEFINITIONS_IN in our settings file, which will be checked here.

find(node)[source]
class regparser.layer.def_finders.FinderBase[source]

Bases: object

Base class for all of the definition finder classes. Defines the interface they must implement

find(node)[source]

Given a Node, pull out any definitions it may contain as a list of Refs
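The find interface can be sketched in plain Python. Here `Ref` is a simplified stand-in (the real class also tracks where the term was found), nodes are assumed to be dicts with `text` and `label` keys, and `MeansFinder` is a hypothetical example finder, not one of the real subclasses:

```python
import re
from collections import namedtuple

# Simplified stand-in for regparser.layer.def_finders.Ref
Ref = namedtuple('Ref', ['term', 'label', 'start'])

class FinderBase(object):
    """Interface all definition finders implement."""
    def find(self, node):
        """Given a node, return a list of Refs for definitions it contains."""
        raise NotImplementedError

class MeansFinder(FinderBase):
    """Hypothetical finder: matches 'XXX means ...' at paragraph start."""
    REGEX = re.compile(r'^(?P<term>[A-Za-z ]+?) means ')

    def find(self, node):
        match = self.REGEX.match(node['text'])
        if match:
            return [Ref(match.group('term'), node['label'],
                        match.start('term'))]
        return []
```

For example, `MeansFinder().find({'text': 'State law means ...', 'label': '1005-2-a'})` would yield one `Ref` for "State law".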

class regparser.layer.def_finders.Ref[source]

Bases: regparser.layer.def_finders.Ref

A reference to a defined term. Keeps track of the term, where it was found and the term’s position in that node’s text

end
position
class regparser.layer.def_finders.ScopeMatch(finder)[source]

Bases: regparser.layer.def_finders.FinderBase

We know these will be definitions because the scope of the definition is spelled out. E.g. ‘for the purposes of XXX, the term YYY means’

find(node)[source]
class regparser.layer.def_finders.SmartQuotes(stack)[source]

Bases: regparser.layer.def_finders.FinderBase

Definitions indicated via smart quotes

find(node)[source]
has_def_indicator()[source]

With smart quotes, we catch some false positives, phrases in quotes that are not terms. This extra test lets us know that a parent of the node looks like it would contain definitions.

class regparser.layer.def_finders.XMLTermMeans(existing_refs=None)[source]

Bases: regparser.layer.def_finders.FinderBase

Namespace for a matcher for e.g. ‘<E>XXX</E> means YYY’

find(node)[source]
pos_start(needle, haystack)[source]

Search for the first instance of needle in the haystack excluding any overlaps from self.exclusions. Implicitly returns None if it can’t be found
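A minimal sketch of that exclusion-aware search, assuming `exclusions` is an iterable of `(start, end)` offset pairs (a guess at the real structure, which lives on `self`):

```python
def pos_start(needle, haystack, exclusions=()):
    """Return the start of the first occurrence of needle in haystack that
    does not overlap any (start, end) exclusion range; None otherwise."""
    search_from = 0
    while True:
        idx = haystack.find(needle, search_from)
        if idx < 0:
            return None  # implicit in the real method; explicit here
        end = idx + len(needle)
        if not any(idx < excl_end and excl_start < end
                   for excl_start, excl_end in exclusions):
            return idx
        search_from = idx + 1
```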

regparser.layer.external_citations module

class regparser.layer.external_citations.ExternalCitationParser(tree, **context)[source]

Bases: regparser.layer.layer.Layer

External Citations are references to documents outside of eRegs. See external_types for specific types of external citations

process(node)[source]
shorthand = 'external-citations'

regparser.layer.external_types module

Parsers for various types of external citations. Consumed by the external citation layer

class regparser.layer.external_types.CFRFinder[source]

Bases: regparser.layer.external_types.FinderBase

Code of Federal Regulations. Explicitly ignore any references within this part

CITE_TYPE = 'CFR'
find(node)[source]
class regparser.layer.external_types.Cite(cite_type, start, end, components, url)

Bases: tuple

cite_type

Alias for field number 0

components

Alias for field number 3

end

Alias for field number 2

start

Alias for field number 1

url

Alias for field number 4
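Cite is a namedtuple; a quick illustration of the documented field order (the component values here are made up):

```python
from collections import namedtuple

Cite = namedtuple('Cite', ['cite_type', 'start', 'end', 'components', 'url'])

cite = Cite(cite_type='USC', start=10, end=22,
            components={'title': '15', 'section': '1693'},
            url='https://example.com/uscode/15/1693')

# Fields are accessible by name or by the documented index.
assert cite.cite_type == cite[0]
assert cite.components == cite[3]
```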

class regparser.layer.external_types.CustomFinder[source]

Bases: regparser.layer.external_types.FinderBase

Explicitly configured citations; part of settings

CITE_TYPE = 'OTHER'
find(node)[source]
class regparser.layer.external_types.FDSYSFinder[source]

Bases: object

Common parent class to Finders which generate an FDSYS url based on matching a PyParsing grammar

CONST_PARAMS

Constant parameters we pass to the FDSYS url; a dict

GRAMMAR

A pyparsing grammar with relevant components labeled

find(node)[source]
class regparser.layer.external_types.FinderBase[source]

Bases: object

Base class for all of the external citation parsers. Defines the interface they must implement.

CITE_TYPE

A constant to represent the citations this produces.

find(node)[source]

Given a Node, pull out any external citations it may contain as a generator of Cites

class regparser.layer.external_types.PublicLawFinder[source]

Bases: regparser.layer.external_types.FDSYSFinder, regparser.layer.external_types.FinderBase

Public Law

CITE_TYPE = 'PUBLIC_LAW'
CONST_PARAMS = {'collection': 'plaw', 'lawtype': 'public'}
GRAMMAR = QuickSearchable:({{{{Suppress:({{WordStart 'Public'} WordEnd}) Suppress:({{WordStart 'Law'} WordEnd})} W:(0123...)} Suppress:("-")} W:(0123...)})
class regparser.layer.external_types.StatutesFinder[source]

Bases: regparser.layer.external_types.FDSYSFinder, regparser.layer.external_types.FinderBase

Statutes at large

CITE_TYPE = 'STATUTES_AT_LARGE'
CONST_PARAMS = {'collection': 'statute'}
GRAMMAR = QuickSearchable:({{W:(0123...) Suppress:("Stat.")} W:(0123...)})
class regparser.layer.external_types.USCFinder[source]

Bases: regparser.layer.external_types.FDSYSFinder, regparser.layer.external_types.FinderBase

U.S. Code

CITE_TYPE = 'USC'
CONST_PARAMS = {'collection': 'uscode'}
GRAMMAR = QuickSearchable:({{{W:(0123...) "U.S.C."} Suppress:(["Chapter"])} W:(0123...)})
class regparser.layer.external_types.UrlFinder[source]

Bases: regparser.layer.external_types.FinderBase

Any raw urls in the text

CITE_TYPE = 'OTHER'
PUNCTUATION = '.,;?\'")-'
REGEX = <_sre.SRE_Pattern object>
find(node)[source]
regparser.layer.external_types.fdsys_url(**params)[source]

Generate a URL to an FDSYS redirect
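A sketch of what such a helper might look like; the base URL here is an assumption for illustration, not the verified endpoint:

```python
from urllib.parse import urlencode

# Assumed redirect endpoint; consult the source for the real value.
FDSYS_BASE = 'https://www.gpo.gov/fdsys/search/citation.result.index.html'

def fdsys_url(**params):
    """Generate a URL to an FDSYS redirect from keyword parameters."""
    return FDSYS_BASE + '?' + urlencode(sorted(params.items()))
```

Finders like `USCFinder` would combine their `CONST_PARAMS` with values parsed out of the citation text before calling this.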

regparser.layer.formatting module

Finds and abstracts formatting information from the regulation tree. In many ways, this is like a markdown parser.

class regparser.layer.formatting.Dashes[source]

Bases: regparser.layer.formatting.PlaintextFormatData

E.g. Some text some text_____

REGEX = <_sre.SRE_Pattern object>
match_data(match)[source]
class regparser.layer.formatting.FencedData[source]

Bases: regparser.layer.formatting.PlaintextFormatData

E.g.

```note
Line 1
Line 2
```

REGEX = <_sre.SRE_Pattern object>
match_data(match)[source]
class regparser.layer.formatting.Footnotes[source]

Bases: regparser.layer.formatting.PlaintextFormatData

E.g. [^4](Contents of footnote) The footnote may also contain parens if they are escaped with a backslash

REGEX = <_sre.SRE_Pattern object>
match_data(match)[source]
class regparser.layer.formatting.Formatting(tree, **context)[source]

Bases: regparser.layer.layer.Layer

Layer responsible for tables, subscripts, and other formatting-related information

process(node)[source]
shorthand = 'formatting'
class regparser.layer.formatting.HeaderStack[source]

Bases: regparser.tree.priority_stack.PriorityStack

Used to determine Table Headers – indeed, they are complicated enough to warrant their own stack

unwind()[source]
class regparser.layer.formatting.PlaintextFormatData[source]

Bases: object

Base class for formatting information which can be derived from the plaintext of a regulation node

REGEX

Regular expression used to find matches in the plain text

match_data(match)[source]

Derive data structure (as a dict) from the regex match

process(text)[source]

Find all matches of self.REGEX, transform them into the appropriate data structure, return these as a list
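The base-class contract can be sketched as follows, with Subscript as the example subclass; the shape of the `match_data` dict is illustrative, not the exact layer payload:

```python
import re

class PlaintextFormatData(object):
    """Derive formatting info from a node's plain text."""
    REGEX = None  # subclasses supply a compiled pattern

    def match_data(self, match):
        """Derive a dict from a regex match."""
        raise NotImplementedError

    def process(self, text):
        """Find all REGEX matches; convert each via match_data."""
        return [self.match_data(m) for m in self.REGEX.finditer(text)]

class Subscript(PlaintextFormatData):
    """E.g. a_{0}"""
    REGEX = re.compile(r'(\w+)_\{(\w+)\}')

    def match_data(self, match):
        return {'text': match.group(0),
                'variable': match.group(1),
                'subscript': match.group(2)}
```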

class regparser.layer.formatting.Subscript[source]

Bases: regparser.layer.formatting.PlaintextFormatData

E.g. a_{0}

REGEX = <_sre.SRE_Pattern object>
match_data(match)[source]
class regparser.layer.formatting.Superscript[source]

Bases: regparser.layer.formatting.PlaintextFormatData

E.g. x^{2}

REGEX = <_sre.SRE_Pattern object>
match_data(match)[source]
class regparser.layer.formatting.TableHeaderNode(text, level)[source]

Bases: object

Represents a cell in a table’s header

height()[source]
width()[source]
regparser.layer.formatting.build_header(xml_nodes)[source]

Builds a TableHeaderNode tree, with an empty root. Each node in the tree includes its colspan/rowspan

regparser.layer.formatting.build_header_rowspans(tree_root, max_height)[source]

The following table is an example of why we need a relatively complicated approach to setting rowspan:

|R1C1     |R1C2               |
|R2C1|R2C2|R2C3     |R2C4     |
|    |    |R3C1|R3C2|R3C3|R3C4|

If we set the rowspan of each node to:

max_height - node.height() - node.level + 1

R1C1 will end up with a rowspan of 2 instead of 1, because of difficulties handling the implicit rowspans for R2C1 and R2C2.

Instead, we generate a list of the paths to each leaf and then set rowspan based on that.

Rowspan for leaves is max_height - node.height() - node.level + 1, and for root is simply 1. Other nodes’ rowspans are set to the level of the node after them minus their own level.

regparser.layer.formatting.node_to_table_xml_els(node)[source]

Search in a few places for GPOTABLE xml elements

regparser.layer.formatting.table_xml_to_data(xml_node)[source]

Construct a data structure of the table data. We provide a different structure than the native XML as the XML encodes too much logic. This structure can be used to generate semi-complex tables which could not be generated from the markdown above

regparser.layer.formatting.table_xml_to_plaintext(xml_node)[source]

Markdown representation of a table. Note that this doesn’t account for all the options needed to display the table properly, but works fine for simple tables. This gets included in the reg plain text

regparser.layer.graphics module

regparser.layer.internal_citations module

class regparser.layer.internal_citations.InternalCitationParser(tree, cfr_title, **context)[source]

Bases: regparser.layer.layer.Layer

parse(text, label, title=None)[source]

Parse the provided text, pulling out all the internal (self-referential) citations.

pre_process()[source]

As a preprocessing step, run through the entire tree, collecting all labels.

process(node)[source]
remove_missing_citations(citations, text)[source]

Remove any citations to labels we have not seen before (i.e. those collected in the pre_process stage)

shorthand = 'internal-citations'
static strip_whitespace(text, citations)[source]

Modifies the offsets in place to exclude any trailing whitespace.

regparser.layer.interpretations module

regparser.layer.key_terms module

class regparser.layer.key_terms.KeyTerms(tree, **context)[source]

Bases: regparser.layer.layer.Layer

static is_definition(node, keyterm)[source]

A definition might be masquerading as a keyterm. Do not allow this.

classmethod keyterm_in_node(node, ignore_definitions=True)[source]
process(node)[source]

Get keyterms if we have text in the node that preserves the <E> tags.

shorthand = u'keyterms'
regparser.layer.key_terms.keyterm_in_text(tagged_text)[source]

Pull out the key term of the provided markup using a regex. The XML <E> tags that indicate keyterms are also used for italics, which means some non-key term phrases would be lumped in. We eliminate them here.
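The real regex lives in the source; a simplified sketch of the idea, assuming keyterms appear as a leading `<E T="03">` element in the tagged text (an assumption about the markup):

```python
import re

# Assumed markup: a keyterm is a leading <E T="03"> element.
KEYTERM = re.compile(r'^\s*<E T="03">(?P<keyterm>[^<]+)</E>')

def keyterm_in_text(tagged_text):
    """Return the keyterm phrase, or None if the leading tag is absent."""
    match = KEYTERM.match(tagged_text)
    if match:
        return match.group('keyterm').strip()
    return None
```

Italicized phrases elsewhere in the paragraph use the same `<E>` tag but do not match, since the pattern is anchored to the start of the text.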

regparser.layer.layer module

class regparser.layer.layer.Layer(tree, **context)[source]

Bases: object

Base class for all of the Layer generators. Defines the interface they must implement

build(cache=None)[source]
builder(node, cache=None)[source]
static convert_to_search_replace(matches, text, start_fn, end_fn)[source]

We’ll often have a bunch of text matches based on offsets. To use the “search-replace” encoding (which is a bit more resilient to minor variations in text), we need to convert these offsets into “locations” – i.e. of all of the instances of a string in this text, which should be matched. Yields SearchReplace tuples
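A sketch of that offsets-to-locations conversion. In the real method, `start_fn`/`end_fn` extract offsets from whatever match objects the caller has; here matches are plain `(start, end)` tuples for simplicity:

```python
import re
from collections import namedtuple

SearchReplace = namedtuple('SearchReplace',
                           ['text', 'locations', 'representative'])

def convert_to_search_replace(matches, text, start_fn, end_fn):
    """Group offset matches by matched string; for each string, yield which
    of its occurrences in `text` (by occurrence index) were matched."""
    by_phrase = {}
    for match in matches:
        phrase = text[start_fn(match):end_fn(match)]
        by_phrase.setdefault(phrase, []).append(match)
    for phrase, phrase_matches in by_phrase.items():
        matched_starts = {start_fn(m) for m in phrase_matches}
        all_starts = [m.start()
                      for m in re.finditer(re.escape(phrase), text)]
        locations = [i for i, start in enumerate(all_starts)
                     if start in matched_starts]
        yield SearchReplace(phrase, locations, phrase_matches[0])
```

For instance, matching only the second "cat" in `'cat dog cat'` yields `SearchReplace('cat', [1], ...)`: occurrence index 1, regardless of exact character offsets.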

pre_process()[source]

Take the whole tree and do any pre-processing

process(node)[source]

Construct the element of the layer relevant to processing the given node. Returns (paragraph_id, layer_content) or None if there is no relevant information.

shorthand

Unique identifier for this layer
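A minimal Layer subclass might look like the sketch below, assuming nodes are plain dicts with `label`, `text`, and `children` keys (the real tree uses `regparser.tree.struct.Node`, and the real `ParagraphMarkers` is more thorough):

```python
class Layer(object):
    """Base class: walk the tree, collect per-node layer content."""
    shorthand = None  # unique identifier for this layer

    def __init__(self, tree, **context):
        self.tree = tree
        self.context = context

    def pre_process(self):
        """Hook: inspect the whole tree before processing nodes."""

    def process(self, node):
        """Return layer content for this node, or None."""
        raise NotImplementedError

    def build(self):
        self.pre_process()
        layer = {}
        stack = [self.tree]
        while stack:
            node = stack.pop()
            result = self.process(node)
            if result is not None:
                layer[node['label']] = result
            stack.extend(node.get('children', []))
        return layer

class ParagraphMarkers(Layer):
    """Hypothetical simplification of the paragraph-markers layer."""
    shorthand = 'paragraph-markers'

    def process(self, node):
        marker = '(%s)' % node['label'].split('-')[-1]
        if node.get('text', '').startswith(marker):
            return [{'text': marker, 'locations': [0]}]
```

`build()` then maps each node label to that node's layer content, which is roughly the structure serialized for downstream consumers.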

class regparser.layer.layer.SearchReplace(text, locations, representative)

Bases: tuple

locations

Alias for field number 1

representative

Alias for field number 2

text

Alias for field number 0

regparser.layer.meta module

class regparser.layer.meta.Meta(tree, cfr_title, version, **context)[source]

Bases: regparser.layer.layer.Layer

effective_date()[source]
process(node)[source]

If this is the root element, add some ‘meta’ information about this regulation, including its cfr title, effective date, and any configured info

shorthand = 'meta'

regparser.layer.model_forms_text module

regparser.layer.paragraph_markers module

class regparser.layer.paragraph_markers.ParagraphMarkers(tree, **context)[source]

Bases: regparser.layer.layer.Layer

process(node)[source]

Look for any leading paragraph markers.

shorthand = 'paragraph-markers'
regparser.layer.paragraph_markers.marker_of(node)[source]

Try multiple potential marker formats. See if any apply to this node.

regparser.layer.scope_finder module

class regparser.layer.scope_finder.ScopeFinder[source]

Bases: object

Useful for determining the scope of a term

add_subparts(root)[source]

Document the relationship between sections and subparts

determine_scope(stack)[source]
scope_of_text(text, label_struct, verify_prefix=True)[source]

Given specific text, try to determine the definition scope it indicates. Implicitly returns None if none is found.

subpart_scope(label_parts)[source]

Given a label, determine which sections fall under the same subpart

regparser.layer.section_by_section module

class regparser.layer.section_by_section.SectionBySection(tree, notices, **context)[source]

Bases: regparser.layer.layer.Layer

process(node)[source]

Determine which (if any) section-by-section analyses would apply to this node.

shorthand = 'analyses'

regparser.layer.table_of_contents module

class regparser.layer.table_of_contents.TableOfContentsLayer(tree, **context)[source]

Bases: regparser.layer.layer.Layer

check_toc_candidacy(node)[source]

To be eligible to contain a table of contents, all of a node’s children must have a title element. If one of the children is an empty subpart, we check all its children.

process(node)[source]

Create a table of contents for this node, if it’s eligible. We ignore subparts.

shorthand = 'toc'

regparser.layer.terms module

class regparser.layer.terms.Inflected(singular, plural)

Bases: tuple

plural

Alias for field number 1

singular

Alias for field number 0

class regparser.layer.terms.ParentStack[source]

Bases: regparser.tree.priority_stack.PriorityStack

Used to keep track of the parents while processing nodes to find terms. This is needed as the definition may need to find its scope in parents.

parent_of(node)[source]
unwind()[source]

No collapsing needs to happen.

class regparser.layer.terms.Terms(*args, **kwargs)[source]

Bases: regparser.layer.layer.Layer

ENDS_WITH_WORDCHAR = <_sre.SRE_Pattern object>
STARTS_WITH_WORDCHAR = <_sre.SRE_Pattern object>
applicable_terms(label)[source]

Find all terms that might be applicable to nodes with this label. Note that we don’t have to deal with subparts as subpart_scope simply applies the definition to all sections in a subpart

calculate_offsets(text, applicable_terms, exclusions=None, inclusions=None)[source]

Search for defined terms in this text, including singular and plural forms of these terms, with a preference for all larger (i.e. containing) terms.
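The preference for larger (containing) terms can be sketched as a longest-first greedy claim over offsets; singular/plural handling and the exclusion/inclusion parameters are omitted here, and exact word-boundary behavior is an assumption:

```python
import re

def calculate_offsets(text, applicable_terms):
    """Return (term, (start, end)) pairs, preferring longer terms when
    one term contains another (e.g. 'state law' beats 'state')."""
    claimed = []
    offsets = []
    for term in sorted(applicable_terms, key=len, reverse=True):
        for match in re.finditer(r'\b%s\b' % re.escape(term), text):
            span = (match.start(), match.end())
            overlaps = any(span[0] < end and start < span[1]
                           for start, end in claimed)
            if not overlaps:
                claimed.append(span)
                offsets.append((term, span))
    return offsets
```

Because 'state law' is matched first, the 'state' inside it is never claimed separately.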

excluded_offsets(node)[source]

We explicitly exclude certain chunks of text (for example, words we are defining shouldn’t have links appear within the defined term). More will be added in the future

ignored_offsets(cfr_part, text)[source]

Return a list of offsets corresponding to the presence of an “ignored” phrase in the text

inflected(term)[source]

Check the memoized Inflected version of the provided term

is_exclusion(term, node)[source]

Some definitions are exceptions/exclusions of a previously defined term. At the moment, we do not want to include these as they would replace previous (correct) definitions. We also remove terms which are inside an instance of the IGNORE_DEFINITIONS_IN setting

look_for_defs(node, stack=None)[source]

Check a node and recursively check its children for terms which are being defined. Add these definitions to self.scoped_terms.

node_definitions(node, stack=None)[source]

Find defined terms in this node’s text.

pre_process()[source]

Step through every node in the tree, finding definitions. Also keep track of which subpart we are in. Finally, document all defined terms.

process(node)[source]

Determine which (if any) definitions would apply to this node, then find if any of those terms appear in this node

shorthand = u'terms'

Module contents