regparser.tree package

Submodules

regparser.tree.build module

regparser.tree.interpretation module

regparser.tree.paragraph module

class regparser.tree.paragraph.ParagraphParser(p_regex, node_type)[source]
best_start(text, p_level, paragraph, starts, exclude=None)[source]

Given a list of potential paragraph starts, pick the best based on knowledge of subparagraph structure. Do this by checking if the id following the subparagraph (e.g. ii) is between the first match and the second. If so, skip it, as that implies the first match was a subparagraph.

build_tree(text, p_level=0, exclude=None, label=None, title='')[source]

Build a dict to represent the text hierarchy.

find_paragraph_start_match(text, p_level, paragraph, exclude=None)[source]

Find the positions for the start and end of the requested label. p_level is one of 0, 1, 2, 3; paragraph is the index within that label. Return None if not present. Does not return results in the exclude list (a list of start/stop indices).

static matching_subparagraph_ids(p_level, paragraph)[source]

Return a list of matches if this paragraph id matches one of the subparagraph ids (e.g. letter (i) and roman numeral (i)).

paragraph_offsets(text, p_level, paragraph, exclude=None)[source]

Find the start/end of the requested paragraph. Assumes the text does not just jump up a p_level – see build_paragraph_tree below.

paragraphs(text, p_level, exclude=None)[source]

Return a list of paragraph offsets defined by the level param.

regparser.tree.paragraph.hash_for_paragraph(text)[source]

Hash a chunk of text and convert it into an integer for use with a MARKERLESS paragraph identifier. We’ll trim to just 8 hex characters for legibility. We don’t need to fear hash collisions, as 16**8 ~ 4 billion possibilities gives us plenty of headroom. The birthday paradox tells us we’d only expect collisions after ~ 60 thousand entries; we’re expecting at most a few hundred.
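The docstring above does not name the digest algorithm; as an illustration, a sketch assuming SHA-1 truncated to 8 hex characters:

```python
import hashlib


def hash_for_paragraph(text):
    # Hash the text and keep only the first 8 hex characters
    # (16**8 possibilities), then convert to an integer for use as a
    # MARKERLESS identifier. SHA-1 is an assumption for illustration;
    # any stable digest with enough output bits would behave the same.
    digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
    return int(digest[:8], 16)
```

The same text always yields the same identifier, and the result is always below 16**8.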

regparser.tree.paragraph.p_level_of(marker)[source]

Given a marker (string), determine the possible paragraph levels it could fall into. This is useful for determining the order of paragraphs.
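A marker can be ambiguous, which is why a list of levels (rather than a single level) is returned. A sketch with a hypothetical depth table; the real module defines its own marker lists:

```python
# Hypothetical depth table: index == paragraph level. Illustrative only;
# the real module's marker lists are longer.
P_LEVELS = [
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],  # level 0: (a)
    [str(n) for n in range(1, 10)],                      # level 1: (1)
    ['i', 'ii', 'iii', 'iv', 'v', 'vi'],                 # level 2: (i)
    ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],            # level 3: (A)
]


def p_level_of(marker):
    # 'i' is both the letter i (level 0) and the roman numeral i
    # (level 2), so the result may contain more than one level.
    return [level for level, markers in enumerate(P_LEVELS)
            if marker in markers]
```

This ambiguity is what best_start and matching_subparagraph_ids above have to disambiguate.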

regparser.tree.priority_stack module

class regparser.tree.priority_stack.PriorityStack[source]

Bases: object

add(node_level, node)[source]

Add a new node with level node_level to the stack. Unwind the stack when necessary. Returns self for chaining.

lineage()[source]

Fetch the last element of each level of priorities. When the stack is used to keep track of a tree, this list includes a list of ‘parents’, as the last element of each level is the parent being processed.

lineage_with_level()[source]
peek()[source]
peek_last()[source]
peek_level(level)[source]

Find a whole level of nodes in the stack

pop()[source]
push(m)[source]
push_last(m)[source]
size()[source]
unwind()[source]

Combine nodes as needed while walking back up the stack. Intended to be overridden, as how to combine elements depends on the element type.
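A minimal sketch of the stack discipline described above, assuming nodes are plain dicts with 'label' and 'children' keys (the real class stores its own node type and expects unwind() to be overridden):

```python
class PriorityStack(object):
    """Sketch: each index of the stack holds the nodes of one level."""

    def __init__(self):
        self.m_stack = []  # list of levels; each level is a list of nodes

    def size(self):
        return len(self.m_stack)

    def unwind(self):
        # Fold the deepest level's nodes into the children of the last
        # node one level up -- one plausible combination rule.
        children = self.m_stack.pop()
        self.m_stack[-1][-1]['children'].extend(children)

    def add(self, node_level, node):
        # Unwind any levels deeper than node_level, then append the
        # node at its level. Returns self for chaining.
        while self.size() > node_level + 1:
            self.unwind()
        if self.size() == node_level:
            self.m_stack.append([])
        self.m_stack[-1].append(node)
        return self

    def lineage(self):
        # Last node of each level: the chain of 'parents' in progress.
        return [level[-1] for level in self.m_stack]
```

Adding a node at a shallower level triggers unwinding, which is how children get attached to their parents as the input is consumed in order.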

regparser.tree.reg_text module

regparser.tree.reg_text.build_empty_part(part)[source]

When a regulation doesn’t have a subpart, we give it an emptypart (a dummy subpart) so that the regulation tree is consistent.

regparser.tree.reg_text.build_subjgrp(title, part, letter_list)[source]

We construct a fake “letter” here by taking the first letter of each word in the subjgrp’s title, or the first two letters of the first word if the title is a single word. We avoid single letters to make sure we don’t duplicate an existing subpart, and we hope that the initialisms this method creates are unique for this regulation. This could be made more robust by accepting a list of existing initialisms, checking against that list as we construct new ones, and returning both the list and the Node.
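The initialism rule itself is simple; a sketch (the helper name is hypothetical, and collision checking against existing initialisms is omitted, as the docstring notes):

```python
def subjgrp_letter(title):
    # First letter of each word in the subject group's title, or the
    # first two letters of the only word -- avoiding single letters so
    # the result can't collide with a subpart letter.
    words = title.split()
    if len(words) == 1:
        return words[0][:2]
    return ''.join(word[0] for word in words)
```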

regparser.tree.reg_text.build_subpart(text, part)[source]
regparser.tree.reg_text.find_next_section_start(text, part)[source]

Find the start of the next section (e.g. 205.14)
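The exact pattern the module uses is not shown here; an illustrative approximation that scans for a section number belonging to the given part:

```python
import re


def find_next_section_start(text, part):
    # Look for a section number such as '205.14' for part 205.
    # The real regex is an assumption; this one only sketches the idea.
    match = re.search(r'\b%d\.\d+\b' % part, text)
    return match.start() if match else None
```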

regparser.tree.reg_text.find_next_subpart_start(text)[source]

Find the start of the next Subpart (e.g. Subpart B)

regparser.tree.reg_text.next_section_offsets(text, part)[source]

Find the start/end of the next section

regparser.tree.reg_text.next_subpart_offsets(text)[source]

Find the start/end of the next subpart

regparser.tree.reg_text.sections(text, part)[source]

Return a list of section offsets. Does not include appendices.

regparser.tree.reg_text.subjgrp_label(starting_title, letter_list)[source]
regparser.tree.reg_text.subparts(text)[source]

Return a list of subpart offsets. Does not include appendices or supplements.

regparser.tree.struct module

class regparser.tree.struct.FrozenNode(text='', children=(), label=(), title='', node_type=u'regtext', tagged_text='')[source]

Bases: object

Immutable interface for nodes. No guarantees about internal state.

child_labels
children
clone(**kwargs)[source]

Implement a namedtuple _replace style functionality, copying all fields that aren’t explicitly replaced.
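A sketch of the _replace-style clone on a reduced two-field stand-in (the real FrozenNode has more fields):

```python
class FrozenNode(object):
    """Reduced sketch: immutable fields exposed as read-only properties."""

    def __init__(self, text='', title=''):
        self._text = text
        self._title = title

    @property
    def text(self):
        return self._text

    @property
    def title(self):
        return self._title

    def clone(self, **kwargs):
        # Copy every field, replacing only those passed explicitly --
        # the same contract as namedtuple._replace.
        fields = {'text': self.text, 'title': self.title}
        fields.update(kwargs)
        return FrozenNode(**fields)
```

Because instances are immutable, clone is the only way to "change" a node: it returns a fresh object and leaves the original untouched.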

static from_node(node)[source]

Convert a struct.Node (or similar) into a struct.FrozenNode. This also checks if this node has already been instantiated. If so, it returns the instantiated version (i.e. only one of each identical node exists in memory).

hash
label
label_id
node_type
prototype()[source]

When we instantiate a FrozenNode, we add it to _pool if we’ve not seen an identical FrozenNode before. If we have, we want to work with that previously seen version instead. This method returns the _first_ FrozenNode with identical fields.
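The interning behavior can be sketched in isolation (the pool keying here is an assumption; the real class keys on its hash field):

```python
class InternPool(object):
    """Sketch of _pool-based interning: the first instance with a given
    set of fields becomes the prototype for all later duplicates."""

    _pool = {}

    @classmethod
    def prototype(cls, node_fields):
        # Build an order-independent key from the fields; return the
        # first-seen object with those fields, creating it if needed.
        key = tuple(sorted(node_fields.items()))
        if key not in cls._pool:
            cls._pool[key] = dict(node_fields)
        return cls._pool[key]
```

Two lookups with identical fields return the *same* object, so identity comparison (`is`) works and memory use stays flat however many duplicates appear.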

tagged_text
text
title
class regparser.tree.struct.FullNodeEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)[source]

Bases: json.encoder.JSONEncoder

Encodes Nodes into JSON without losing any of the fields

FIELDS = set(['tagged_text', 'title', 'text', 'source_xml', 'label', 'node_type', 'children'])
default(obj)[source]
class regparser.tree.struct.Node(text='', children=None, label=None, title=None, node_type=u'regtext', source_xml=None, tagged_text='')[source]

Bases: object

APPENDIX = u'appendix'
EMPTYPART = u'emptypart'
EXTRACT = u'extract'
INTERP = u'interp'
INTERP_MARK = 'Interp'
MARKERLESS_REGEX = <_sre.SRE_Pattern object>
NOTE = u'note'
REGTEXT = u'regtext'
SUBPART = u'subpart'
cfr_part
depth()[source]

Inspect the label and type to determine the node’s depth

is_markerless()[source]
classmethod is_markerless_label(label)[source]
is_section()[source]

Sections are contained within subparts/subject groups. They are not part of the appendix.

label_id()[source]
walk(fn)[source]

See walk(node, fn)

class regparser.tree.struct.NodeEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)[source]

Bases: json.encoder.JSONEncoder

Custom JSON encoder to handle Node objects

default(obj)[source]
regparser.tree.struct.filter_walk(node, fn)[source]

Perform fn on the label for every node in the tree and return a list of nodes on which the function returns truthy.

regparser.tree.struct.find(root, label)[source]

Search through the tree to find the node with this label.

regparser.tree.struct.find_first(root, predicate)[source]

Walk the tree and find the first node which matches the predicate.

regparser.tree.struct.find_parent(root, label)[source]

Search through the tree to find the _parent_ of a node with this label.

regparser.tree.struct.frozen_node_decode_hook(d)[source]

Convert a JSON object into a FrozenNode

regparser.tree.struct.full_node_decode_hook(d)[source]

Convert a JSON object into a full Node

regparser.tree.struct.merge_duplicates(nodes)[source]

Given a list of nodes with the same-length label, merge any duplicates (by combining their children)
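A sketch of the merge, assuming nodes are dicts with 'label' and 'children' keys rather than struct.Node instances:

```python
def merge_duplicates(nodes):
    # Group same-length-label nodes by label, combining the children of
    # duplicates while preserving first-seen order (dicts keep insertion
    # order in Python 3.7+).
    merged = {}
    for node in nodes:
        key = tuple(node['label'])
        if key in merged:
            merged[key]['children'].extend(node['children'])
        else:
            merged[key] = {'label': node['label'],
                           'children': list(node['children'])}
    return list(merged.values())
```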

regparser.tree.struct.treeify(nodes)[source]

Given a list of nodes, convert those nodes into the appropriate tree structure based on their labels. This assumes that all nodes will fall under a set of ‘root’ nodes, which have the minimum-length label.
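One plausible implementation of that label-prefix rule, again using dicts with 'label' and 'children' keys as stand-ins for struct.Node (note it mutates the input nodes' children lists):

```python
def treeify(nodes):
    # Attach each node to the nearest node whose label is a proper
    # prefix of its own; minimum-length-label nodes become roots.
    nodes = sorted(nodes, key=lambda n: len(n['label']))
    min_len = len(nodes[0]['label'])
    roots = []
    for node in nodes:
        if len(node['label']) == min_len:
            roots.append(node)
            continue
        parents = [n for n in nodes
                   if n is not node
                   and node['label'][:len(n['label'])] == n['label']]
        if parents:
            # The deepest matching prefix is the closest ancestor.
            parent = max(parents, key=lambda n: len(n['label']))
            parent['children'].append(node)
        else:
            roots.append(node)
    return roots
```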

regparser.tree.struct.walk(node, fn)[source]

Perform fn for every node in the tree. Pre-order traversal. fn must be a function that accepts a root node.
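A sketch of the pre-order traversal, with dicts carrying a 'children' list standing in for struct.Node (collecting non-None results is an assumption about the return contract):

```python
def walk(node, fn):
    # Pre-order traversal: visit the node itself, then each child in
    # order, accumulating fn's non-None results.
    results = []
    result = fn(node)
    if result is not None:
        results.append(result)
    for child in node['children']:
        results.extend(walk(child, fn))
    return results
```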

regparser.tree.supplement module

regparser.tree.supplement.find_supplement_start(text, supplement='I')[source]

Find the start of the supplement (e.g. Supplement I)

Module contents