pacer_lib.reader

class pacer_lib.reader.UTF8Recoder(f, encoding)[source]

Iterator that reads an encoded stream and re-encodes the input to UTF-8.

class pacer_lib.reader.UnicodeReader(f, dialect=<class csv.excel>, encoding='utf-8', **kwds)[source]

A CSV reader which will iterate over lines in the CSV file “f”, which is encoded in the given encoding.

class pacer_lib.reader.UnicodeWriter(f, dialect=<class csv.excel>, encoding='utf-8', **kwds)[source]

A CSV writer which will write rows to CSV file “f”, which is encoded in the given encoding.

class pacer_lib.reader.docket_parser(docket_path='./results/local_docket_archive', output_path='./results')[source]

Returns a docket_parser object that provides functions to quickly load .html PACER docket sheets from the specified docket_path, parse metadata (about both the download of the docket and the characteristics of the case), and convert the dockets into a machine-readable format (CSV). A short usage sketch follows the keyword arguments below.

This object is built on top of BeautifulSoup 4.

Keyword Arguments:

  • docket_path: specifies a relative path to the folder where dockets (i.e., input data) are stored; dockets should be in .html format
  • output_path: specifies a relative path to the folder where output should be written. If this folder does not exist, it will be created. If the two subfolders (/case_meta/ and /download_meta/) do not exist within the output_path, they will also be created.
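
A minimal sketch of constructing a docket_parser; the paths shown are purely illustrative, and the defaults from the signature above are used if they are omitted:

    from pacer_lib.reader import docket_parser

    # Point the parser at a folder of downloaded .html dockets and a folder
    # for output (output subfolders are created if they do not exist).
    parser = docket_parser(docket_path='./results/local_docket_archive',
                           output_path='./results')
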
extract_all_meta(data, debug=False)[source]

Returns two dictionaries: one containing download_meta and one containing metadata extracted from the docket itself. extract_all_meta() runs extract_case_meta(), extract_lawyer_meta() and extract_download_meta() on data (a string literal of an .html document). Two dictionaries are returned (one with download_meta and one with both case_meta and lawyer_meta) because download_meta and case_meta contain overlapping information.

If debug is not turned on, extract_all_meta will ignore any error output from the sub-functions (e.g., if the functions cannot find the relevant sections).

Output Documentation See the output documentation of extract_case_meta(), extract_lawyer_meta() and extract_download_meta().
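
A short sketch of calling extract_all_meta() on a single downloaded docket; the filename is hypothetical, and it is assumed the two dictionaries are returned as a (download_meta, other_meta) pair:

    from pacer_lib.reader import docket_parser

    parser = docket_parser()

    # Read one previously downloaded PACER docket as a string.
    with open('./results/local_docket_archive/example_docket.html') as f:
        data = f.read()

    # Assumption: the two documented dictionaries come back as a tuple.
    download_meta, case_and_lawyer_meta = parser.extract_all_meta(data, debug=False)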

extract_case_meta(data)[source]

Returns a dictionary of case information (e.g., case_name, demand, nature of suit, jurisdiction, assigned judge, etc.) extracted from an .html docket (passed as a string literal through data). This information should be available in all dockets downloaded from PACER.

This information may overlap with information from extract_download_meta(), but it is technically extracted from a different source (the docket sheet, rather than the results page of the PACER Case Locator).

In consolidated cases, there is information about the lead case, and a link. We extract any links in the case_meta section of the document and store them in the dictionary under the key meta_links.

There are some encoding issues with characters such as à that we have tried to address, but the handling may need to be improved in the future.

If extract_case_meta() cannot find the case_meta section of the docket, it will return a dictionary with a single key, Error_case_meta. A usage sketch follows the key listings below.

Output Documentation Please note that extract_case_meta does common cleaning and then treats each (text):(text) line as a key:value pair, so this documentation only documents the most common keys that we have observed.

These keys are, generally, self-explanatory and are only listed for convenience.

  • Case name
  • Assigned to
  • Referred to
  • Demand
  • Case in other court
  • Cause
  • Date Filed
  • Date Terminated
  • Jury Demand
  • Nature of Suit
  • Jurisdiction

Special keys:

  • Member case: the existence of this key indicates that this is probably the lead case of a consolidated case.
  • Lead case: the existence of this key indicates that this is probably a member case of a consolidated case.
  • meta_links: this key will only exist if there are links in the case_meta section of the PACER docket.
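
A minimal sketch of extracting case metadata from a single docket and checking for the error key; the filename is hypothetical:

    from pacer_lib.reader import docket_parser

    parser = docket_parser()

    with open('./results/local_docket_archive/example_docket.html') as f:
        data = f.read()

    case_meta = parser.extract_case_meta(data)
    if 'Error_case_meta' in case_meta:
        print('Could not find the case_meta section of this docket.')
    else:
        # Keys mirror the (text):(text) lines of the docket sheet,
        # e.g. 'Date Filed', 'Nature of Suit', 'Jurisdiction'.
        print(case_meta.get('Nature of Suit'))
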
extract_download_meta(data)[source]

Returns a dictionary that contains all of the download_meta that was stored by pacer_lib.scraper() at the time of download (i.e., the detailed_info JSON object that is commented out at the top of new downloads from PACER). This is meant to help improve reproducibility. A usage sketch follows the key documentation below.

detailed_info is an add-on in later versions of pacer_lib that records case-level data from the search screen (date_closed, link, nature of suit, case-name, etc.) as well as the date and time of download.

In earlier versions of pacer_lib (i.e., released as pacer_scraper_library), this was stored as a list and did not include the date and time of download. extract_download_meta() can also handle these detailed_info objects.

If there is no detailed_info, the function returns a dictionary with the key ‘Error_download_meta’.

Keyword Arguments

  • data: should be a string, read from a .html file.

Output Documentation Unless otherwise noted, all of these are collected from the PACER Case Locator results page. This is documented as key: description of value.

These terms are found in documents downloaded by any version of pacer_lib:

  • searched_case_no: the case number that was passed to pacer_lib.scraper(); this is recorded to ensure reproducibility and comes from pacer_lib. This is not found on the PACER Case Locator results page.
  • court_id: the abbreviation for the court the case was located in
  • case_name: the name of the case, as recorded by PACER
  • nos: a code for “Nature of Suit”
  • date_filed: the date the case was filed, as recorded by PACER
  • date_closed: the date the case was closed, as recorded by PACER
  • link: a link to the docket

These are only in documents downloaded with newer versions of pacer_lib:

  • downloaded: string that describes the time the docket was downloaded by pacer_lib. This is not found on the PACER Case Locator results page. (Format: yyyy-mm-dd,hh:mm:ss)
  • listed_case_no: string that describes the preferred PACER case number for this case (as opposed to the query we submitted)
  • result_no: the position of the case on the PACER Case Locator results page.
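
A short sketch of pulling the download metadata from a single docket; the filename is hypothetical:

    from pacer_lib.reader import docket_parser

    parser = docket_parser()

    with open('./results/local_docket_archive/example_docket.html') as f:
        data = f.read()

    download_meta = parser.extract_download_meta(data)
    if 'Error_download_meta' in download_meta:
        print('No detailed_info object was found in this docket.')
    else:
        print(download_meta['court_id'], download_meta['case_name'])
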
extract_lawyer_meta(data)[source]

Returns a dictionary of information about the plaintiff, defendant and their lawyers extracted from an .html docket (passed as a string literal through data).

At the moment, extract_lawyer_meta() only handles the most common listing (i.e., one listing for the plaintiff and one listing for the defendant). If there is more than one set of plaintiffs or defendants (e.g., in a class-action suit), the function will return a dictionary with a single key, Error_lawyer_meta. This function will not handle movants and will probably not handle class-action cases. A usage sketch follows the output documentation below.

In dockets downloaded from older versions of pacer_lib (e.g., pacer_scraper_library), lawyer information was not requested so the dockets will not contain any lawyer_meta to be extracted.

Output Documentation This is documented as key: description of value.

  • plaintiffs: list of the names of plaintiffs
  • defendants: list of the names of defendants
  • plaintiffs_attorneys: list of the names of attorneys representing the plaintiffs
  • defendants_attorneys: list of the names of attorneys representing the defendants
  • plaintiffs_attorneys_details: string that contains the cleaned output of all plaintiff lawyer data (e.g., firm, address, email, etc.) that can be further cleaned in the future.
  • defendants_attorneys_details: string that contains the cleaned output of all defendant lawyer data (e.g., firm, address, email, etc.) that can be further cleaned in the future.
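
A minimal sketch of extracting lawyer information, including the error case; the filename is hypothetical:

    from pacer_lib.reader import docket_parser

    parser = docket_parser()

    with open('./results/local_docket_archive/example_docket.html') as f:
        data = f.read()

    lawyer_meta = parser.extract_lawyer_meta(data)
    if 'Error_lawyer_meta' in lawyer_meta:
        print('Multiple plaintiff/defendant listings; not handled.')
    else:
        print(lawyer_meta['plaintiffs'])
        print(lawyer_meta['plaintiffs_attorneys'])
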
parse_data(data)[source]

Returns a list of all of the docket entries in data, which should be a string literal read from an .html docket file. BeautifulSoup is used to parse the docket into a list of docket entries; each docket entry is also a list. A sketch follows the output documentation below.

This uses html.parser and, in the case of failure, switches to html5lib.

If it cannot find the table or entries, it will return a string as an error message.

Keyword Arguments

  • data: should be a string, read from a .html file.

Output Documentation

  1. date_filed
  2. document_number
  3. docket_description
  4. link_exist (this is a dummy to indicate the existence of a link)
  5. document_link
  6. unique_id (document_number does not uniquely identify the docket entry, so we create a separate unique identifier based on the entry's placement in the .html docket sheet)
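
A short sketch of parsing a single docket into entries, assuming each entry is a six-element list in the documented order and that errors are returned as a plain string; the filename is hypothetical:

    from pacer_lib.reader import docket_parser

    parser = docket_parser()

    with open('./results/local_docket_archive/example_docket.html') as f:
        data = f.read()

    entries = parser.parse_data(data)
    if isinstance(entries, str):
        print('Parse error:', entries)
    else:
        for date_filed, doc_no, description, link_exist, link, uid in entries:
            print(date_filed, doc_no, description[:40])
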
parse_dir(overwrite=True, get_meta=True)[source]

Runs parse_data() and extract_all_meta() on each file in the docket_path folder and writes the output to the output_path. A sketch follows the file documentation below.

Output Documentation This function returns nothing.

File documentation The docket entries of each docket are stored as a .csv in a folder ‘processed_dockets’. The filename of the csv indicates the source docket and the columns represent (in order):

  1. date_filed
  2. document_number
  3. docket_description
  4. link_exist (this is a dummy to indicate the existence of a link)
  5. document_link
  6. unique_id (document_number does not uniquely identify the docket entry, so we create a separate unique identifier based on the entry's placement in the .html docket sheet)

The download meta and the case and lawyer meta information of each docket are stored as JSON objects in the sub-folders ‘processed_dockets_meta/download_meta/’ and ‘processed_dockets_meta/case_meta/’ within the output path. The filenames indicate the source docket and are prefixed by download_meta_ and case_meta_, respectively.
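
A minimal sketch of batch-processing an entire folder of dockets; the paths are hypothetical:

    from pacer_lib.reader import docket_parser

    parser = docket_parser(docket_path='./results/local_docket_archive',
                           output_path='./results')

    # Parse every .html docket in docket_path, writing the entry .csv files
    # and the JSON metadata files into the output folders described above.
    parser.parse_dir(overwrite=True, get_meta=True)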

class pacer_lib.reader.docket_processor(processed_path='./results/parsed_dockets', output_path='./results/')[source]

Returns a docket_processor() object that allows for keyword and boolean searching of docket entries from dockets specified in processed_path. docket_processor relies on the use of docket_parser to parse .html PACER dockets into structured .csv files, although it is theoretically possible (but quite tedious) to independently bring dockets into compliance for use with docket_processor.

This gives you a set of documents (and their associated links) that can be passed to pacer_lib.scraper() for download.

The object then outputs a docket-level or consolidated .csv that describes all documents that meet the search criteria (stored in hit_list).

Keyword Arguments

  • processed_path: points to the folder containing the .csv docket files
  • output_path: points to the folder where you would like output to be stored. Note that the output will actually be stored in a subfolder of the output_path called /docket_hits/. If the folders do not exist, they will be created.
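
A minimal sketch of constructing a docket_processor over dockets already parsed by docket_parser; the paths are hypothetical:

    from pacer_lib.reader import docket_processor

    processor = docket_processor(processed_path='./results/parsed_dockets',
                                 output_path='./results/')
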
search_dir(require_term=[], exclude_term=[], case_sensitive=False, within=0)[source]

Runs search_docket() on each docket in self.processed_path and adds hits to self.hit_list as a key-value pair case_number : [docket entries], where case_number is taken from the filename and [docket entries] is a list of docket entries (which are also lists) that meet the search criteria.

The search criteria are specified by require_term, exclude_term, case_sensitive and within, such that:

  • if within != 0, all searches are constrained to the first x characters of the text, where x = within
  • all strings in the list require_term are found in the text (or the first x characters, if within is used)
  • and, no strings in the list exclude_term are found in the text (or the first x characters, if within is used)
  • if case_sensitive=True, then the search is case sensitive

Returns nothing.
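
A short sketch of searching every parsed docket for entries that mention a motion for summary judgment but not a denial; the terms and paths are purely illustrative:

    from pacer_lib.reader import docket_processor

    processor = docket_processor(processed_path='./results/parsed_dockets',
                                 output_path='./results/')

    # Require 'summary judgment', exclude 'denied', and only look at the
    # first 200 characters of each docket entry's text.
    processor.search_dir(require_term=['summary judgment'],
                         exclude_term=['denied'],
                         case_sensitive=False,
                         within=200)

    # Hits accumulate in processor.hit_list, keyed by case number.
    print(len(processor.hit_list))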

search_docket(docket, require_term=[], exclude_term=[], case_sensitive=False, within=0)[source]

Returns a list of docket entries that match the search criteria. Docket entries are lists that should have the same structure as described in docket_parser, i.e., in order:

  1. date_filed
  2. document_number
  3. docket_description
  4. link_exist (this is a dummy to indicate the existence of a link)
  5. document_link
  6. unique_id (document_number does not uniquely identify the docket entry, so we create a separate unique identifier based on the entry's placement in the .html docket sheet)

The docket is specified by the argument docket and searched for in the self.processed_path folder.

The search criteria are specified by require_term, exclude_term, case_sensitive and within, such that:

  • if within != 0, all searches are constrained to the first x characters of the text, where x = within
  • all strings in the list require_term are found in the text (or the first x characters, if within is used)
  • and, no strings in the list exclude_term are found in the text (or the first x characters, if within is used)
  • if case_sensitive=True, then the search is case sensitive
search_text(text, require_term=[], exclude_term=[], case_sensitive=False)[source]

Returns a boolean indicating whether all criteria are satisfied in text. The criteria are determined in this way:

  • all strings in the list require_term are found in text
  • and, no strings in the list exclude_term are found in text

If you pass a string instead of a list to either require_term or exclude_term, search_text() will convert it to a list.

This search is, by default, case-insensitive, but you can turn on case-sensitive search through case_sensitive.
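
A minimal sketch of the underlying boolean text check; the entry text is made up, and the processor is constructed with its defaults:

    from pacer_lib.reader import docket_processor

    processor = docket_processor()

    text = 'MOTION for Summary Judgment filed by Defendant.'

    # True: 'summary judgment' is present and 'denied' is absent
    # (case-insensitive by default).
    match = processor.search_text(text,
                                  require_term=['summary judgment'],
                                  exclude_term=['denied'],
                                  case_sensitive=False)
    print(match)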

write_all_matches(suffix, overwrite_flag=False)[source]

Writes all of the matches found in the self.hit_list dictionary to a single .csv file (all_match__[suffix].csv) in the self.output_path. The columns of the .csv are (in order):

  1. case_number (as defined by the source .csv)
  2. date_filed
  3. document_number
  4. docket_description
  5. link_exist (this is a dummy to indicate the existence of a link)
  6. document_link
  7. unique_id (document_number does not uniquely identify the docket entry, so we create a separate unique identifier based on the entry's placement in the .html docket sheet)

Overwriting an existing file is controlled by overwrite_flag.

You cannot use / \  % * : | " < > . _ in the suffix.

Returns nothing.
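
A short sketch of writing all hits from a previous search_dir() call to a single .csv; the suffix is arbitrary (subject to the character restrictions above) and the paths are hypothetical:

    from pacer_lib.reader import docket_processor

    processor = docket_processor(processed_path='./results/parsed_dockets',
                                 output_path='./results/')
    processor.search_dir(require_term=['summary judgment'])

    # Writes all_match__[suffix].csv; overwrite_flag=True replaces an
    # existing file with the same suffix.
    processor.write_all_matches('summary-judgment', overwrite_flag=True)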

write_individual_matches(suffix, overwrite_flag=False)[source]

Writes all of the matches in the self.hit_list dictionary to one .csv file per docket sheet (determined by the source .csv) in a folder named after the suffix. To distinguish them from the source .csv files, they are prefixed by a ^. They are also suffixed to allow for multiple searches of the same source .csv.

The suffix is required; if the same suffix is specified again, previous searches will be overwritten only when the overwrite flag is turned on (this deletes all of the old files in the suffix folder).

You cannot use / \  % * : | " < > . _ in the suffix.

Returns nothing.

class pacer_lib.reader.document_sorter(docket_path='./results/local_docket_archive', document_path='./results/local_document_archive', output_path='./results', searchable_criteria='court')[source]

Not implemented yet. Sorry.

convert_PDF_to_text(filename)[source]

Convert a file to text and save it in the text_output_path.

convert_all(overwrite=False)[source]

For each file in the document path, use convert_PDF_to_text() if it has not been converted before, and determine whether the file is searchable.

count()[source]

Count the file_index

export_file_index()[source]

Save the file_index to a file

flag_searchable()[source]

Flag files according to self.flags(). Move files to a folder (make this an option).

set_flag()[source]

Add a criterion to the flagging process.
