systematic_review package

systematic_review.analysis module

Module: analysis. This module contains code for generating information, diagrams, and tables. It can be used to generate systematic review flow and citation information.

class systematic_review.analysis.Annotate(figure_axes, start_coordinate, end_coordinate, arrow_style='<|-')[source]

Bases: object

This class makes it easier to draw arrows on a matplotlib.pyplot.axes figure.

add_arrow(text='')[source]

This draws the arrow on the matplotlib.pyplot.axes figure.

Parameters

text (str) – This takes the text to put on the arrow.

class systematic_review.analysis.CitationAnalysis(dataframe)[source]

Bases: object

This takes any pandas dataframe containing citation details and produces analyses on various columns.

authors_analysis(authors_column_name='authors')[source]

generates details based on the pandas dataframe column of article authors, e.g. number of authors, articles with single authors, articles per author, and authors per article.

Parameters

authors_column_name (str) – Name of column containing authors details.

Returns

contains number of authors, articles with single authors, articles per author, and authors per article

Return type

tuple
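A rough, self-contained sketch of these four statistics in plain pandas (the `authors_summary` helper and the `"; "` author separator are illustrative assumptions, not the package's API):

```python
import pandas as pd

def authors_summary(df, authors_column="authors", sep="; "):
    # Hypothetical helper approximating the four statistics above.
    author_lists = df[authors_column].str.split(sep)
    counts = author_lists.str.len()
    unique_authors = {name for names in author_lists for name in names}
    number_of_authors = len(unique_authors)
    number_of_articles = len(df)
    return (
        number_of_authors,                        # Number of authors
        int((counts == 1).sum()),                 # Articles with single authors
        number_of_articles / number_of_authors,   # Articles per author
        float(counts.mean()),                     # Authors per article
    )

df = pd.DataFrame({"authors": ["A; B", "A", "C; B; A"]})
print(authors_summary(df))  # (3, 1, 1.0, 2.0)
```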

authors_info()[source]

prints the authors analysis details in a readable format.

extract_keywords(column_name: str = 'keywords')[source]

returns a dataframe with a keywords column containing one keyword per row, as used in the articles.

Parameters

column_name (str) – column name of the keyword details in the citation dataframe

keyword_diagram(column_name: str = 'keywords', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a chart showing how often different keywords are used in the articles.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the keyword details in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

keywords_info(column_name: str = 'keywords')[source]

returns the keywords and the number of times they are used in the articles.

Parameters

column_name (str) – column name of the keyword details in the citation dataframe

publication_place_diagram(column_name: str = 'place_published', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a chart showing how many articles are published from different places or countries.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publication place detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publication_place_info(column_name: str = 'place_published')[source]

shows how many articles are published from different places or countries.

Parameters

column_name (str) – column name of publication place detail in citation dataframe

Returns

contains publication place and count of publications

Return type

object

publication_year_diagram(column_name: str = 'year', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates chart showing how many articles are published each year.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publication year detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publication_year_info(column_name: str = 'year')[source]

shows how many articles are published each year.

Parameters

column_name (str) – column name of publication year detail in citation dataframe

Returns

contains year and count of publications

Return type

object
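The count returned here corresponds to a pandas value_counts over the year column; a minimal illustration:

```python
import pandas as pd

# Rough equivalent of publication_year_info: count articles per publication year.
citations = pd.DataFrame({"year": [2020, 2021, 2020, 2019, 2020]})
year_counts = citations["year"].value_counts()
print(year_counts.to_dict())  # {2020: 3, 2021: 1, 2019: 1}
```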

publisher_diagram(column_name: str = 'publisher', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates chart showing how many articles are published by different publishers.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publisher detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publisher_info(column_name: str = 'publisher')[source]

shows how many articles are published by different publishers.

Parameters

column_name (str) – column name of publisher detail in citation dataframe.

Returns

contains publisher name and count of publications.

Return type

object

class systematic_review.analysis.SystematicReviewInfo(citations_files_parent_folder_path: Optional[str] = None, filter_sorted_citations_df: Optional[pandas.core.frame.DataFrame] = None, validated_research_papers_df: Optional[pandas.core.frame.DataFrame] = None, selected_research_papers_df: Optional[pandas.core.frame.DataFrame] = None)[source]

Bases: object

This analyses the whole systematic review process and takes all produced files to generate tables and figures.

download_flag_column_name = 'downloaded'
file_validated_flag_name = 'yes'
get_text_list() List[str][source]

This produces the list of all analyses done in this class.

Returns

This contains systematic review information in sentences.

Return type

List[str]

info()[source]

This takes the systematic review text list and prints it in proper order.

systematic_review_diagram(fig_width=10, fig_height=10, diagram_fname: Optional[str] = None, color: bool = True, color_info: bool = True, auto_fig_size: bool = True, hide_border: bool = True, **kwargs)[source]

This outputs the systematic review diagram resembling PRISMA guidelines.

Parameters
  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

  • hide_border (bool) – hides the border, i.e. the line outside of the diagram.

  • auto_fig_size (bool) – sets the figure size automatically based on the given data.

  • color (bool) – colors the inside of the diagram boxes; turn this off by passing False.

  • color_info (bool) – shows the meaning of the colors in the diagram.

  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • fig_width (float) – width of the figure in inches.

  • fig_height (float) – height of the figure in inches.

class systematic_review.analysis.TextInBox(figure_axes, x_coordinate, y_coordinate, text='')[source]

Bases: object

This is a matplotlib text-in-box class that makes it easier to use text boxes.

add_box(**kwargs: Union[dict, str, Any])[source]

It puts the box on the matplotlib.pyplot.axes figure.

Parameters

kwargs (Union[dict, str, Any]) – This takes any custom options to be set on the box.

systematic_review.analysis.analysis_of_multiple_ris_citations_files(citations_files_parent_folder_path: str) dict[source]

This function loads all RIS citation files from a folder and returns a dict mapping database names to the number of citations collected from each database.

Parameters

citations_files_parent_folder_path (str) – this is the path of the parent folder where the citation files exist.

Returns

this is a dict of database names and the number of records in the RIS files.

Return type

dict
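The per-file record counting can be sketched with the standard library; `count_ris_records` below is a hypothetical helper, not the package's function, and assumes each record ends with an ‘ER’ tag, as the RIS format specifies:

```python
import re

def count_ris_records(ris_text):
    # Count citations by counting end-of-record tags; the RIS format
    # terminates every record with an "ER  - " line.
    return sum(1 for line in ris_text.splitlines() if re.match(r"^ER\s+-", line))

sample = (
    "TY  - JOUR\nTI  - First article\nER  - \n"
    "TY  - JOUR\nTI  - Second article\nER  - \n"
)
print(count_ris_records(sample))  # 2
```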

systematic_review.analysis.creating_sample_review_file(selected_citation_df)[source]

This function outputs a dataframe with additional columns to make literature review easier.

Parameters

selected_citation_df (pandas.DataFrame object) – This dataframe is the result of the last step of systematic-reviewpy. It contains records for manual literature review.

Returns

This is a dataframe with additional columns to help in adding details of the literature review.

Return type

pandas.DataFrame object

systematic_review.analysis.custom_box(**kwargs) dict[source]

This is the option for matplotlib text in box.

Parameters

kwargs (dict) – Contains key word arguments

Returns

contains options

Return type

dict

systematic_review.analysis.duplicate_count(dataframe: pandas.core.frame.DataFrame) int[source]

returns the count of the duplicate articles.

Parameters

dataframe (pd.DataFrame) – Input pandas dataframe where we want to check numbers of duplicates.

Returns

number of duplicates records.

Return type

int
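This count corresponds to pandas' duplicated(); a minimal sketch (`count_duplicates` is an illustrative stand-in, not the package's implementation):

```python
import pandas as pd

def count_duplicates(dataframe):
    # Rows flagged True by duplicated() repeat an earlier row exactly.
    return int(dataframe.duplicated().sum())

citations = pd.DataFrame({"title": ["a", "b", "a", "a"],
                          "year":  [2000, 2001, 2000, 2002]})
print(count_duplicates(citations))  # 1 (row 2 repeats row 0)
```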

systematic_review.analysis.missed_article_count(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title')[source]

returns the count of articles missed during downloading by checking the original list of articles from filter_sorted_citations_df against the downloaded articles path.

Parameters
  • title_column_name (str) – name of the column which contains the article name.

  • filter_sorted_citations_df (pd.DataFrame) – this dataframe contains records of selected articles, including article names.

  • downloaded_articles_path (str) – parent folder of all the downloaded article files.

Returns

count of the missed articles from downloading.

Return type

int

systematic_review.analysis.pandas_countplot_with_pandas_dataframe_column(dataframe, column_name, top_result, plot_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a pandas count chart using a dataframe column.

Parameters
  • dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.

  • column_name (str) – name of the pandas column whose elements are to be counted.

  • top_result (int) – limits the number of unique column elements to be shown

  • plot_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

systematic_review.analysis.seaborn_countplot_with_pandas_dataframe_column(dataframe, column_name, theme_style='darkgrid', xaxis_label_rotation=90, top_result=None, diagram_fname: Optional[str] = None, **kwargs)[source]

generates a seaborn count bar chart using a dataframe column.

Parameters
  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.

  • column_name (str) – name of the pandas column whose elements are to be counted.

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

Returns

show the bar chart

Return type

object

systematic_review.analysis.text_padding_for_visualise(text: str, front_padding_space_multiple: int = 4, top_bottom_line_padding_multiple: int = 1)[source]

This adds the required space on all four sides of the text for a better look.

Parameters
  • text (str) – this is the input text.

  • front_padding_space_multiple (int) – multiplies the spaces on the left and right sides for increased padding.

  • top_bottom_line_padding_multiple (int) – multiplies the blank lines on the top and bottom for increased padding.

Returns

str - text with spaces on all four sides; int - height, i.e. the number of lines; int - width, i.e. the number of characters in the longest line.

Return type

tuple
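A hypothetical re-implementation illustrating the return tuple; the exact spacing rules of the library are an assumption here:

```python
def pad_text(text, front_padding_space_multiple=4, top_bottom_line_padding_multiple=1):
    # Illustrative sketch: pad left/right with spaces and top/bottom with
    # blank lines, returning (padded_text, height_in_lines, width_in_chars).
    side = front_padding_space_multiple
    rows = text.splitlines() or [""]
    inner_width = max(len(row) for row in rows)
    width = inner_width + 2 * side
    padded_rows = [" " * side + row.ljust(inner_width) + " " * side for row in rows]
    blank_rows = [" " * width] * top_bottom_line_padding_multiple
    all_rows = blank_rows + padded_rows + blank_rows
    return "\n".join(all_rows), len(all_rows), width

padded, height, width = pad_text("hi")
print(height, width)  # 3 10
```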

systematic_review.analysis.vertical_dict_view(dictionary: dict) str[source]

converts a dict to a string with each element on a new line.

Parameters

dictionary (dict) – Contains key and value which we want to print vertically.

Returns

This prints key1 : value1 and key2 : value2 … in vertical format

Return type

str
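A minimal sketch of this conversion:

```python
def vertical_dict_view(dictionary):
    # One "key : value" pair per line.
    return "\n".join(f"{key} : {value}" for key, value in dictionary.items())

print(vertical_dict_view({"included": 120, "excluded": 45}))
# included : 120
# excluded : 45
```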

systematic_review.citation module

Module: citation. This module contains functions which change the format of citations or get details from them. It also includes functions to fix some typos.

class systematic_review.citation.Citations(citations_files_parent_folder_path, title_column_name: str = 'title', text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words')[source]

Bases: object

create_citations_dataframe() pandas.core.frame.DataFrame[source]

Executes the citation step. This function loads all the citations from the path, adds the columns required for the next steps, and removes duplicates.

Returns

DataFrame with additional columns needed for the next steps of the systematic review, with duplicates removed

Return type

pandas.DataFrame object

get_dataframe()[source]

executes the create_citations_dataframe function and outputs the pd.DataFrame.

Returns

outputs the citations data.

Return type

pd.DataFrame

get_records_list() List[Dict[str, Any]][source]

Executes the citation step. This function loads all the citations from the path, adds the columns required for the next steps, and removes duplicates.

Returns

list with additional columns needed for the next steps of the systematic review, with duplicates removed

Return type

List[Dict[str, Any]]

to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – name of the output file, which should contain the .csv extension

  • index (bool) – defines whether an index is needed in the output csv file.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – name of the output file, which should contain the .xlsx extension

  • index (bool) – defines whether an index is needed in the output excel file.

systematic_review.citation.add_citation_text_column(dataframe_object: pandas.core.frame.DataFrame, title_column_name: str = 'title', abstract_column_name: str = 'abstract', keyword_column_name: str = 'keywords') pandas.core.frame.DataFrame[source]

This takes a dataframe of citations and returns the full text comprising “title”, “abstract”, and “keywords”.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • title_column_name (str) – name of the column which contains the citation title

  • abstract_column_name (str) – name of the column which contains the citation abstract

  • keyword_column_name (str) – name of the column which contains the citation keywords

Returns

this is the dataframe_object with an additional full-text column.

Return type

pd.DataFrame

systematic_review.citation.add_multiple_sources_column(citation_dataframe: pandas.core.frame.DataFrame, group_by: list = ['title', 'year']) pandas.core.frame.DataFrame[source]

This function checks whether a citation or article title is available from more than one source and adds a column named ‘multiple_sources’ to the dataframe with a list of the source names.

Parameters
  • citation_dataframe (pandas.DataFrame object) – input dataset which contains citations or article titles available from more than one source.

  • group_by (list) – column label or sequence of labels, optional. Only consider certain columns for identifying citations or article titles with more than one source; by default use all of the columns.

Returns

DataFrame with additional column with list of sources names

Return type

pandas.DataFrame object

systematic_review.citation.citations_to_ris_converter(input_file_path: str, output_filename: str = 'output_ris_file.ris', input_file_type: str = 'read_csv') None[source]

This asks for the citation column names from the tabular data and then converts the data to RIS format.

Parameters
  • input_file_path (str) – this is the path of the input file

  • output_filename (str) – this is the name of the output RIS file with extension; an output file path is also a valid choice.

  • input_file_type (str) – this function defaults to csv, but other formats are also supported by passing ‘read_{file_type}’, e.g. input_file_type = ‘read_excel’. All file types supported by pandas can be used by passing pandas IO tools methods. For more info visit https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Returns

Return type

None
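The conversion can be sketched as below; `records_to_ris` and `tag_map` are illustrative assumptions standing in for the column-name prompts the real function uses:

```python
def records_to_ris(records, tag_map):
    # Map each citation dict to a block of "TAG  - value" lines,
    # terminated by the RIS end-of-record tag "ER".
    lines = []
    for record in records:
        lines.append("TY  - JOUR")
        for column, tag in tag_map.items():
            if record.get(column):
                lines.append(f"{tag}  - {record[column]}")
        lines.append("ER  - ")
    return "\n".join(lines)

records = [{"title": "First study", "year": 2020}]
print(records_to_ris(records, {"title": "TI", "year": "PY"}))
```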

systematic_review.citation.drop_columns_based_on_column_name_list(dataframe: pandas.core.frame.DataFrame, column_name_list: list) pandas.core.frame.DataFrame[source]

This function drops columns based on the column names in the list.

Parameters
  • dataframe (pandas.DataFrame object) – This dataframe contains columns which we want to drop or remove.

  • column_name_list (list) – This is the name of dataframe columns to be removed

Returns

DataFrame with columns mentioned in column_name_list removed.

Return type

pandas.DataFrame object

systematic_review.citation.drop_duplicates_citations(citation_dataframe: pandas.core.frame.DataFrame, subset: list = ['title', 'year'], keep: Literal['first', 'last', False] = 'first', index_reset: bool = True) pandas.core.frame.DataFrame[source]

Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters
  • index_reset (bool) – if True, reset the index of the returned DataFrame.

  • citation_dataframe (pandas.DataFrame object) – Input dataset which contains duplicate rows

  • subset (list) – column label or sequence of labels, optional Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep (str) – options includes {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates.

Returns

DataFrame with duplicates removed

Return type

pandas.DataFrame object
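A sketch of this behaviour in terms of pandas drop_duplicates; the wrapper below is illustrative, not the package's implementation:

```python
import pandas as pd

def drop_duplicate_citations(df, subset=("title", "year"), keep="first", index_reset=True):
    # Thin wrapper over pandas drop_duplicates, mirroring the parameters above.
    result = df.drop_duplicates(subset=list(subset), keep=keep)
    return result.reset_index(drop=True) if index_reset else result

citations = pd.DataFrame({
    "title": ["A", "A", "B"],
    "year": [2020, 2020, 2021],
    "source": ["scopus", "wos", "scopus"],
})
deduped = drop_duplicate_citations(citations)
print(len(deduped), list(deduped.index))  # 2 [0, 1]
```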

systematic_review.citation.drop_search_words_count_columns(dataframe, search_words_object: systematic_review.search_count.SearchWords) pandas.core.frame.DataFrame[source]

removes columns created based on the keywords.

Parameters
  • dataframe (pandas.DataFrame object) – This dataframe contains keywords columns which we want to drop or remove.

  • search_words_object (SearchWords) – this object comprises unique keywords in each keyword group; a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}

Returns

DataFrame with keywords columns removed.

Return type

pandas.DataFrame object

systematic_review.citation.edit_ris_citation_paste_values_after_regex_pattern(input_file_path: str, output_filename: str = 'output_file.ris', edit_line_regex: str = '^DO ', paste_value: str = 'ER  - ') None[source]

This is created to edit RIS files which don’t specify ER for ‘end of citation’; it pastes ER after the end point of each citation. ‘DO’ can be replaced with other RIS tags such as TY, JO, etc.

Parameters
  • input_file_path (str) – this is the path of input file

  • output_filename (str) – this is the name of the output ris file with extension.

  • edit_line_regex (str) – this is the regex to find RIS tag lines such as DO, TY, JO, etc.

  • paste_value (str) – this is the value to be pasted; most helpful is the ER RIS tag, which signifies the end of a citation.

Returns

Return type

None
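The edit can be sketched as a line-by-line regex pass; `paste_after_matching_lines` is a hypothetical in-memory stand-in for the file-based function above:

```python
import re

def paste_after_matching_lines(text, edit_line_regex=r"^DO ", paste_value="ER  - "):
    # Append paste_value on a new line after every line matching the regex,
    # as the RIS-repair behaviour described above does.
    pattern = re.compile(edit_line_regex)
    out = []
    for line in text.splitlines():
        out.append(line)
        if pattern.match(line):
            out.append(paste_value)
    return "\n".join(out)

ris = "TY  - JOUR\nDO  - 10.1000/x\nTY  - JOUR\nDO  - 10.1000/y"
fixed = paste_after_matching_lines(ris)
print(fixed.count("ER  - "))  # 2
```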

systematic_review.citation.get_details_of_all_article_name_from_citations(filtered_list_of_dict: list, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') list[source]

This function searches source names, doi, and url for all articles in filtered_list_of_dict.

Parameters
  • filtered_list_of_dict (list) – This is the list of article citations dict after filtering it using min_limit on grouped_keywords_count

  • sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}

  • doi_url (bool) – This signifies whether we want to get the value of url and doi from the citation

  • title_column_name (str) – This is the name of column which contain citation title

Returns

This list contains all article names with source names. (optional url and doi)

Return type

list

systematic_review.citation.get_details_via_article_name_from_citations(article_name: str, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') dict[source]

Iterates through citations to find article_name and put source_name in a column, with doi and url being optional

Parameters
  • article_name (str) – This is the primary title of the citation or name of the article.

  • sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}

  • doi_url (bool) – This signifies whether we want to get the value of url and doi from the citation

  • title_column_name (str) – This is the name of column which contain citation title

Returns

This dict contains the article_name, source_name and optional url and doi

Return type

dict

systematic_review.citation.get_missed_articles_source_names(missed_articles_list: list, all_articles_title_source_name_list_of_dict: list, article_column_name: str = 'article_name', source_column_name: str = 'source_name') list[source]
Parameters
  • missed_articles_list (list) – This contains the list of articles that got missed while downloading.

  • all_articles_title_source_name_list_of_dict (list) – This list contains all article names with source names. (optional url and doi)

  • article_column_name (str) – This is the name of article column in the all_articles_title_source_name_list_of_dict.

  • source_column_name (str) – This is the name of source column in the all_articles_title_source_name_list_of_dict.

Returns

This list contains articles_name and sources name.

Return type

list

systematic_review.converter module

Module: converter. This module contains functions related to file and data type conversion, such as list to txt file, pandas df to list of dicts, and many more.

class systematic_review.converter.ASReview(data: Union[List[dict], pandas.core.frame.DataFrame])[source]

Bases: object

get_file(output_filename: str = 'output.csv', index: bool = True)[source]

Outputs the file needed to start a project in ASReview.

Parameters
  • output_filename (str) – name or path of your needed file.

  • index (bool) – whether an index column is needed in the output file.

class systematic_review.converter.Reader(file_path: str)[source]

Bases: object

Contains functionality to read files.

get_text(pages: str = 'all')[source]

It understands the type of file and outputs the content of the file.

Parameters

pages (str) – contain option to read ‘first’ or ‘all’ pages.

Returns

This is the text content of the readable file.

Return type

str

pandas_reader(input_file_type)[source]

Read file using pandas IO https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Parameters

input_file_type (str) – check pandas IO for examples like read_csv, read_excel etc.

Returns

This is the required text from pandas IO.

Return type

str

pdf_pdftotext_reader(pages: str = 'all')[source]

Extract the text from a pdf file via pdftotext. For more info, visit: https://pypi.org/project/pdftotext/

Parameters

pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

pdf_pymupdf_reader(pages: str = 'all')[source]

Extract the text from a pdf file via fitz (PyMuPDF). For more info, visit: https://pypi.org/project/PyMuPDF/

Parameters

pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.add_preprocess_column(dataframe_object: pandas.core.frame.DataFrame, column_name: str = 'title')[source]

Takes dataframe and column name to apply preprocess function from string_manipulation module.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is the object containing the column which needs to be preprocessed.

  • column_name (str) – This is the name of the column of dataframe.

Returns

DataFrame with an additional preprocessed column.

Return type

pandas.DataFrame object

systematic_review.converter.apply_custom_function_on_dataframe_column(dataframe: pandas.core.frame.DataFrame, column_name: str, custom_function, new_column_name: Optional[str] = None, *args, **kwargs) pandas.core.frame.DataFrame[source]

This applies a custom function to every element of a dataframe column.

Parameters
  • new_column_name (str) – the new name for your modified column; the new column will be added to the dataframe without modifying the original column.

  • dataframe (pd.DataFrame) – the pandas dataframe containing the named column, whose elements can be transformed with the custom function.

  • column_name (str) – name of the dataframe column whose elements are to be transformed

  • custom_function – the custom function to be applied to each element of the pandas column.

Returns

This is transformed dataframe.

Return type

pd.DataFrame

systematic_review.converter.dataframe_column_counts(dataframe, column_name)[source]

Equivalent to pandas value_counts(); it returns the count of each unique element in the column

Parameters
  • dataframe (pd.DataFrame) – dataframe which contains column that is to be counted

  • column_name (str) – Name of pandas column elements are supposed to be counted.

Returns

unique column elements with counts

Return type

object

systematic_review.converter.dataframe_to_csv_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • output_filename (str) – name of the output file, which should contain the .csv extension

  • index (bool) – defines whether an index is needed in the output csv file.

systematic_review.converter.dataframe_to_excel_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • output_filename (str) – name of the output file, which should contain the .xlsx extension

  • index (bool) – defines whether an index is needed in the output excel file.

systematic_review.converter.dataframe_to_records_list(dataframe: pandas.core.frame.DataFrame) List[Dict[str, Any]][source]

converts pandas dataframe to the list of dictionaries (records).

Parameters

dataframe (pd.DataFrame) – this is the pandas dataframe whose rows are to be converted into dictionaries.

Returns

This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is first title”}, {‘primary_title’ : “this is second title”}, {‘primary_title’ : “this is third title”}]

Return type

List[Dict[str, Any]]
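This conversion matches pandas' records orientation; a one-line equivalent for illustration:

```python
import pandas as pd

df = pd.DataFrame({"primary_title": ["this is first title", "this is second title"]})
records = df.to_dict(orient="records")  # one dict per row
print(records)
# [{'primary_title': 'this is first title'}, {'primary_title': 'this is second title'}]
```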

systematic_review.converter.dict_key_value_to_records(dictionary: dict, key_column_name: str, value_column_name: str)[source]

converts {key: value, key1: value1, …} to records = [{key_column_name: key, value_column_name: value}, …], which can then be converted to a pd.DataFrame

Parameters
  • dictionary (dict) – hash map or dictionary that contains key and value pairs.

  • key_column_name (str) – name of the records column for the keys

  • value_column_name (str) – name of the records column for the values

Returns

This list is in records format.

Return type

list
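A plausible re-implementation, for illustration only:

```python
def dict_key_value_to_records(dictionary, key_column_name, value_column_name):
    # One record per key/value pair, ready for pd.DataFrame(records).
    return [
        {key_column_name: key, value_column_name: value}
        for key, value in dictionary.items()
    ]

records = dict_key_value_to_records({"2020": 12, "2021": 30}, "year", "count")
print(records)  # [{'year': '2020', 'count': 12}, {'year': '2021', 'count': 30}]
```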

systematic_review.converter.dict_values_data_type(dictionary)[source]

This provides the data types of the dictionary values by outputting a dictionary.

Parameters

dictionary (dict) – This is the dictionary which contains different types of object in values. Example - {“first”: [2, 5], “sec”: 3}

Returns

This will output {“<class ‘list’>”: [“first”], “<class ‘int’>”: [“sec”]}

Return type

dict
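A sketch matching the documented example; this is illustrative, not the package's source:

```python
def dict_values_data_type(dictionary):
    # Group the dictionary's keys by the type of their values.
    result = {}
    for key, value in dictionary.items():
        result.setdefault(str(type(value)), []).append(key)
    return result

print(dict_values_data_type({"first": [2, 5], "sec": 3}))
# {"<class 'list'>": ['first'], "<class 'int'>": ['sec']}
```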

systematic_review.converter.extract_pandas_df_column1_row_values_based_on_column2_value(pandas_dataframe, column2_value, column2name='source_name', column1name='article_name')[source]

extracts the row values of pandas dataframe column1 based on the value of column2

Parameters
  • pandas_dataframe (pd.DataFrame) – This is the pandas dataframe containing at least two columns with values.

  • column2_value (object) – This should be str in normal cases but can be any object type supported in pandas for column value.

  • column2name (str) – This is the name of the column by which we are extracting the column1 values.

  • column1name (str) – This is the name of the column whose values we require.

Returns

This is the list of the resultant values from column1 rows.

Return type

list

systematic_review.converter.get_pdf_object_from_pdf_path(pdf_file_path: str)[source]

Extracts a pdf object from the pdf file; looping and indexing over it yields the text per page.

Parameters

pdf_file_path (str) – This is the path of pdf file.

Returns

Return type

This is the pdf object with extracted text.

systematic_review.converter.get_text_from_multiple_pdf_reader(pdf_file_path: str, pages: str = 'all') Union[str, bool][source]

This function gets text from pdf files using pdftotext; if that fails, the text comes from pymupdf.

Parameters
  • pdf_file_path (str) – This is the path of pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf(pdf_file_path: str, pages: str = 'all', pdf_reader: str = 'pdftotext') Union[str, bool][source]

This function gets text from pdf files using either pdftotext or pymupdf.

Parameters
  • pdf_reader (str) – This is the python pdf reader package which converts pdf to text, either ‘pdftotext’ or ‘pymupdf’.

  • pdf_file_path (str) – This is the path of pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf_pdftotext(pdf_file_path: str, pages: str = 'all') str[source]

Extract the text from the pdf file via pdftotext. For more info, visit: https://pypi.org/project/pdftotext/

Parameters
  • pdf_file_path (str) – This is the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf_pymupdf(pdf_file_path: str, pages: str = 'all') str[source]

Extract the text from the pdf file via fitz (PyMuPDF). For more info, visit: https://pypi.org/project/PyMuPDF/

Parameters
  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • pdf_file_path (str) – This is the path of pdf file.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.json_file_to_dict(json_file_path: str) dict[source]

Read the json file from the path given. Convert json file data to the python dictionary.

Parameters

json_file_path (str) – This is the path of the json file to be converted.

Returns

This is the data in dict format converted from json file.

Return type

dict
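
A minimal sketch of the same round trip with the standard library json module (the file content here is hypothetical):

```python
import json
import tempfile

# Write a small json file, then read it back into a python dict.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"first": [2, 5], "sec": 3}, f)
    json_file_path = f.name

with open(json_file_path) as f:
    data = json.load(f)
# data == {'first': [2, 5], 'sec': 3}
```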

systematic_review.converter.list_to_string(list_name)[source]

This converts a list to a text string, putting each element on a new line.

Parameters

list_name (list) – This is the python data structure list which contains some data.

Returns

This is the text string comprises of all data of list.

Return type

str

systematic_review.converter.list_to_text_file(filename: str, list_name: str, permission: str = 'w')[source]

This converts a list to a text file, putting each element on a new line.

Parameters
  • filename (str) – This is the name to be given for text file.

  • list_name (list) – This is the python data structure list which contains some data.

  • permission (str) – This is the file access mode, e.g. ‘w’ for write. For more info, see python’s built-in open().

Returns

Return type

None

systematic_review.converter.load_multiple_ris_citations_files(citations_files_parent_folder_path: str) List[dict][source]

This function loads all ris citations files from a folder.

Parameters

citations_files_parent_folder_path (str) – This is the path of the parent folder where the citations files exist.

Returns

This is the list of citation dicts inclusive of all citation files.

Return type

List[dict]

systematic_review.converter.load_multiple_ris_citations_files_to_dataframe(citations_files_parent_folder_path: str) pandas.core.frame.DataFrame[source]

This function loads all ris citations files from a folder.

Parameters

citations_files_parent_folder_path (str) – This is the path of the parent folder where the citations files exist.

Returns

This is the dataframe of citations inclusive of all citation files.

Return type

pd.DataFrame

systematic_review.converter.load_text_file(file_path: str, permission: str = 'r')[source]

This reads a text file and returns a file object over all its lines. For more info visit- https://docs.python.org/3/tutorial/inputoutput.html

Parameters
  • file_path (str) – This is the path or name of text file.

  • permission (str) – This is the file access mode, e.g. ‘r’ for read.

Returns

This contains all lines loaded.

Return type

file object

systematic_review.converter.records_list_to_dataframe(list_of_dicts: List[Dict[str, Any]]) pandas.core.frame.DataFrame[source]

Converts the list of dictionaries to a pandas dataframe.

Parameters

list_of_dicts (List[Dict[str, Any]]) – This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is the title”}]

Returns

This is the pandas dataframe consisting of all the data from the dictionaries, converted into respective rows.

Return type

pd.DataFrame
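
The conversion can be sketched with plain pandas, which turns each record dict into one row (the example data is hypothetical):

```python
import pandas as pd

list_of_dicts = [
    {"primary_title": "this is the title", "year": 2020},
    {"primary_title": "another title", "year": 2021},
]

# Each dict becomes a row; the keys become column names.
dataframe = pd.DataFrame(list_of_dicts)
```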

systematic_review.converter.remove_empty_lines(input_file_path: str, output_filename: str = 'output_file.ris') None[source]

This function removes the blank lines from the input file and outputs a new file.

Parameters
  • input_file_path (str) – This is the path of the input file.

  • output_filename (str) – This is the name of the output ris file, with extension.

Returns

Return type

None
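
A minimal sketch of the blank-line removal, assuming a small hypothetical .ris input:

```python
import tempfile

# Hypothetical ris content with blank lines between tags.
with tempfile.NamedTemporaryFile("w", suffix=".ris", delete=False) as src:
    src.write("TY  - JOUR\n\nTI  - Example\n\nER  -\n")
    input_file_path = src.name

output_filename = "output_file.ris"
with open(input_file_path) as infile, open(output_filename, "w") as outfile:
    for line in infile:
        if line.strip():  # keep only non-blank lines
            outfile.write(line)

with open(output_filename) as f:
    cleaned = f.read()
# cleaned == 'TY  - JOUR\nTI  - Example\nER  -\n'
```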

systematic_review.converter.ris_file_to_pandas_dataframe(ris_file_path: str) pandas.core.frame.DataFrame[source]

This uses ‘rispy’ to read the ris file into a list of dicts, then converts that list of dicts to a pandas.DataFrame.

Parameters

ris_file_path (str) – This is the path of ris citations file

Returns

dataframe object from pandas

Return type

pd.DataFrame

systematic_review.converter.ris_file_to_records_list(ris_file_path: str) List[Dict[str, Any]][source]

Converts a .ris file to a list of dictionaries of citations using rispy(https://pypi.org/project/rispy/). For more info on the ris format, visit: https://en.wikipedia.org/wiki/RIS_(file_format)

Parameters

ris_file_path (str) – This is the filepath of the ris file.

Returns

This list contains dictionaries of citations in records format, same as in pandas.

Return type

List[Dict[str, Any]]

systematic_review.converter.text_file_to_list(file_path: str, permission: str = 'r')[source]

This converts a text file to a list, with each line as a single element; get the first line of the text file with list[0].

Parameters
  • file_path (str) – This is the path or name of the text file.

  • permission (str) – This is the file access mode, e.g. ‘r’ for read.

Returns

This contains all lines loaded into list with one line per list element. [first line, second line,…. ]

Return type

list
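
The line-per-element behaviour can be sketched with the standard library:

```python
import tempfile

with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("first line\nsecond line\n")
    file_path = f.name

# splitlines() drops trailing newlines, giving one list element per line.
with open(file_path) as f:
    lines = f.read().splitlines()
# lines[0] == 'first line'
```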

systematic_review.converter.try_convert_dataframe_column_elements_to_list(dataframe: pandas.core.frame.DataFrame, column_name: str) List[list][source]

Tries to convert each element of a dataframe column to a list object.

Parameters
  • dataframe (pd.DataFrame) – The dataframe with column to convert into list

  • column_name (str) – Name of column for conversion

Returns

This is list with each element of type list.

Return type

List[list]

systematic_review.converter.unpack_list_of_lists(list_of_lists)[source]

Unpack a list containing other lists into an output list that includes all elements from the nested lists.

Parameters

list_of_lists (list) – This is a list consisting of elements and lists. Example - [“first_element”, [“second_element”]]

Returns

This is the resultant list consisting of only elements. example [“first_element”, “second_element”]

Return type

list
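
A minimal sketch of the one-level unpacking described above (a sketch of the behaviour, not the package's actual implementation):

```python
def unpack_list_of_lists(list_of_lists):
    """Flatten one level: nested lists are expanded, plain elements kept."""
    result = []
    for item in list_of_lists:
        if isinstance(item, list):
            result.extend(item)
        else:
            result.append(item)
    return result

flat = unpack_list_of_lists(["first_element", ["second_element"]])
# flat == ['first_element', 'second_element']
```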

systematic_review.converter.unpack_list_of_lists_with_optional_apply_custom_function(list_of_lists: List[list], custom_function=None) list[source]

Unpack lists inside a list into a new list containing all the elements from list_of_lists, with the optional custom_function applied to every element. Example - [[1,2,3], [3,4,5]] to [1,2,3,3,4,5]

Parameters
  • list_of_lists (List[list]) – This list contains lists as elements which might contains other elements.

  • custom_function – This is optional function to be applied on each element of list_of_lists

Returns

list containing all the elements with any optional transformation using custom_function.

Return type

list

systematic_review.converter.write_json_file_with_dict(output_file_path: str, input_dict: dict) None[source]

Write json file at output_file_path with the help of input dictionary.

Parameters
  • output_file_path (str) – This is the path of the output file we want; if only a name is provided, the json is exported next to the script.

  • input_dict (dict) – This is the python dictionary which we want to be saved in json file format.

Returns

The function doesn’t return anything but writes a json file at output_file_path.

Return type

None

systematic_review.filter_sort module

Module: filter_sort Description for filter: each searched words group can be used to filter with conditions such as a group count >= some value, repeatedly, until you have the required number of articles left to read and filter manually.

Description for sort: This puts the data in sorted order so it is easier for humans to understand.

class systematic_review.filter_sort.FilterSort(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, required_number: int)[source]

Bases: object

This contains functionality to filter and sort the data.

filter_and_sort() pandas.core.frame.DataFrame[source]

Executes the filter and sort step: creates a sorting criterion list, then sorts the dataframe based on it.

Returns

This is the sorted dataframe with columns in this sequential order: the citation columns first, then total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

get_dataframe()[source]

Executes the filter and sort function and outputs the pd.DataFrame.

Returns

Outputs the filtered and sorted data.

Return type

pd.DataFrame

get_records_list()[source]

Executes the filter and sort function and outputs the records list.

Returns

Outputs the filtered and sorted data.

Return type

List[dict]

to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .csv extension.

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .xlsx extension.

  • index (bool) – Define if index is needed in output excel file or not.

systematic_review.filter_sort.dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list, reverse: bool = False)[source]

Provide a sorting criterion list for dataframe columns: citations columns go on the left and search-word counts on the right. Setting reverse to True puts the search words on the left instead.

Parameters
  • reverse (bool) – Defaults to False, outputting citations columns on the left and keyword counts on the right. True does the opposite.

  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • sorting_keywords_criterion_list (list) – This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Returns

This is the dataframe sorting criterion list containing columns in the logical order we desire: citation details on the left, with total_keywords, group_keywords_counts, and keywords_counts on the right.

Return type

list

systematic_review.filter_sort.filter_and_sort(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, search_words_object: systematic_review.search_count.SearchWords, required_number: int) pandas.core.frame.DataFrame[source]

Executes the filter and sort step: creates a sorting criterion list, then sorts the dataframe based on it.

Parameters
  • required_number (int) – This is the least number of documents we want.

  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • search_words_object (object) – search_words_object should contain a dictionary of unique search words per keyword group, meaning a keyword from the first keyword group cannot appear in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the sorted dataframe with columns in this sequential order: the citation columns first, then total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

systematic_review.filter_sort.filter_dataframe_on_keywords_group_name_count(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, min_limit: int, common_word: str = '_count', method: str = 'suffix') List[dict][source]

This function gets the column names from a pandas dataframe which contain a given prefix or suffix. It then filters the dataframe so that all such columns have values of more than min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • min_limit (int) – This is the least value we want in all search_words_object group names.

  • common_word (str) – This is the similar word string in many column names.

  • method (str) – This is to specify if we are looking for prefix or suffix in column names.

Returns

This is the filtered citations list based on min_limit of grouped_keywords_counts.

Return type

List[dict]
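
The filtering idea can be sketched in plain pandas: keep only rows where every ‘_count’-suffixed column meets the minimum (the data below is hypothetical):

```python
import pandas as pd

# Hypothetical counts dataframe with '_count'-suffixed group columns.
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "keyword_group_1_count": [3, 0, 5],
    "keyword_group_2_count": [2, 4, 1],
})

min_limit = 1
count_cols = [c for c in df.columns if c.endswith("_count")]

# Keep only rows where every count column meets the minimum.
filtered = df[(df[count_cols] >= min_limit).all(axis=1)]
records = filtered.to_dict("records")
# rows 'a' and 'c' survive; 'b' has a zero group count
```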

systematic_review.filter_sort.finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0)[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

  • search (bool) – This signifies the status of the search for the best value of min_limit.

  • prev_lower_total_articles_rows (int) – This is the previous lower total articles rows

Returns

This prints the values rather than returning them. It returns search, which is of no practical use.

Return type

bool

systematic_review.filter_sort.get_dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df, unique_preprocessed_clean_grouped_keywords_dict)[source]

This sorting criterion list is based on the search words from the main input search_words_object. It contains total_keywords, group_keywords_counts, keywords_counts.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary of unique search words per keyword group, meaning a keyword from the first keyword group cannot appear in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}. ‘risk’ is removed from keyword_group_2.

Returns

This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Return type

list

systematic_review.filter_sort.get_pd_df_columns_names_with_prefix_suffix(input_pandas_dataframe: pandas.core.frame.DataFrame, common_word: str = '_count', method: str = 'suffix') List[str][source]

Provide the columns name from pandas dataframe which contains given prefix or suffix.

Parameters
  • input_pandas_dataframe (pd.DataFrame) – This dataframe contains many columns some of which contains the common word we are looking for.

  • common_word (str) – This is the similar word string in many column names.

  • method (str) – This is to specify if we are looking for prefix or suffix in column names.

Returns

This list contains the name of columns which follow above criteria.

Return type

List[str]
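
A minimal sketch of the prefix/suffix selection on a list of column names (the helper name here is hypothetical):

```python
def columns_with_common_word(column_names, common_word="_count", method="suffix"):
    """Return the names that start or end with the common word."""
    if method == "suffix":
        return [name for name in column_names if name.endswith(common_word)]
    return [name for name in column_names if name.startswith(common_word)]

names = ["title", "keyword_group_1_count", "keyword_group_2_count", "count_flag"]
suffix_cols = columns_with_common_word(names)                     # suffix match
prefix_cols = columns_with_common_word(names, "count", "prefix")  # prefix match
```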

systematic_review.filter_sort.manually_check_filter_by_min_limit_changes(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, min_limit: int = 1, iterations: int = 20, addition: int = 20)[source]

Manual method to check the number of articles based on changing min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • min_limit (int) – This is the least value we want in all search_words_object group names.

  • iterations (int) – This is the number of iterations for the underlying loop.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

Returns

This prints the values rather than returning the values.

Return type

None

systematic_review.filter_sort.return_finding_near_required_article_by_changing_min_limit_while_loop(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int)[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

Returns

This tuple consists of the following values, in order. Exact match values: min_limit, total_articles_rows; lower_info: min_limit, lower_total_articles_rows; upper_info: min_limit, upper_total_articles_rows.

Return type

tuple

systematic_review.filter_sort.return_finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0, upper_info=(None, None))[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

  • search (bool) – This signify the status of searching for best value of min_limit

  • prev_lower_total_articles_rows (int) – This is the previous lower total articles rows

  • upper_info (list) – This is list consists of [min_limit, upper_total_articles_rows]

Returns

This tuple consists of the following values, in order. Searching flag: True or False; exact match values: min_limit, total_articles_rows; lower_info: min_limit, lower_total_articles_rows; upper_info: min_limit, upper_total_articles_rows.

Return type

tuple

systematic_review.filter_sort.sort_citations_grouped_keywords_counts_df(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list) pandas.core.frame.DataFrame[source]

This function sorts the dataframe based on the sorting criterion list.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • sorting_keywords_criterion_list (list) – This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Returns

This is the sorted dataframe which contains columns in this sequential order: total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

systematic_review.filter_sort.sort_dataframe_based_on_column(dataframe, column_name, ascending=True)[source]

Sort the dataframe based on column values.

Parameters
  • ascending (bool) – This decides increasing or decreasing sort order; defaults to ascending (a-z, 1-9).

  • dataframe (pd.DataFrame) – This is unsorted dataframe.

  • column_name (str) – This is the name of column which is used to sort the dataframe.

Returns

This is the sorted dataframe based on column_name.

Return type

pd.DataFrame

systematic_review.nlp module

Module: nlp (Natural language processing) This module contains functions related to removing stop words, lemmatization, and stemming approaches. Functions import the supporting AI model only when they are executed. For more examples and info visit: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

systematic_review.nlp.nltk_lancaster_stemmer(input_text: str) str[source]

This function returns stemmed text. Uses nltk.stem LancasterStemmer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.

Return type

str

systematic_review.nlp.nltk_porter_stemmer(input_text: str) str[source]

This function returns stemmed text. Uses nltk.stem PorterStemmer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.

Return type

str

systematic_review.nlp.nltk_remove_stopwords(text: str) str[source]

Remove unnecessary words such as she, are, of, which, and in.

Parameters

text (str) – This may contain any words in the dictionary.

Returns

This contains words other than stop words described in nltk english stop words.

Return type

str
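
A minimal sketch of stop-word removal; a small inline stop-word set stands in for nltk's full english list, which normally requires a one-time nltk.download("stopwords"):

```python
# Small inline stand-in for nltk's english stop-word list.
STOP_WORDS = {"she", "are", "of", "which", "and", "in", "the", "is"}

def remove_stopwords(text):
    # Keep only the words not in the stop-word set.
    return " ".join(word for word in text.split()
                    if word.lower() not in STOP_WORDS)

clean = remove_stopwords("risk is one of the factors which matter")
# clean == 'risk one factors matter'
```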

systematic_review.nlp.nltk_remove_stopwords_spacy_lemma(string_list_lower: str) List[str][source]

This function returns lemmatized text for a lowercase input string. Uses spacy en_core_web_sm

Parameters

string_list_lower (str) – This may contain any lowercase words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas.

Return type

List[str]

systematic_review.nlp.nltk_word_net_lemmatizer(input_text: str) str[source]

This function returns lemmatized text. Uses nltk.stem WordNetLemmatizer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.nlp.pattern_lemma_or_lemmatize_text(input_text: str, lemma_info: bool = False) str[source]

This returns lemma information if lemma_info is True; otherwise it returns lemmatized text. Uses pattern.en lemma

Parameters
  • input_text (str) – This may contain any words in the dictionary.

  • lemma_info (bool) – This switch determines whether to return lemma information or lemmatized text.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.nlp.spacy_lemma(input_text: str) str[source]

This function returns lemmatized text. Uses spacy en_core_web_sm

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.os_utils module

Module: os_utils This module contains functions related to getting directories, files, and filenames from os paths.

systematic_review.os_utils.extract_files_path_from_directories_or_subdirectories(directory_path: str) list[source]

Get all file paths from the directory and its subdirectories.

Parameters

directory_path (str) – This is the directory path of files we require.

Returns

This list contains path of all the files contained in directory_path.

Return type

list
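
The directory walk can be sketched with os.walk, which visits every subdirectory and yields its filenames:

```python
import os
import tempfile

# Build a small directory tree with one file inside a subdirectory.
root = tempfile.mkdtemp()
sub = os.path.join(root, "sub")
os.mkdir(sub)
open(os.path.join(sub, "paper.pdf"), "w").close()

# os.walk visits every directory; join each filename to its dirpath.
file_paths = []
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        file_paths.append(os.path.join(dirpath, name))
# file_paths holds the full path of paper.pdf
```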

systematic_review.os_utils.extract_subdirectories_path_from_directory(directory_path: str) list[source]

Get all subdirectory paths from the directory.

Parameters

directory_path (str) – This is the directory path of sub directories we require.

Returns

This list contains path of all the sub directories contained in directory_path.

Return type

list

systematic_review.os_utils.get_all_filenames_in_dir(dir_path: str) List[str][source]

This provides all the names of files at dir_path.

Parameters

dir_path (str) – This is the path of folder we are searching files in.

Returns

This is the list of all the names of files at dir_path.

Return type

List[str]

systematic_review.os_utils.get_directory_file_name_and_path(dir_path: str) tuple[source]

Get file names and file paths from directory path.

Parameters

dir_path (str) – This is the path of the directory.

Returns

This tuple contains list of downloaded_articles_name_list and downloaded_articles_path_list.

Return type

tuple

systematic_review.os_utils.get_file_extension_from_path(file_path: str) str[source]

Returns the file extension from the filepath.

Parameters

file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)

Returns

A filename extension, file extension or file type is an identifier specified as a suffix to the name of a computer file. for more info visit- https://en.wikipedia.org/wiki/Filename_extension

Return type

str

systematic_review.os_utils.get_filename_from_path(file_path: str) str[source]

Returns the filename from the filepath.

Parameters

file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)

Returns

A filename or file name is a name used to uniquely identify a computer file in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Filename

Return type

str
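
Both the filename and the extension can be sketched with os.path (note that os.path.splitext keeps the leading dot; whether the package strips it is not shown here):

```python
import os

# Hypothetical pdf path.
file_path = "/home/user/articles/paper_one.pdf"

filename = os.path.basename(file_path)      # 'paper_one.pdf'
extension = os.path.splitext(file_path)[1]  # '.pdf'
```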

systematic_review.os_utils.get_path_leaf(file_path: str) str[source]
Extract file name from path. For more details visit:

https://stackoverflow.com/questions/8384737/extract-file-name-from-path-no-matter-what-the-os-path-format

Parameters

file_path (str) – This is the path of file.

Returns

This is name of file.

Return type

str

systematic_review.os_utils.get_sources_name_citations_mapping(dir_path: str) list[source]

This makes the list of {‘sources_name’: ‘all source articles citations’, …} from the dir path of ris files.

Parameters

dir_path (str) – This is the path of folder we are searching ris files in.

Returns

This is the list of all the source names and their citations at dir_path.

Return type

list

systematic_review.search_count module

Module: search_count This module contains all necessary functions for searching the citations and articles text and counting the number of search words present.

class systematic_review.search_count.SearchCount(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, *args, **kwargs)[source]

Bases: object

Used to search for search_words in citations and research papers. It can take and output both a records list and a pandas.DataFrame.

citation_text_column_name = 'citation_text'
count_search_words_in_citations_text(citations_records_list: List[Dict[str, Any]]) List[Dict[str, Any]][source]

Loop over each citation to count search words (SearchWords instance) in the citation data.

Parameters

citations_records_list (List[Dict[str, Any]]) – This list contains all the citations details, with a column named ‘full_text’ containing full text such as the article name, abstract, and keywords.

Returns

This is the list of all citation search results, which contains all our search word counts. Examples - [{‘title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]
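
A minimal sketch of the counting idea under the grouped-search-words format shown above; the helper and data here are hypothetical, not the package's actual implementation:

```python
# Hypothetical grouped search words, matching the Example format above.
search_words = {
    "keyword_group_1": ["management", "risk"],
    "keyword_group_2": ["corporate", "pricing"],
}

def count_in_text(text, search_words):
    """Count each word and each group's total in the (lowercased) text."""
    tokens = text.lower().split()
    result = {"total_keywords": 0}
    for group_name, words in search_words.items():
        group_count = 0
        for word in words:
            n = tokens.count(word)
            result[word] = n
            group_count += n
        result[group_name + "_count"] = group_count
        result["total_keywords"] += group_count
    return result

counts = count_in_text("risk management of corporate risk", search_words)
# counts["risk"] == 2, counts["keyword_group_1_count"] == 3
```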

count_search_words_in_research_paper_text(research_papers_records_list: List[Dict[str, Any]]) List[Dict[str, Any]][source]

Loop over validated research papers to count search words (SearchWords instance) in the research papers data.

Parameters

research_papers_records_list (List[Dict[str, Any]]) – This list contains data of all the research papers files contained in directory_path.

Returns

This is the list of all citation search results, which contains all our search word counts. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]

counts() List[Dict[str, Any]][source]

This takes a records list and returns search counts based on the type of data: citation data or research papers data.

Returns

Records list containing the search counts for the citation data or research papers data.

Return type

List[Dict[str, Any]]

download_flag_column_name = 'downloaded'
get_dataframe()[source]

Outputs the pandas.DataFrame containing the count results for the input data.

Returns

This is the dataframe of all citation search results, which contains all our search word counts. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

pandas.DataFrame

get_records_list() List[Dict[str, Any]][source]

Outputs the records list containing the count results for the input data.

Returns

This is the list of records containing all the search word counts from the input data. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]

research_paper_file_location_column_name = 'file location'
to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .csv extension.

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .xlsx extension.

  • index (bool) – Define if index is needed in output excel file or not.

class systematic_review.search_count.SearchWords(search_words, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, default_search_words_group_name: str = 'search_words_group_', all_unique_keywords: bool = False, unique_keywords: bool = True, *args, **kwargs)[source]

Bases: object

This class contains all functionalities related to search words.

construct_search_words_from_list() dict[source]

This takes keywords_list, which contains search word strings such as [‘keyword1 keyword2 keyword3’, ‘keyword1 keyword2’], and constructs a dict such as {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}.

Returns

The dictionary pairing each group name with its search words string. Example - {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}

Return type

dict
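
As an illustration, the grouping behaviour described above can be sketched as a standalone function (a simplified reimplementation for clarity, not the package's actual source; the group_prefix parameter stands in for default_search_words_group_name):

```python
def construct_search_words_from_list(keywords_list, group_prefix="keyword_group_"):
    # Pair a numbered group name with each search-words string, in order.
    return {f"{group_prefix}{i}": words
            for i, words in enumerate(keywords_list, start=1)}

groups = construct_search_words_from_list(
    ["keyword1 keyword2 keyword3", "keyword1 keyword2"])
# {'keyword_group_1': 'keyword1 keyword2 keyword3',
#  'keyword_group_2': 'keyword1 keyword2'}
```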

creating_default_keyword_count_dict()[source]

Initialise keyword count dict with value 0 for every keyword.

Returns

This contains each keyword as a key, with 0 as its value.

Return type

dict

generate_keywords_count_dictionary(text)[source]
Parameters

text

get_sample_search_words_json(output_file_path: str = 'sample_search_words_template.json') None[source]

Outputs a sample search words json file template as an example, which the user can edit and upload.

Parameters

output_file_path (str) – This is the optional output file path for the json template.

Returns

The function creates the file in the root folder unless a path is specified in output_file_path.

Return type

None

get_sorting_keywords_criterion_list() List[str][source]

This sorting criterion list is based on the search words from the main input. It contains total_keywords, group_keywords_counts, and keywords_counts.

Returns

This is the sorting criterion list, with columns in the desired logical order: total_keywords first, then group_keywords_counts, with the individual search words last.

Return type

List[str]

preprocess_search_keywords_dictionary(grouped_keywords_dictionary: dict) dict[source]

This takes search words from the {keyword_group_name: search_words,…} dict and replaces symbols with spaces. It then converts them to lowercase and removes any duplicate keyword within each group. Outputs {keyword_group_name: [clean_keywords],…}.

Parameters

grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”,…}

Returns

This is the output dictionary, which contains the processed non-duplicate search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Return type

dict

preprocess_searched_keywords(grouped_keywords_dictionary: dict) dict[source]

Removes duplicate instances of search words that already appear in earlier search word groups.

Parameters

grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”,…}

Returns

This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Return type

dict

sample_dict = {'keywords_common_words': 'accuracy classification cross sectional cross-section expected metrics prediction predict expert system', 'keywords_finance': 'Management investing corporate pricing risk', 'keywords_machine_learning': 'neural fuzzy inference system artificial intelligence artificial computational neural networks'}
unique_keywords_in_preprocessed_clean_keywords_dict() set[source]

Returns the set of unique search words from the preprocessed_clean_keywords_dict.

Returns

This is the set of unique search words across all the search word groups.

Return type

set

systematic_review.search_count.adding_citation_details_with_keywords_count_in_pdf_full_text(filter_sorted_citations_df: pandas.core.frame.DataFrame, pdf_full_text_search_count: list, unique_preprocessed_clean_grouped_keywords_dict: dict, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') pandas.core.frame.DataFrame[source]

Combines the pdf search word counts with the citation details from the filtered and sorted citation full-text dataframe.

Parameters
  • second_column_name (str) – This is the name of the column which contains the pdf article title.

  • first_column_name (str) – This is the name of the column which contains the citation title.

  • filter_sorted_citations_df (pandas.DataFrame object) – This is the sorted dataframe whose columns are in this sequential order: the citation columns, then total_keywords, group_keywords_counts, and keywords_counts last.

  • pdf_full_text_search_count (list) – This is the list of all citation search results, which contains every search word count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Returns

This dataframe contains the citation details from the filtered and sorted citation full-text dataframe, along with the search word counts from searching the pdf file text.

Return type

pandas.DataFrame object

systematic_review.search_count.adding_dict_key_or_increasing_value(input_dict: dict, dict_key: str, step: int = 1, default_dict_value: int = 1)[source]

Increases the value stored at dict_key by step. If the key is not present, it is initialised with default_dict_value.

Parameters
  • input_dict (dict) – This is the dictionary which we want to modify.

  • dict_key (str) – This is the key of the dictionary.

  • step (int) – This is the amount by which the dictionary value is increased.

  • default_dict_value (int) – If the key is not in the dictionary, this default value is used when adding the new key.

Returns

This is the modified dictionary.

Return type

dict
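
A minimal sketch of this helper's documented behaviour (an illustrative reimplementation, not the package source):

```python
def adding_dict_key_or_increasing_value(input_dict, dict_key, step=1, default_dict_value=1):
    # Existing key: increase its value by step. Missing key: initialise with the default.
    if dict_key in input_dict:
        input_dict[dict_key] += step
    else:
        input_dict[dict_key] = default_dict_value
    return input_dict

counts = {}
adding_dict_key_or_increasing_value(counts, "risk")  # initialised to 1
adding_dict_key_or_increasing_value(counts, "risk")  # increased to 2
```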

systematic_review.search_count.citation_list_of_dict_search_count_to_df(citations_list: list, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over articles to calculate search word counts and returns a dataframe.

Parameters
  • title_column_name (str) – This is the name of the column which contains the citation title.

  • custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_list (list) – This is the citations list, with additional columns needed for the next steps of the systematic review and with duplicates removed.

  • keywords (dict) – This is the output dictionary, which contains the processed non-duplicate search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Returns

This is a pandas object of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.citation_search_count_dataframe(citations_df: pandas.core.frame.DataFrame, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over articles to calculate keyword counts and returns a dataframe.

Parameters
  • title_column_name (str) – This is the name of the column which contains the citation title.

  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_df (pandas.DataFrame object) – This is the DataFrame, with additional columns needed for the next steps of the systematic review and with duplicates removed.

  • keywords (dict) – This is the output dictionary, which contains the processed non-duplicate keywords. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Returns

This is a pandas object of all citation search results, which contains every keyword count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.count_keywords_in_citations_full_text(dataframe_citations_with_fulltext: pandas.core.frame.DataFrame, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list[source]

Loops over articles to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • dataframe_citations_with_fulltext (pd.DataFrame) – This dataframe contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • title_column_name (str) – This is the name of the column which contains the citation title.

Returns

This is the list of all citation search results, which contains every keyword count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_keywords_in_citations_full_text_list(citations_with_fulltext_list: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list[source]

Loops over articles to calculate search word counts.

Parameters
  • custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_with_fulltext_list (list) – This list contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • title_column_name (str) – This is the name of the column which contains the citation title.

Returns

This is the list of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_keywords_in_pdf_full_text(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title_pdf', method: str = 'preprocess_string', custom=None) list[source]

Loops over article pdf files to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • title_column_name (str) – This is the name of the column which contains the citation title.

  • list_of_downloaded_articles_path (list) – This list contains the paths of all the pdf files contained in directory_path.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the list of all citation search results, which contains every keyword count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_search_words_in_citations_text(citations_with_fulltext_list: list, search_words_object: systematic_review.search_count.SearchWords, text_column_name: str = "'citation_text'", text_manipulation_method_name: str = 'preprocess_string', custom=None, custom_text_manipulation_function=None, *args, **kwargs) list[source]

Loops over articles to calculate search word counts.

Parameters
  • custom_text_manipulation_function (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • text_manipulation_method_name (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_with_fulltext_list (list) – This list contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • search_words_object (SearchWords) – This is the SearchWords object, whose grouped search words look like {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • text_column_name (str) – This is the name of the column which contains the citation text.

Returns

This is the list of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_words_in_list_of_lists(list_of_lists: List[list]) dict[source]

Counts words in a list containing other lists of words.

Parameters

list_of_lists (List[list]) – Each element of this list is itself a list.

Returns

A dictionary with words as keys and counts as values.

Return type

dict
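
The tallying behaviour can be sketched as follows (an illustrative reimplementation using the standard library, not the package source):

```python
from collections import Counter

def count_words_in_list_of_lists(list_of_lists):
    # Flatten the inner lists and tally every word.
    counts = Counter()
    for inner_list in list_of_lists:
        counts.update(inner_list)
    return dict(counts)

word_counts = count_words_in_list_of_lists([["risk", "pricing"], ["risk"]])
# {'risk': 2, 'pricing': 1}
```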

systematic_review.search_count.creating_keyword_count_dict(unique_preprocessed_clean_grouped_keywords_dict: dict)[source]

Initialise keyword count dict with value 0 for every keyword.

Parameters

unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This contains each keyword as a key, with 0 as its value.

Return type

dict

systematic_review.search_count.get_sorting_keywords_criterion_list(unique_preprocessed_clean_grouped_keywords_dict: dict) list[source]

This sorting criterion list is based on the keywords from the main input keywords. It contains total_keywords, group_keywords_counts, keywords_counts.

Parameters

unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique keywords in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Returns

This is the sorting criterion list, with columns in the desired logical order: total_keywords first, then group_keywords_counts, with the individual keywords last.

Return type

list
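
The column ordering described above can be sketched as follows (an illustrative reimplementation; the ‘_count’ column-name suffix is an assumption inferred from the examples elsewhere in these docs, not confirmed package behaviour):

```python
def get_sorting_keywords_criterion_list(grouped_keywords):
    # total first, then one count column per group, then each keyword's own column.
    criterion_list = ["total_keywords"]
    criterion_list += [f"{group_name}_count" for group_name in grouped_keywords]
    for keywords in grouped_keywords.values():
        criterion_list += keywords
    return criterion_list

columns = get_sorting_keywords_criterion_list(
    {"keyword_group_1": ["management", "risk"],
     "keyword_group_2": ["corporate", "pricing"]})
# ['total_keywords', 'keyword_group_1_count', 'keyword_group_2_count',
#  'management', 'risk', 'corporate', 'pricing']
```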

systematic_review.search_count.pdf_full_text_search_count_dataframe(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over article pdf files to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • title_column_name (str) – This is the name of the column which contains the citation title.

  • list_of_downloaded_articles_path (list) – This list contains the paths of all the pdf files contained in directory_path.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the dataframe of all citation search results, which contains every keyword count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.remove_duplicates_keywords_from_next_groups(preprocessed_clean_grouped_keywords_dict: dict) dict[source]

Executes the deduplication step across keyword groups. It takes the processed {keyword_group_name: [clean_keywords],…} dict and removes a keyword from a group if it already appears in an earlier group.

Parameters

preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary of processed search words, deduplicated within each group but possibly repeated across groups. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”, “risk”],…}

Returns

This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Return type

dict
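
The cross-group deduplication described above can be sketched as follows (an illustrative reimplementation for clarity, not the package source):

```python
def remove_duplicates_keywords_from_next_groups(grouped_keywords):
    # A keyword kept in an earlier group is dropped from every later group.
    seen = set()
    unique_groups = {}
    for group_name, keywords in grouped_keywords.items():
        kept = [keyword for keyword in keywords if keyword not in seen]
        seen.update(kept)
        unique_groups[group_name] = kept
    return unique_groups

groups = remove_duplicates_keywords_from_next_groups(
    {"keyword_group_1": ["management", "investing", "risk"],
     "keyword_group_2": ["corporate", "pricing", "risk"]})
# 'risk' survives only in keyword_group_1
```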

systematic_review.string_manipulation module

Module: string_manipulation This module contains functions related to string case changes, preprocessing, and removing parts of strings.

systematic_review.string_manipulation.convert_string_to_lowercase(string: str) str[source]

Lowercase the given input string.

Parameters

string (str) – The string which might have uppercase characters in it.

Returns

This is the all-lowercase string.

Return type

str

systematic_review.string_manipulation.pdf_filename_from_filepath(article_path: str) str[source]

This takes the pdf path as input and cleans the pdf name by applying the preprocess function from the string_manipulation module.

Parameters

article_path (str) – This is the path of the pdf file.

Returns

This is the cleaned filename of the pdf.

Return type

str

systematic_review.string_manipulation.preprocess_string(string: str) str[source]

Replaces symbols in the string with spaces and lowercases the given input string. Example - ‘Df%$df’ -> ‘df df’

Parameters

string (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.

Returns

This is the cleaned string, free of symbols and containing only lowercase alpha characters and spaces.

Return type

str
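
The behaviour above can be sketched as follows (an illustrative reimplementation; collapsing each run of symbols into a single space is an assumption made to match the documented ‘Df%$df’ -> ‘df df’ example):

```python
import re

def preprocess_string(string):
    # Replace each run of non-letter characters with a single space, then lowercase.
    return re.sub(r"[^A-Za-z]+", " ", string).lower()

cleaned = preprocess_string("Df%$df")
# 'df df'
```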

systematic_review.string_manipulation.preprocess_string_to_space_separated_words(string: str) str[source]

Replaces symbols in the string with spaces, lowercases it, and collapses multiple spaces into single spaces. Example - ‘Df%$df’ -> ‘df df’

Parameters

string (str) – This can contain string words mixed with spaces and symbols.

Returns

The words arranged with single spaces, with the extra spaces and symbols removed.

Return type

str

systematic_review.string_manipulation.remove_non_ascii(string_list: list) list[source]

Removes non-ASCII characters from a list of tokenized words.

Parameters

string_list (list) – This list contains words which may contain non-ASCII characters.

Returns

This is the modified list after removing the non-ASCII characters.

Return type

list
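
One common way to implement this (a sketch of the described behaviour, not necessarily the package's exact approach) is to round-trip each word through an ASCII encode that silently drops anything outside the ASCII range:

```python
def remove_non_ascii(string_list):
    # encode(..., "ignore") drops any character outside the ASCII range.
    return [word.encode("ascii", "ignore").decode("ascii") for word in string_list]

words = remove_non_ascii(["café", "naïve", "plain"])
# ['caf', 'nave', 'plain']
```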

systematic_review.string_manipulation.replace_symbols_with_space(string: str) str[source]

Replaces symbols in the string with spaces. Example - ‘df%$df’ -> ‘df df’

Parameters

string (str) – This is input word string which contains unwanted symbols.

Returns

This is the string with symbols replaced by spaces; the remaining characters are otherwise unchanged.

Return type

str

systematic_review.string_manipulation.split_preprocess_string(text: str) list[source]

This splits the text into a list of words after applying the preprocess function from the string_manipulation module.

Parameters

text (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.

Returns

This is the cleaned list of word strings, free of symbols and containing only alpha characters.

Return type

list

systematic_review.string_manipulation.split_words_remove_duplicates(string_list: list) list[source]

This function takes a list of words or sentences and splits them into individual words. It also removes any repeated word from the list.

Parameters

string_list (list) – This is the input list, which contains words and groups of words. Example - [‘one’, ‘one two’]

Returns

This is the output list, which contains only unique individual words (via set()). Example - [‘one’, ‘two’]

Return type

list
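
A minimal sketch of this splitting-and-deduplication behaviour (an illustrative reimplementation; note that set() does not guarantee the order of the result, matching the docstring's use of set()):

```python
def split_words_remove_duplicates(string_list):
    # Split every entry into words and collect them in a set to drop repeats.
    unique_words = set()
    for entry in string_list:
        unique_words.update(entry.split())
    return list(unique_words)

words = split_words_remove_duplicates(["one", "one two"])
# ['one', 'two'] in some order -- set() does not preserve ordering
```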

systematic_review.string_manipulation.string_dict_to_lower(string_map: dict) dict[source]

This converts the dict values to lowercase. A similar function for lists is available as string_list_to_lower().

Parameters

string_map (dict) – These are the key: value pairs to be converted.

Returns

The output dict, with each value converted to lowercase.

Return type

dict

systematic_review.string_manipulation.string_list_to_lower(string_list: list) list[source]

This converts the list values to lowercase. A similar function for dicts is available as string_dict_to_lower().

Parameters

string_list (list) – This list contains the input strings to be converted to lowercase.

Returns

This is the output list, which contains the original input strings in lowercase.

Return type

list

systematic_review.string_manipulation.string_to_space_separated_words(text: str) str[source]

Takes a text string and outputs space-separated words.

Parameters

text (str) – This text may contain multiple spaces or trailing whitespace.

Returns

This is the space-separated word string with no trailing whitespace.

Return type

str
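
The whitespace normalisation described above is the classic split-and-join idiom (a sketch of the described behaviour):

```python
def string_to_space_separated_words(text):
    # str.split() with no argument collapses runs of whitespace and trims the ends.
    return " ".join(text.split())

result = string_to_space_separated_words("  df   df ")
# 'df df'
```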

systematic_review.string_manipulation.strip_string_from_right_side(string: str, value_to_be_stripped: str = '.pdf') str[source]

This function removes the given substring from the right side of the string.

Parameters
  • string (str) – This is the complete word or string. Example - ‘monster.pdf’

  • value_to_be_stripped (str) – This is the value to be removed from the right side. Example - ‘.pdf’

Returns

This is the trimmed string that contains the left part after some part removed from the right. Example - ‘monster’

Return type

str
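
A minimal sketch of this suffix removal (an illustrative reimplementation; note that str.rstrip would instead strip any trailing characters drawn from the set ‘.’, ‘p’, ‘d’, ‘f’, which is not what is wanted here):

```python
def strip_string_from_right_side(string, value_to_be_stripped=".pdf"):
    # Remove the suffix as a whole substring, only when it is actually present.
    if value_to_be_stripped and string.endswith(value_to_be_stripped):
        return string[:-len(value_to_be_stripped)]
    return string

name = strip_string_from_right_side("monster.pdf")
# 'monster'
```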

systematic_review.string_manipulation.text_manipulation_methods(text: str, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function: Optional[Callable[[str, Any, Any], str]] = None, *args, **kwargs) str[source]

This converts text using options such as the preprocess function or nlp module functions; for more information, see the respective methods. args and kwargs are passed through to custom_text_manipulation_function.

Parameters
  • kwargs (Dict[str, Any]) – These keyword arguments are passed through to custom_text_manipulation_function.

  • args (Tuple) – These positional arguments are passed through to custom_text_manipulation_function.

  • custom_text_manipulation_function (Callable[[str, Any, Any], str]) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • text (str) – The string type text which is to be converted.

  • text_manipulation_method_name (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase, preprocess_string_to_space_separated_words.

Returns

This returns the converted text.

Return type

str

systematic_review.validation module

Module: validation This module contains functions related to validating that downloaded articles are the same as the ones we require. It also contains functions to get an article’s source name and to create lists of missed or duplicate articles.

class systematic_review.validation.ValidateWordsInText(words_string: str, text_string: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]

Bases: object

This checks words in given Text.

exact_words_checker_in_text() bool[source]

This checks for an exact match of the substring in the string and returns True or False based on success.

Returns

This returns True if the exact words_string is found in text_string, else False.

Return type

bool
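
The exact-match check is essentially a substring test; a standalone sketch of the idea (not the class's actual method):

```python
def exact_words_checker_in_text(words_string, text_string):
    # True only when the whole words_string appears verbatim in text_string.
    return words_string in text_string

found = exact_words_checker_in_text("deep learning", "a survey of deep learning methods")
# True
```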

jumbled_words_percentage_checker_in_text() tuple[source]

Starts calculating the match percentage once half of the words are found in sequence. This also takes into consideration words that got jumbled up during the pdf reading operation.

Returns

This returns True if the words_string is found in text_string (else False), along with the matched substring percentage.

Return type

tuple

multiple_methods() tuple[source]

This method uses different checks to validate the article_name (substring) in the text. Example - exact_words, words_percentage, jumbled_words_percentage.

Returns

A True or False value, with True depicting a validated article; the matched percentage; and lastly the method used - exact_words, words_percentage, jumbled_words_percentage, or all if every method was executed to validate.

Return type

tuple

words_percentage_checker_in_text() tuple[source]

This checks for a match of the substring in the string and returns True or False based on success, along with the matched word percentage. Note - words_percentage_checker_in_text_validation_limit does not work properly if words_string has duplicate words.

Returns

This returns True if the words_string is found in text_string (else False), along with the matched substring percentage.

Return type

tuple
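
The idea behind the percentage check can be approximated as follows (a rough sketch only: the real method matches words in sequence and handles duplicates differently, as noted above; the set-based matching here is an assumption):

```python
def words_percentage_checker_in_text(words_string, text_string, validation_limit=70.0):
    # Fraction of the title's words that appear anywhere in the text, as a percentage.
    words = words_string.split()
    text_words = set(text_string.split())
    matched = sum(1 for word in words if word in text_words)
    percentage = 100.0 * matched / len(words)
    return percentage >= validation_limit, percentage

ok, percentage = words_percentage_checker_in_text(
    "risk management survey", "a broad survey of risk management practice")
# (True, 100.0)
```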

class systematic_review.validation.Validation(citations_data: Union[List[dict], pandas.core.frame.DataFrame], parents_directory_of_research_papers_files: str, text_file_path_of_inaccessible_research_papers: Optional[str] = None, text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words', words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]

Bases: object

This is used to validate the downloaded files.

add_downloaded_flag_column_and_file_location_column()[source]

Adds empty columns named after research_paper_file_location_column_name and download_flag_column_name.

Returns

The data containing the new columns.

Return type

List[dict]

check()[source]

Executes the validation of research articles in the citation data by checking the research paper files and validating that the research articles are correct.

Returns

data contains columns with downloaded, validation method and file path columns

Return type

List[dict]

cleaned_article_column_name = 'cleaned_title'
download_flag_column_name = 'downloaded'
file_invalidated_flag_name = 'wrong'
file_manual_check_flag_name = 'unreadable'
file_name_and_path_dict()[source]

contains mapping of filename to file paths

Returns

key is filename and value is file paths.

Return type

dict

file_not_accessible_flag_name = 'no access'
file_not_downloaded_flag_name = 'no'
file_validated_flag_name = 'yes'
get_dataframe()[source]

Outputs the pandas.DataFrame containing validation results of input data.

Returns

This is the dataframe which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.

Return type

pandas.DataFrame

get_records_list() List[Dict[str, Any]][source]

Outputs the records list containing validation results of input data.

Returns

This is the list of records which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.

Return type

List[Dict[str, Any]]

info()[source]

Equivalent to pandas.DataFrame.value_counts(), It return list with count of unique element in column

Returns

unique download_flag_column_name elements with counts

Return type

object

research_paper_file_location_column_name = 'file location'
to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of output file which should contains .csv extension

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of output file which should contains .xlsx extension

  • index (bool) – Define if index is needed in output excel file or not.

validation_manual_method_name = 'manual'
validation_method_column_name = 'validation method'
systematic_review.validation.add_dict_element_with_count(dictionary: dict, key: str) dict[source]

It increase the value by checking the key of dictionary or initialise new key with value 1. Works as collections module default dict with value 1.

Parameters
  • dictionary (dict) – This is the dictionary where we want to add element.

  • key (str) – This is the key of dictionary {key: value}

Returns

This is the edited dict with new elements counts.

Return type

dict

systematic_review.validation.amount_by_percentage(number: float, percentage: float) float[source]

get the amount of number based on percentage. example- 5% (percentage) of 10 (number) is 0.5 (result).

Parameters
  • number (float) – This is the input number from which we want some percent amount

  • percentage (float) – This is equivalent to math percentage.

Returns

This is resultant number.

Return type

float

systematic_review.validation.calculate_percentage(value: float, total: float) float[source]

calculate percentage of value in total.

Parameters
  • value (float) – It is input number, normally smaller than total.

  • total (float) – It is the larger number from which we want to know percentage

Returns

This is calculated percentage. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.compare_two_dict_members_via_percent_similarity(first_dict: dict, second_dict: dict) float[source]

Compare elements in 2 dictionaries and return percentage similarity.

Parameters
  • first_dict (dict) – Example - first_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1}

  • second_dict (dict) – Example - second_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1, ‘algorithm’: 1}

Returns

This is percentage represented as decimal number. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.compare_two_list_members_via_percent_similarity(words_list: list, boolean_membership_list: list) float[source]

Compare elements in 2 lists and return percentage similarity.

Parameters
  • words_list (list) – This contains elements whose elements to be checked for similarity.

  • boolean_membership_list (list) – This list contains True and False values.

Returns

This is percentage represented as decimal number. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.deep_validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple[source]

It produce list of matched columns rows and unmatched column rows based on same column from both.

Parameters
  • second_column_name (str) – This is the name of column which contain pdf article title.

  • first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • first_column_name (str) – This is the name of column which contain citation title.

Returns

matched_list - It contains column’s row which are matched in both data object. unmatched_list - It contains column’s row which are unmatched in both data object.

Return type

tuple

systematic_review.validation.dict_from_list_with_element_count(input_list)[source]

Put input list elements into dictionary with count.

Parameters

input_list (list) – This is the list with elements with some duplicates present.

Returns

This is dictionary key as list elements and value as list each element count

Return type

dict

systematic_review.validation.exact_words_checker_in_text(words_string: str, text_string: str) bool[source]

This checks for exact match of substring in string and return True or False based on success.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

Returns

This returns True if exact words_string found in text_string else False.

Return type

bool

systematic_review.validation.finding_missed_articles_from_downloading(validated_pdf_list: list, original_articles_list: list) tuple[source]

Checks how many articles are not downloaded yet from original list of articles.

Parameters
  • validated_pdf_list (list) – Contains name of pdf files whose filename is in the pdf text.

  • original_articles_list (list) – This is original list from where we started downloading the articles.

Returns

Missing_articles - these are the articles which are missed from downloading. Validated_articles - This is list of validated downloaded articles list.

Return type

tuple

systematic_review.validation.get_dataframe_column_as_list(dataframe: pandas.core.frame.DataFrame, column_name: str = 'primary_title')[source]

Get pandas dataframe column values as list.

Parameters
  • dataframe (pd.DataFrame) – This is the dataframe which contains column whose details we want as list.

  • column_name (str) – This is the name of the column.

Returns

This is the list containing the dataframe one column values.

Return type

list

systematic_review.validation.get_missed_articles_dataframe(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title') list[source]

return list of missed articles from downloading by checking original list of articles from filter_sorted_citations_df using downloaded articles path.

Parameters
  • title_column_name (str) – contains name of column which contain the name of article.

  • filter_sorted_citations_df (pd.DataFrame) – This dataframe contains records of selected articles including name of articles.

  • downloaded_articles_path (str) – contains parent folder of all the downloaded articles files.

Returns

list of the missed articles from downloading.

Return type

list

systematic_review.validation.get_missed_original_articles_list(original_article_list: list, downloaded_article_list: list) list[source]

This check elemets of the original_article_list in downloaded_article_list and return missed articles list.

Parameters
  • original_article_list (list) – This list elements are checked if they are present in other list.

  • downloaded_article_list (list) – This list is checked if it consists elements of other list

Returns

This contains missing elements of original_article_list in downloaded_article_list.

Return type

list

systematic_review.validation.getting_article_paths_from_validation_detail(list_of_validation: list) list[source]

Getting the first element from list of lists.

Parameters

list_of_validation (list) – This list contain list of three values where the first is article path.

Returns

This output list contains the articles paths

Return type

list

systematic_review.validation.jumbled_words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70, wrong_word_limit: int = 2) tuple[source]

start calculating percentage if half of words are found in sequence. This also takes in consideration of words which got jumbled up due to pdf reading operation.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

  • validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • wrong_word_limit (int) – This is the limit unto which algorithm ignore the wrong word in sequence.

Returns

This returns True if exact words_string found in text_string else False. This also returns matched substring percentage.

Return type

tuple

systematic_review.validation.manual_validating_of_pdf(articles_path_list: list, manual_index: int) tuple[source]

This is mostly a manually used function to validate some pdfs at the end of validation process. It makes it easy to search and validate pdf and store in a list. Advice: convert these lists as text file using function in converter module to avoid data loss.

Parameters
  • articles_path_list (list) – These are the list of articles which skipped our automated screening and validation algorithms. mostly due to pdf to text conversions errors.

  • manual_index (list) – This is the index from where you will start checking in article_path_list. Normally in many tries.

Returns

external_validation_list - This is the list to be saved externally for validated articles. external_invalidated_list - This is the list to be saved externally for invalidated articles.

Return type

tuple

systematic_review.validation.multiple_methods_validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple[source]

This function checks name of file and find the name in the text of pdf file. if it become successful then pdf is validated as downloaded else not downloaded. Example - pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pdf_reader (str) – This is python pdf reader package which convert pdf to text.

  • pdf_file_path (str) – the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

True and False value depicting validated article with True value. This also shows percentage matched Last it shows the text_manipulation_method_name used. like exact_words, words_percentage, jumbled_words_percentage, all if every text_manipulation_method_name is executed to validate.

Return type

tuple

systematic_review.validation.multiple_methods_validating_words_string_in_text(article_name: str, text: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2) tuple[source]

This text_manipulation_method_name uses different methods to validate the article_name(substring) in text. Example - exact_words, words_percentage, jumbled_words_percentage.

Parameters
  • jumbled_words_percentage_checker_in_text_wrong_word_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • jumbled_words_percentage_checker_in_text_validation_limit (int) – This is the limit unto which algorithm ignore the wrong word in sequence.

  • words_percentage_checker_in_text_validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • article_name (str) – This is input string which we want to validate in text.

  • text (str) – This is query string or lengthy text.

Returns

True and False value depicting validated article with True value. This also shows percentage matched Last it shows the text_manipulation_method_name used. like exact_words, words_percentage, jumbled_words_percentage, all if every text_manipulation_method_name is executed to validate.

Return type

tuple

systematic_review.validation.similarity_sequence_matcher(string_a: str, string_b: str) float[source]

Shows the percentage similarity between two strings like 0.9836065573770492 that means 98.35%

Parameters
  • string_a (str) – This is first string

  • string_b (str) – This is second string

Returns

This is the result of SequenceMatcher Example 0.9836065573770492 that means 98.35%

Return type

float

systematic_review.validation.validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple[source]

It produce list of matched columns rows and unmatched column rows based on same column from first list of dict. Note- emphasis on first list as function check all records of first list of dict in second list of dict. title column of second_list_of_dict is kept by merging with first.

Parameters
  • second_column_name (str) – This is the name of column which contain pdf article title.

  • first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • first_column_name (str) – This is the name of column which contain citation title.

Returns

matched_list - It contains column’s row which are matched in both data object. unmatched_list - It contains column’s row which are unmatched in both data object.

Return type

tuple

systematic_review.validation.validating_multiple_pdfs_via_filenames(list_of_pdf_files_path: list, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple[source]

This function checks pdf files in list_of_pdf_files_path and validate them with function named ‘validating_pdf_via_filename’. Example - multiple pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • pdf_reader (str) – This is python pdf reader package which convert pdf to text.

  • list_of_pdf_files_path (list) – the list of the path of the pdf file.

Returns

validated_pdf_list - contains name of pdf files whose filename is in the pdf text invalidated_pdf_list - list of name of files which can’t be included in validated_pdf_list manual_pdf_list - list of files which can’t be opened using python pdf reader or errors opening them.

Return type

tuple

systematic_review.validation.validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', method: str = 'exact_words') bool[source]

This function checks name of file and find the name in the text of pdf file. if it become successful then pdf is validated as downloaded else not downloaded. Example - pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pdf_file_path (str) – the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • method (str) – This is the switch option to select text_manipulation_method_name from exact_words, words_percentage, jumbled_words_percentage.

Returns

True and False value depicting validated article with True value.

Return type

bool

systematic_review.validation.validating_pdfs_using_multiple_pdf_reader(pdfs_parent_dir_path: str) tuple[source]

This function uses two python readers pdftotext and pymupdf for validating if the filename are present inside of pdf file text.

Parameters

pdfs_parent_dir_path (str) – This is the parent directory of all the downloaded pdfs.

Returns

validated_pdf_list - contains name of pdf files whose filename is in the pdf text invalidated_pdf_list - list of name of files which can’t be included in validated_pdf_list manual_pdf_list - list of files which can’t be opened using python pdf reader or errors opening them.

Return type

tuple

systematic_review.validation.words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70) tuple[source]

This checks for exact match of substring in string and return True or False based on success. It also returns matched word percentage. Limit: this doesn’t work properly if words_string have duplicate words.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

  • validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

Returns

This returns True if exact words_string found in text_string else False. This also returns matched substring percentage.

Return type

tuple