systematic_review package

systematic_review.analysis module

Module: analysis. This module contains code for generating information, diagrams, and tables. It can be used to generate systematic review flow and citation information.

class systematic_review.analysis.Annotate(figure_axes, start_coordinate, end_coordinate, arrow_style='<|-')[source]

Bases: object

This class makes it easier to draw arrows on a matplotlib.pyplot.axes figure.

add_arrow(text='')[source]

This draws the arrow on the matplotlib.pyplot.axes figure.

Parameters

text (str) – This takes the text to put on the arrow.

class systematic_review.analysis.CitationAnalysis(dataframe)[source]

Bases: object

This takes any pandas dataframe containing citation details and produces analyses on various columns.

authors_analysis(authors_column_name='authors')[source]

generates details based on the pandas dataframe column of article authors, e.g. number of authors, articles with single authors, articles per author, and authors per article.

Parameters

authors_column_name (str) – Name of column containing authors details.

Returns

contains number of authors, articles with single authors, articles per author, and authors per article

Return type

tuple
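A rough, self-contained sketch of these four statistics in plain pandas (the `authors_summary` helper and the `"; "` author separator are illustrative assumptions, not the package's API):

```python
import pandas as pd

def authors_summary(df, authors_column="authors", sep="; "):
    # Hypothetical helper approximating the four statistics above.
    author_lists = df[authors_column].str.split(sep)
    counts = author_lists.str.len()
    unique_authors = {name for names in author_lists for name in names}
    number_of_authors = len(unique_authors)
    number_of_articles = len(df)
    return (
        number_of_authors,                        # Number of authors
        int((counts == 1).sum()),                 # Articles with single authors
        number_of_articles / number_of_authors,   # Articles per author
        float(counts.mean()),                     # Authors per article
    )

df = pd.DataFrame({"authors": ["A; B", "A", "C; B; A"]})
print(authors_summary(df))  # (3, 1, 1.0, 2.0)
```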

authors_info()[source]

prints the authors analysis details in a readable format.

extract_keywords(column_name: str = 'keywords')[source]

returns a dataframe with a keywords column containing one keyword per row, as used in the articles.

Parameters

column_name (str) – column name of the keyword details in the citation dataframe

keyword_diagram(column_name: str = 'keywords', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a chart showing how often different keywords are used in the articles.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the keyword details in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

keywords_info(column_name: str = 'keywords')[source]

returns the keywords and the number of times they are used in the articles.

Parameters

column_name (str) – column name of the keyword details in the citation dataframe

publication_place_diagram(column_name: str = 'place_published', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a chart showing how many articles are published from different places or countries.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publication place detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publication_place_info(column_name: str = 'place_published')[source]

shows how many articles are published from different places or countries.

Parameters

column_name (str) – column name of publication place detail in citation dataframe

Returns

contains publication place and count of publications

Return type

object

publication_year_diagram(column_name: str = 'year', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates chart showing how many articles are published each year.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publication year detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publication_year_info(column_name: str = 'year')[source]

shows how many articles are published each year.

Parameters

column_name (str) – column name of publication year detail in citation dataframe

Returns

contains year and count of publications

Return type

object
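The count returned here corresponds to a pandas value_counts over the year column; a minimal illustration:

```python
import pandas as pd

# Rough equivalent of publication_year_info: count articles per publication year.
citations = pd.DataFrame({"year": [2020, 2021, 2020, 2019, 2020]})
year_counts = citations["year"].value_counts()
print(year_counts.to_dict())  # {2020: 3, 2021: 1, 2019: 1}
```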

publisher_diagram(column_name: str = 'publisher', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates chart showing how many articles are published by different publishers.

Parameters
  • pandas_bar_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • column_name (str) – column name of the publisher detail in the citation dataframe

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’

  • diagram_fname (str) – filename or path of the diagram image to be saved

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

publisher_info(column_name: str = 'publisher')[source]

shows how many articles are published by different publishers.

Parameters

column_name (str) – column name of publisher detail in citation dataframe.

Returns

contains publisher name and count of publications.

Return type

object

class systematic_review.analysis.SystematicReviewInfo(citations_files_parent_folder_path: Optional[str] = None, filter_sorted_citations_df: Optional[pandas.core.frame.DataFrame] = None, validated_research_papers_df: Optional[pandas.core.frame.DataFrame] = None, selected_research_papers_df: Optional[pandas.core.frame.DataFrame] = None)[source]

Bases: object

This analyses the whole systematic review process and takes all produced files to generate tables and figures.

download_flag_column_name = 'downloaded'
file_validated_flag_name = 'yes'
get_text_list() List[str][source]

This produces the list of all analyses done in this class.

Returns

This contains systematic review information in sentences.

Return type

List[str]

info()[source]

This takes the systematic review text list and prints it in proper order.

systematic_review_diagram(fig_width=10, fig_height=10, diagram_fname: Optional[str] = None, color: bool = True, color_info: bool = True, auto_fig_size: bool = True, hide_border: bool = True, **kwargs)[source]

This outputs the systematic review diagram resembling PRISMA guidelines.

Parameters
  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

  • hide_border (bool) – hides the border, i.e. the line outside of the diagram.

  • auto_fig_size (bool) – sets the figure size automatically based on the given data.

  • color (bool) – colors the inside of the diagram boxes; turn this off by passing False.

  • color_info (bool) – shows the meaning of the colors in the diagram.

  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • fig_width (float) – width of the figure in inches.

  • fig_height (float) – height of the figure in inches.

class systematic_review.analysis.TextInBox(figure_axes, x_coordinate, y_coordinate, text='')[source]

Bases: object

This is a matplotlib text-in-box class that makes it easier to use text boxes.

add_box(**kwargs: Union[dict, str, Any])[source]

It puts the box on the matplotlib.pyplot.axes figure.

Parameters

kwargs (Union[dict, str, Any]) – This takes any custom options to be set on the box.

systematic_review.analysis.analysis_of_multiple_ris_citations_files(citations_files_parent_folder_path: str) dict[source]

This function loads all RIS citation files from a folder and returns a dict mapping database names to the number of citations collected from each database.

Parameters

citations_files_parent_folder_path (str) – this is the path of the parent folder where the citation files exist.

Returns

this is a dict of database names and the number of records in the RIS files.

Return type

dict
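The per-file record counting can be sketched with the standard library; `count_ris_records` below is a hypothetical helper, not the package's function, and assumes each record ends with an ‘ER’ tag, as the RIS format specifies:

```python
import re

def count_ris_records(ris_text):
    # Count citations by counting end-of-record tags; the RIS format
    # terminates every record with an "ER  - " line.
    return sum(1 for line in ris_text.splitlines() if re.match(r"^ER\s+-", line))

sample = (
    "TY  - JOUR\nTI  - First article\nER  - \n"
    "TY  - JOUR\nTI  - Second article\nER  - \n"
)
print(count_ris_records(sample))  # 2
```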

systematic_review.analysis.creating_sample_review_file(selected_citation_df)[source]

This function outputs a dataframe with additional columns to make literature review easier.

Parameters

selected_citation_df (pandas.DataFrame object) – This dataframe is the result of the last step of systematic-reviewpy. It contains records for manual literature review.

Returns

This is a dataframe with additional columns to help in adding details of the literature review.

Return type

pandas.DataFrame object

systematic_review.analysis.custom_box(**kwargs) dict[source]

This is the option for matplotlib text in box.

Parameters

kwargs (dict) – Contains key word arguments

Returns

contains options

Return type

dict

systematic_review.analysis.duplicate_count(dataframe: pandas.core.frame.DataFrame) int[source]

returns the count of the duplicate articles.

Parameters

dataframe (pd.DataFrame) – Input pandas dataframe where we want to check numbers of duplicates.

Returns

number of duplicates records.

Return type

int
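This count corresponds to pandas' duplicated(); a minimal sketch (`count_duplicates` is an illustrative stand-in, not the package's implementation):

```python
import pandas as pd

def count_duplicates(dataframe):
    # Rows flagged True by duplicated() repeat an earlier row exactly.
    return int(dataframe.duplicated().sum())

citations = pd.DataFrame({"title": ["a", "b", "a", "a"],
                          "year":  [2000, 2001, 2000, 2002]})
print(count_duplicates(citations))  # 1 (row 2 repeats row 0)
```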

systematic_review.analysis.missed_article_count(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title')[source]

returns the count of articles missed during downloading by checking the original list of articles from filter_sorted_citations_df against the downloaded articles path.

Parameters
  • title_column_name (str) – name of the column which contains the article name.

  • filter_sorted_citations_df (pd.DataFrame) – this dataframe contains records of selected articles, including article names.

  • downloaded_articles_path (str) – parent folder of all the downloaded article files.

Returns

count of the missed articles from downloading.

Return type

int

systematic_review.analysis.pandas_countplot_with_pandas_dataframe_column(dataframe, column_name, top_result, plot_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]

generates a pandas count chart using a dataframe column.

Parameters
  • dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.

  • column_name (str) – name of the pandas column whose elements are to be counted.

  • top_result (int) – limits the number of unique column elements to be shown

  • plot_kind (str) – pandas plot option for the kind of chart needed; defaults to ‘bar’ in this implementation

  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

systematic_review.analysis.seaborn_countplot_with_pandas_dataframe_column(dataframe, column_name, theme_style='darkgrid', xaxis_label_rotation=90, top_result=None, diagram_fname: Optional[str] = None, **kwargs)[source]

generates a seaborn count bar chart using a dataframe column.

Parameters
  • diagram_fname (str) – filename or path of the diagram image to be saved.

  • dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.

  • column_name (str) – name of the pandas column whose elements are to be counted.

  • theme_style (str) – name of the chart theme

  • xaxis_label_rotation (float) – rotation of the column labels shown on the x-axis, in degrees

  • top_result (int) – limits the number of unique column elements to be shown

  • kwargs (dict) – kwargs are passed on to matplotlib.pyplot.savefig(**kwargs)

Returns

show the bar chart

Return type

object

systematic_review.analysis.text_padding_for_visualise(text: str, front_padding_space_multiple: int = 4, top_bottom_line_padding_multiple: int = 1)[source]

This adds the required space on all four sides of the text for a better look.

Parameters
  • text (str) – this is the input text.

  • front_padding_space_multiple (int) – multiplies the spaces on the left and right sides for increased padding.

  • top_bottom_line_padding_multiple (int) – multiplies the blank lines on the top and bottom for increased padding.

Returns

str - text with spaces on all four sides; int - height, i.e. the number of lines; int - width, i.e. the number of characters in the longest line.

Return type

tuple
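A hypothetical re-implementation illustrating the return tuple; the exact spacing rules of the library are an assumption here:

```python
def pad_text(text, front_padding_space_multiple=4, top_bottom_line_padding_multiple=1):
    # Illustrative sketch: pad left/right with spaces and top/bottom with
    # blank lines, returning (padded_text, height_in_lines, width_in_chars).
    side = front_padding_space_multiple
    rows = text.splitlines() or [""]
    inner_width = max(len(row) for row in rows)
    width = inner_width + 2 * side
    padded_rows = [" " * side + row.ljust(inner_width) + " " * side for row in rows]
    blank_rows = [" " * width] * top_bottom_line_padding_multiple
    all_rows = blank_rows + padded_rows + blank_rows
    return "\n".join(all_rows), len(all_rows), width

padded, height, width = pad_text("hi")
print(height, width)  # 3 10
```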

systematic_review.analysis.vertical_dict_view(dictionary: dict) str[source]

converts a dict to a string with each element on a new line.

Parameters

dictionary (dict) – Contains key and value which we want to print vertically.

Returns

This prints key1 : value1 and key2 : value2 … in vertical format

Return type

str
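A minimal sketch of this conversion:

```python
def vertical_dict_view(dictionary):
    # One "key : value" pair per line.
    return "\n".join(f"{key} : {value}" for key, value in dictionary.items())

print(vertical_dict_view({"included": 120, "excluded": 45}))
# included : 120
# excluded : 45
```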

systematic_review.citation module

Module: citation. This module contains functions which change the format of citations or get details from them. It also includes functions to fix some typos.

class systematic_review.citation.Citations(citations_files_parent_folder_path, title_column_name: str = 'title', text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words')[source]

Bases: object

create_citations_dataframe() pandas.core.frame.DataFrame[source]

Executes the citation step. This function loads all the citations from the path, adds the columns required for the next steps, and removes duplicates.

Returns

DataFrame with additional columns needed for the next steps of the systematic review, with duplicates removed

Return type

pandas.DataFrame object

get_dataframe()[source]

executes the create_citations_dataframe function and outputs the pd.DataFrame.

Returns

outputs the citations data.

Return type

pd.DataFrame

get_records_list() List[Dict[str, Any]][source]

Executes the citation step. This function loads all the citations from the path, adds the columns required for the next steps, and removes duplicates.

Returns

list with additional columns needed for the next steps of the systematic review, with duplicates removed

Return type

List[Dict[str, Any]]

to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – name of the output file, which should contain the .csv extension

  • index (bool) – defines whether an index is needed in the output csv file.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – name of the output file, which should contain the .xlsx extension

  • index (bool) – defines whether an index is needed in the output excel file.

systematic_review.citation.add_citation_text_column(dataframe_object: pandas.core.frame.DataFrame, title_column_name: str = 'title', abstract_column_name: str = 'abstract', keyword_column_name: str = 'keywords') pandas.core.frame.DataFrame[source]

This takes a dataframe of citations and returns the full text comprising “title”, “abstract”, and “keywords”.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • title_column_name (str) – name of the column which contains the citation title

  • abstract_column_name (str) – name of the column which contains the citation abstract

  • keyword_column_name (str) – name of the column which contains the citation keywords

Returns

this is the dataframe_object with an additional full-text column.

Return type

pd.DataFrame

systematic_review.citation.add_multiple_sources_column(citation_dataframe: pandas.core.frame.DataFrame, group_by: list = ['title', 'year']) pandas.core.frame.DataFrame[source]

This function checks whether a citation or article title is available from more than one source and adds a column named ‘multiple_sources’ to the dataframe with a list of the source names.

Parameters
  • citation_dataframe (pandas.DataFrame object) – input dataset which contains citations or article titles available from more than one source.

  • group_by (list) – column label or sequence of labels, optional. Only consider certain columns for identifying citations or article titles with more than one source; by default use all of the columns.

Returns

DataFrame with additional column with list of sources names

Return type

pandas.DataFrame object

systematic_review.citation.citations_to_ris_converter(input_file_path: str, output_filename: str = 'output_ris_file.ris', input_file_type: str = 'read_csv') None[source]

This asks for the citation column names from the tabular data and then converts the data to RIS format.

Parameters
  • input_file_path (str) – this is the path of the input file

  • output_filename (str) – this is the name of the output RIS file with extension; an output file path is also a valid choice.

  • input_file_type (str) – this function defaults to csv, but other formats are also supported by passing ‘read_{file_type}’, e.g. input_file_type = ‘read_excel’. All file types supported by pandas can be used by passing pandas IO tools methods. For more info visit https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Returns

Return type

None
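The conversion can be sketched as below; `records_to_ris` and `tag_map` are illustrative assumptions standing in for the column-name prompts the real function uses:

```python
def records_to_ris(records, tag_map):
    # Map each citation dict to a block of "TAG  - value" lines,
    # terminated by the RIS end-of-record tag "ER".
    lines = []
    for record in records:
        lines.append("TY  - JOUR")
        for column, tag in tag_map.items():
            if record.get(column):
                lines.append(f"{tag}  - {record[column]}")
        lines.append("ER  - ")
    return "\n".join(lines)

records = [{"title": "First study", "year": 2020}]
print(records_to_ris(records, {"title": "TI", "year": "PY"}))
```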

systematic_review.citation.drop_columns_based_on_column_name_list(dataframe: pandas.core.frame.DataFrame, column_name_list: list) pandas.core.frame.DataFrame[source]

This function drops columns based on the column names in the list.

Parameters
  • dataframe (pandas.DataFrame object) – This dataframe contains columns which we want to drop or remove.

  • column_name_list (list) – This is the name of dataframe columns to be removed

Returns

DataFrame with columns mentioned in column_name_list removed.

Return type

pandas.DataFrame object

systematic_review.citation.drop_duplicates_citations(citation_dataframe: pandas.core.frame.DataFrame, subset: list = ['title', 'year'], keep: Literal['first', 'last', False] = 'first', index_reset: bool = True) pandas.core.frame.DataFrame[source]

Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters
  • index_reset (bool) – if True, reset the index of the returned DataFrame.

  • citation_dataframe (pandas.DataFrame object) – Input dataset which contains duplicate rows

  • subset (list) – column label or sequence of labels, optional Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep (str) – options includes {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates.

Returns

DataFrame with duplicates removed

Return type

pandas.DataFrame object
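A sketch of this behaviour in terms of pandas drop_duplicates; the wrapper below is illustrative, not the package's implementation:

```python
import pandas as pd

def drop_duplicate_citations(df, subset=("title", "year"), keep="first", index_reset=True):
    # Thin wrapper over pandas drop_duplicates, mirroring the parameters above.
    result = df.drop_duplicates(subset=list(subset), keep=keep)
    return result.reset_index(drop=True) if index_reset else result

citations = pd.DataFrame({
    "title": ["A", "A", "B"],
    "year": [2020, 2020, 2021],
    "source": ["scopus", "wos", "scopus"],
})
deduped = drop_duplicate_citations(citations)
print(len(deduped), list(deduped.index))  # 2 [0, 1]
```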

systematic_review.citation.drop_search_words_count_columns(dataframe, search_words_object: systematic_review.search_count.SearchWords) pandas.core.frame.DataFrame[source]

removes columns created based on the keywords.

Parameters
  • dataframe (pandas.DataFrame object) – This dataframe contains keywords columns which we want to drop or remove.

  • search_words_object (SearchWords) – this object comprises unique keywords in each keyword group; a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}

Returns

DataFrame with keywords columns removed.

Return type

pandas.DataFrame object

systematic_review.citation.edit_ris_citation_paste_values_after_regex_pattern(input_file_path: str, output_filename: str = 'output_file.ris', edit_line_regex: str = '^DO ', paste_value: str = 'ER  - ') None[source]

This is created to edit RIS files which don’t specify ER for ‘end of citation’; it pastes ER after the end point of each citation. ‘DO’ can be replaced with other RIS tags such as TY, JO, etc.

Parameters
  • input_file_path (str) – this is the path of input file

  • output_filename (str) – this is the name of the output ris file with extension.

  • edit_line_regex (str) – this is the regex to find RIS tag lines such as DO, TY, JO, etc.

  • paste_value (str) – this is the value to be pasted; most helpful is the ER RIS tag, which signifies the end of a citation.

Returns

Return type

None
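The edit can be sketched as a line-by-line regex pass; `paste_after_matching_lines` is a hypothetical in-memory stand-in for the file-based function above:

```python
import re

def paste_after_matching_lines(text, edit_line_regex=r"^DO ", paste_value="ER  - "):
    # Append paste_value on a new line after every line matching the regex,
    # as the RIS-repair behaviour described above does.
    pattern = re.compile(edit_line_regex)
    out = []
    for line in text.splitlines():
        out.append(line)
        if pattern.match(line):
            out.append(paste_value)
    return "\n".join(out)

ris = "TY  - JOUR\nDO  - 10.1000/x\nTY  - JOUR\nDO  - 10.1000/y"
fixed = paste_after_matching_lines(ris)
print(fixed.count("ER  - "))  # 2
```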

systematic_review.citation.get_details_of_all_article_name_from_citations(filtered_list_of_dict: list, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') list[source]

This function searches source names, doi, and url for all articles in filtered_list_of_dict.

Parameters
  • filtered_list_of_dict (list) – This is the list of article citations dict after filtering it using min_limit on grouped_keywords_count

  • sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}

  • doi_url (bool) – This signifies whether we want to get the value of url and doi from the citation

  • title_column_name (str) – This is the name of column which contain citation title

Returns

This list contains all article names with source names. (optional url and doi)

Return type

list

systematic_review.citation.get_details_via_article_name_from_citations(article_name: str, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') dict[source]

Iterates through citations to find article_name and put source_name in a column, with doi and url being optional

Parameters
  • article_name (str) – This is the primary title of the citation or name of the article.

  • sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}

  • doi_url (bool) – This signifies whether we want to get the value of url and doi from the citation

  • title_column_name (str) – This is the name of column which contain citation title

Returns

This dict contains the article_name, source_name and optional url and doi

Return type

dict

systematic_review.citation.get_missed_articles_source_names(missed_articles_list: list, all_articles_title_source_name_list_of_dict: list, article_column_name: str = 'article_name', source_column_name: str = 'source_name') list[source]
Parameters
  • missed_articles_list (list) – This contains the list of articles that got missed while downloading.

  • all_articles_title_source_name_list_of_dict (list) – This list contains all article names with source names. (optional url and doi)

  • article_column_name (str) – This is the name of article column in the all_articles_title_source_name_list_of_dict.

  • source_column_name (str) – This is the name of source column in the all_articles_title_source_name_list_of_dict.

Returns

This list contains articles_name and sources name.

Return type

list

systematic_review.converter module

Module: converter. This module contains functions related to file and data type conversion, such as list to txt file, pandas df to list of dicts, and many more.

class systematic_review.converter.ASReview(data: Union[List[dict], pandas.core.frame.DataFrame])[source]

Bases: object

get_file(output_filename: str = 'output.csv', index: bool = True)[source]

Outputs the file needed to start a project in ASReview.

Parameters
  • output_filename (str) – name or path of your needed file.

  • index (bool) – whether an index column is needed in the output file.

class systematic_review.converter.Reader(file_path: str)[source]

Bases: object

Contains functionality to read files.

get_text(pages: str = 'all')[source]

It understands the type of file and outputs the content of the file.

Parameters

pages (str) – contain option to read ‘first’ or ‘all’ pages.

Returns

This is the text content of the readable file.

Return type

str

pandas_reader(input_file_type)[source]

Read file using pandas IO https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Parameters

input_file_type (str) – check pandas IO for examples like read_csv, read_excel etc.

Returns

This is the required text from pandas IO.

Return type

str

pdf_pdftotext_reader(pages: str = 'all')[source]

Extract the text from a pdf file via pdftotext. For more info, visit: https://pypi.org/project/pdftotext/

Parameters

pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

pdf_pymupdf_reader(pages: str = 'all')[source]

Extract the text from a pdf file via fitz (PyMuPDF). For more info, visit: https://pypi.org/project/PyMuPDF/

Parameters

pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.add_preprocess_column(dataframe_object: pandas.core.frame.DataFrame, column_name: str = 'title')[source]

Takes dataframe and column name to apply preprocess function from string_manipulation module.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is the object containing the column which needs to be preprocessed.

  • column_name (str) – This is the name of the column of dataframe.

Returns

DataFrame with an additional preprocessed column.

Return type

pandas.DataFrame object

systematic_review.converter.apply_custom_function_on_dataframe_column(dataframe: pandas.core.frame.DataFrame, column_name: str, custom_function, new_column_name: Optional[str] = None, *args, **kwargs) pandas.core.frame.DataFrame[source]

This applies a custom function to every element of a dataframe column.

Parameters
  • new_column_name (str) – the new name for your modified column; the new column will be added to the dataframe without modifying the original column.

  • dataframe (pd.DataFrame) – the pandas dataframe containing the named column, whose elements can be transformed with the custom function.

  • column_name (str) – name of the dataframe column whose elements are to be transformed

  • custom_function – the custom function to be applied to each element of the pandas column.

Returns

This is transformed dataframe.

Return type

pd.DataFrame

systematic_review.converter.dataframe_column_counts(dataframe, column_name)[source]

Equivalent to pandas value_counts(); it returns the count of each unique element in the column

Parameters
  • dataframe (pd.DataFrame) – dataframe which contains column that is to be counted

  • column_name (str) – Name of pandas column elements are supposed to be counted.

Returns

unique column elements with counts

Return type

object

systematic_review.converter.dataframe_to_csv_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • output_filename (str) – name of the output file, which should contain the .csv extension

  • index (bool) – defines whether an index is needed in the output csv file.

systematic_review.converter.dataframe_to_excel_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • dataframe_object (pandas.DataFrame object) – this is an object of the pandas library. For more info: https://pandas.pydata.org/docs/

  • output_filename (str) – name of the output file, which should contain the .xlsx extension

  • index (bool) – defines whether an index is needed in the output excel file.

systematic_review.converter.dataframe_to_records_list(dataframe: pandas.core.frame.DataFrame) List[Dict[str, Any]][source]

converts pandas dataframe to the list of dictionaries (records).

Parameters

dataframe (pd.DataFrame) – this is the pandas dataframe whose rows are to be converted into dictionaries.

Returns

This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is first title”}, {‘primary_title’ : “this is second title”}, {‘primary_title’ : “this is third title”}]

Return type

List[Dict[str, Any]]
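This conversion matches pandas' records orientation; a one-line equivalent for illustration:

```python
import pandas as pd

df = pd.DataFrame({"primary_title": ["this is first title", "this is second title"]})
records = df.to_dict(orient="records")  # one dict per row
print(records)
# [{'primary_title': 'this is first title'}, {'primary_title': 'this is second title'}]
```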

systematic_review.converter.dict_key_value_to_records(dictionary: dict, key_column_name: str, value_column_name: str)[source]

converts {key: value, key1: value1, …} to records = [{key_column_name: key, value_column_name: value}, …], which can then be converted to a pd.DataFrame

Parameters
  • dictionary (dict) – hash map or dictionary that contains key and value pairs.

  • key_column_name (str) – name of the records column for the keys

  • value_column_name (str) – name of the records column for the values

Returns

This list is in records format.

Return type

list
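A plausible re-implementation, for illustration only:

```python
def dict_key_value_to_records(dictionary, key_column_name, value_column_name):
    # One record per key/value pair, ready for pd.DataFrame(records).
    return [
        {key_column_name: key, value_column_name: value}
        for key, value in dictionary.items()
    ]

records = dict_key_value_to_records({"2020": 12, "2021": 30}, "year", "count")
print(records)  # [{'year': '2020', 'count': 12}, {'year': '2021', 'count': 30}]
```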

systematic_review.converter.dict_values_data_type(dictionary)[source]

This provides the data types of the dictionary values by outputting a dictionary.

Parameters

dictionary (dict) – This is the dictionary which contains different types of object in values. Example - {“first”: [2, 5], “sec”: 3}

Returns

This will output {“<class ‘list’>”: [“first”], “<class ‘int’>”: [“sec”]}

Return type

dict
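A sketch matching the documented example; this is illustrative, not the package's source:

```python
def dict_values_data_type(dictionary):
    # Group the dictionary's keys by the type of their values.
    result = {}
    for key, value in dictionary.items():
        result.setdefault(str(type(value)), []).append(key)
    return result

print(dict_values_data_type({"first": [2, 5], "sec": 3}))
# {"<class 'list'>": ['first'], "<class 'int'>": ['sec']}
```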

systematic_review.converter.extract_pandas_df_column1_row_values_based_on_column2_value(pandas_dataframe, column2_value, column2name='source_name', column1name='article_name')[source]

extracts the row values of pandas dataframe column1 based on the value of column2

Parameters
  • pandas_dataframe (pd.DataFrame) – This is the pandas dataframe containing at least two columns with values.

  • column2_value (object) – This should be str in normal cases but can be any object type supported in pandas for column value.

  • column2name (str) – This is the name of the column by which we are extracting the column1 values.

  • column1name (str) – This is the name of the column whose values we require.

Returns

This is the list of the resultant values from column1 rows.

Return type

list

systematic_review.converter.get_pdf_object_from_pdf_path(pdf_file_path: str)[source]

Extracts a pdf object from the pdf file; looping and indexing over it yields the text per page.

Parameters

pdf_file_path (str) – This is the path of pdf file.

Returns

Return type

This is the pdf object with extracted text.

systematic_review.converter.get_text_from_multiple_pdf_reader(pdf_file_path: str, pages: str = 'all') Union[str, bool][source]

This function gets text from pdf files using pdftotext; if that fails, the text comes from pymupdf.

Parameters
  • pdf_file_path (str) – This is the path of pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf(pdf_file_path: str, pages: str = 'all', pdf_reader: str = 'pdftotext') Union[str, bool][source]

This function gets text from pdf files using either pdftotext or pymupdf.

Parameters
  • pdf_reader (str) – This is the python pdf reader package which converts pdf to text, either ‘pdftotext’ or ‘pymupdf’.

  • pdf_file_path (str) – This is the path of pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf_pdftotext(pdf_file_path: str, pages: str = 'all') str[source]

Extract the text from the pdf file via pdftotext. For more info, visit: https://pypi.org/project/pdftotext/

Parameters
  • pdf_file_path (str) – This is the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.get_text_from_pdf_pymupdf(pdf_file_path: str, pages: str = 'all') str[source]

Extract the text from the pdf file via fitz (PyMuPDF). For more info, visit: https://pypi.org/project/PyMuPDF/

Parameters
  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • pdf_file_path (str) – This is the path of pdf file.

Returns

This is the required text from pdf file.

Return type

str

systematic_review.converter.json_file_to_dict(json_file_path: str) dict[source]

Read the json file from the path given. Convert json file data to the python dictionary.

Parameters

json_file_path (str) – This is the path of the json file to be converted.

Returns

This is the data in dict format converted from json file.

Return type

dict
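
A minimal sketch of the same round trip with the standard library json module (the file content here is hypothetical):

```python
import json
import tempfile

# Write a small json file, then read it back into a python dict.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"first": [2, 5], "sec": 3}, f)
    json_file_path = f.name

with open(json_file_path) as f:
    data = json.load(f)
# data == {'first': [2, 5], 'sec': 3}
```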

systematic_review.converter.list_to_string(list_name)[source]

This converts a list to a text string, putting each element on a new line.

Parameters

list_name (list) – This is the python data structure list which contains some data.

Returns

This is the text string comprises of all data of list.

Return type

str

systematic_review.converter.list_to_text_file(filename: str, list_name: str, permission: str = 'w')[source]

This converts a list to a text file, putting each element on a new line.

Parameters
  • filename (str) – This is the name to be given for text file.

  • list_name (list) – This is the python data structure list which contains some data.

  • permission (str) – This is the file access mode, e.g. ‘w’ for write. For more info, see python’s built-in open().

Returns

Return type

None

systematic_review.converter.load_multiple_ris_citations_files(citations_files_parent_folder_path: str) List[dict][source]

This function loads all ris citations files from a folder.

Parameters

citations_files_parent_folder_path (str) – This is the path of the parent folder where the citations files exist.

Returns

This is the list of citation dicts inclusive of all citation files.

Return type

List[dict]

systematic_review.converter.load_multiple_ris_citations_files_to_dataframe(citations_files_parent_folder_path: str) pandas.core.frame.DataFrame[source]

This function loads all ris citations files from a folder.

Parameters

citations_files_parent_folder_path (str) – This is the path of the parent folder where the citations files exist.

Returns

This is the dataframe of citations inclusive of all citation files.

Return type

pd.DataFrame

systematic_review.converter.load_text_file(file_path: str, permission: str = 'r')[source]

This reads a text file and returns a file object over all its lines. For more info visit- https://docs.python.org/3/tutorial/inputoutput.html

Parameters
  • file_path (str) – This is the path or name of text file.

  • permission (str) – This is the file access mode, e.g. ‘r’ for read.

Returns

This contains all lines loaded.

Return type

file object

systematic_review.converter.records_list_to_dataframe(list_of_dicts: List[Dict[str, Any]]) pandas.core.frame.DataFrame[source]

Converts the list of dictionaries to a pandas dataframe.

Parameters

list_of_dicts (List[Dict[str, Any]]) – This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is the title”}]

Returns

This is the pandas dataframe consisting of all the data from the dictionaries, converted into respective rows.

Return type

pd.DataFrame
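
The conversion can be sketched with plain pandas, which turns each record dict into one row (the example data is hypothetical):

```python
import pandas as pd

list_of_dicts = [
    {"primary_title": "this is the title", "year": 2020},
    {"primary_title": "another title", "year": 2021},
]

# Each dict becomes a row; the keys become column names.
dataframe = pd.DataFrame(list_of_dicts)
```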

systematic_review.converter.remove_empty_lines(input_file_path: str, output_filename: str = 'output_file.ris') None[source]

This function removes the blank lines from the input file and outputs a new file.

Parameters
  • input_file_path (str) – This is the path of the input file.

  • output_filename (str) – This is the name of the output ris file, with extension.

Returns

Return type

None
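
A minimal sketch of the blank-line removal, assuming a small hypothetical .ris input:

```python
import tempfile

# Hypothetical ris content with blank lines between tags.
with tempfile.NamedTemporaryFile("w", suffix=".ris", delete=False) as src:
    src.write("TY  - JOUR\n\nTI  - Example\n\nER  -\n")
    input_file_path = src.name

output_filename = "output_file.ris"
with open(input_file_path) as infile, open(output_filename, "w") as outfile:
    for line in infile:
        if line.strip():  # keep only non-blank lines
            outfile.write(line)

with open(output_filename) as f:
    cleaned = f.read()
# cleaned == 'TY  - JOUR\nTI  - Example\nER  -\n'
```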

systematic_review.converter.ris_file_to_pandas_dataframe(ris_file_path: str) pandas.core.frame.DataFrame[source]

This uses ‘rispy’ to read the ris file into a list of dicts, then converts that list of dicts to a pandas.DataFrame.

Parameters

ris_file_path (str) – This is the path of ris citations file

Returns

dataframe object from pandas

Return type

pd.DataFrame

systematic_review.converter.ris_file_to_records_list(ris_file_path: str) List[Dict[str, Any]][source]

Converts a .ris file to a list of dictionaries of citations using rispy(https://pypi.org/project/rispy/). For more info on the ris format, visit: https://en.wikipedia.org/wiki/RIS_(file_format)

Parameters

ris_file_path (str) – This is the filepath of the ris file.

Returns

This list contains dictionaries of citations in records format, same as in pandas.

Return type

List[Dict[str, Any]]

systematic_review.converter.text_file_to_list(file_path: str, permission: str = 'r')[source]

This converts a text file to a list, with each line as a single element; get the first line of the text file with list[0].

Parameters
  • file_path (str) – This is the path or name of the text file.

  • permission (str) – This is the file access mode, e.g. ‘r’ for read.

Returns

This contains all lines loaded into list with one line per list element. [first line, second line,…. ]

Return type

list
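
The line-per-element behaviour can be sketched with the standard library:

```python
import tempfile

with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("first line\nsecond line\n")
    file_path = f.name

# splitlines() drops trailing newlines, giving one list element per line.
with open(file_path) as f:
    lines = f.read().splitlines()
# lines[0] == 'first line'
```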

systematic_review.converter.try_convert_dataframe_column_elements_to_list(dataframe: pandas.core.frame.DataFrame, column_name: str) List[list][source]

Tries to convert each element of a dataframe column to a list object.

Parameters
  • dataframe (pd.DataFrame) – The dataframe with column to convert into list

  • column_name (str) – Name of column for conversion

Returns

This is list with each element of type list.

Return type

List[list]

systematic_review.converter.unpack_list_of_lists(list_of_lists)[source]

Unpack a list containing other lists into an output list that includes all elements from the nested lists.

Parameters

list_of_lists (list) – This is a list consisting of elements and lists. Example - [“first_element”, [“second_element”]]

Returns

This is the resultant list consisting of only elements. example [“first_element”, “second_element”]

Return type

list
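
A minimal sketch of the one-level unpacking described above (a sketch of the behaviour, not the package's actual implementation):

```python
def unpack_list_of_lists(list_of_lists):
    """Flatten one level: nested lists are expanded, plain elements kept."""
    result = []
    for item in list_of_lists:
        if isinstance(item, list):
            result.extend(item)
        else:
            result.append(item)
    return result

flat = unpack_list_of_lists(["first_element", ["second_element"]])
# flat == ['first_element', 'second_element']
```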

systematic_review.converter.unpack_list_of_lists_with_optional_apply_custom_function(list_of_lists: List[list], custom_function=None) list[source]

Unpack lists inside a list into a new list containing all the elements from list_of_lists, with the optional custom_function applied to every element. Example - [[1,2,3], [3,4,5]] to [1,2,3,3,4,5]

Parameters
  • list_of_lists (List[list]) – This list contains lists as elements which might contains other elements.

  • custom_function – This is optional function to be applied on each element of list_of_lists

Returns

list containing all the elements with any optional transformation using custom_function.

Return type

list

systematic_review.converter.write_json_file_with_dict(output_file_path: str, input_dict: dict) None[source]

Write json file at output_file_path with the help of input dictionary.

Parameters
  • output_file_path (str) – This is the path of the output file we want; if only a name is provided, the json is exported next to the script.

  • input_dict (dict) – This is the python dictionary which we want to be saved in json file format.

Returns

The function doesn’t return anything but writes a json file at output_file_path.

Return type

None

systematic_review.filter_sort module

Module: filter_sort Description for filter: each searched words group can be used to filter with conditions such as a group count >= some value, repeatedly, until you have the required number of articles left to read and filter manually.

Description for sort: This puts the data in sorted order so it is easier for humans to understand.

class systematic_review.filter_sort.FilterSort(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, required_number: int)[source]

Bases: object

This contains functionality to filter and sort the data.

filter_and_sort() pandas.core.frame.DataFrame[source]

Executes the filter and sort step: creates a sorting criterion list, then sorts the dataframe based on it.

Returns

This is the sorted dataframe with columns in this sequential order: the citation columns first, then total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

get_dataframe()[source]

Executes the filter and sort function and outputs the pd.DataFrame.

Returns

Outputs the filtered and sorted data.

Return type

pd.DataFrame

get_records_list()[source]

Executes the filter and sort function and outputs the records list.

Returns

Outputs the filtered and sorted data.

Return type

List[dict]

to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .csv extension.

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .xlsx extension.

  • index (bool) – Define if index is needed in output excel file or not.

systematic_review.filter_sort.dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list, reverse: bool = False)[source]

Provide a sorting criterion list for dataframe columns: citations columns go on the left and search-word counts on the right. Setting reverse to True puts the search words on the left instead.

Parameters
  • reverse (bool) – Defaults to False, outputting citations columns on the left and keyword counts on the right. True does the opposite.

  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • sorting_keywords_criterion_list (list) – This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Returns

This is the dataframe sorting criterion list containing columns in the logical order we desire: citation details on the left, with total_keywords, group_keywords_counts, and keywords_counts on the right.

Return type

list

systematic_review.filter_sort.filter_and_sort(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, search_words_object: systematic_review.search_count.SearchWords, required_number: int) pandas.core.frame.DataFrame[source]

Executes the filter and sort step: creates a sorting criterion list, then sorts the dataframe based on it.

Parameters
  • required_number (int) – This is the least number of documents we want.

  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • search_words_object (object) – search_words_object should contain a dictionary of unique search words per keyword group, meaning a keyword from the first keyword group cannot appear in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the sorted dataframe with columns in this sequential order: the citation columns first, then total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

systematic_review.filter_sort.filter_dataframe_on_keywords_group_name_count(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, min_limit: int, common_word: str = '_count', method: str = 'suffix') List[dict][source]

This function gets the column names from a pandas dataframe which contain a given prefix or suffix. It then filters the dataframe so that all such columns have values of more than min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • min_limit (int) – This is the least value we want in all search_words_object group names.

  • common_word (str) – This is the similar word string in many column names.

  • method (str) – This is to specify if we are looking for prefix or suffix in column names.

Returns

This is the filtered citations list based on min_limit of grouped_keywords_counts.

Return type

List[dict]
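
The filtering idea can be sketched in plain pandas: keep only rows where every ‘_count’-suffixed column meets the minimum (the data below is hypothetical):

```python
import pandas as pd

# Hypothetical counts dataframe with '_count'-suffixed group columns.
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "keyword_group_1_count": [3, 0, 5],
    "keyword_group_2_count": [2, 4, 1],
})

min_limit = 1
count_cols = [c for c in df.columns if c.endswith("_count")]

# Keep only rows where every count column meets the minimum.
filtered = df[(df[count_cols] >= min_limit).all(axis=1)]
records = filtered.to_dict("records")
# rows 'a' and 'c' survive; 'b' has a zero group count
```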

systematic_review.filter_sort.finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0)[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

  • search (bool) – This signifies the status of the search for the best value of min_limit.

  • prev_lower_total_articles_rows (int) – This is the previous lower total articles rows

Returns

This prints the values rather than returning them. It returns search, which is of no practical use.

Return type

bool

systematic_review.filter_sort.get_dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df, unique_preprocessed_clean_grouped_keywords_dict)[source]

This sorting criterion list is based on the search words from the main input search_words_object. It contains total_keywords, group_keywords_counts, keywords_counts.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary of unique search words per keyword group, meaning a keyword from the first keyword group cannot appear in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}. ‘risk’ is removed from keyword_group_2.

Returns

This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Return type

list

systematic_review.filter_sort.get_pd_df_columns_names_with_prefix_suffix(input_pandas_dataframe: pandas.core.frame.DataFrame, common_word: str = '_count', method: str = 'suffix') List[str][source]

Provide the columns name from pandas dataframe which contains given prefix or suffix.

Parameters
  • input_pandas_dataframe (pd.DataFrame) – This dataframe contains many columns some of which contains the common word we are looking for.

  • common_word (str) – This is the similar word string in many column names.

  • method (str) – This is to specify if we are looking for prefix or suffix in column names.

Returns

This list contains the name of columns which follow above criteria.

Return type

List[str]
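
A minimal sketch of the prefix/suffix selection on a list of column names (the helper name here is hypothetical):

```python
def columns_with_common_word(column_names, common_word="_count", method="suffix"):
    """Return the names that start or end with the common word."""
    if method == "suffix":
        return [name for name in column_names if name.endswith(common_word)]
    return [name for name in column_names if name.startswith(common_word)]

names = ["title", "keyword_group_1_count", "keyword_group_2_count", "count_flag"]
suffix_cols = columns_with_common_word(names)                     # suffix match
prefix_cols = columns_with_common_word(names, "count", "prefix")  # prefix match
```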

systematic_review.filter_sort.manually_check_filter_by_min_limit_changes(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, min_limit: int = 1, iterations: int = 20, addition: int = 20)[source]

Manual method to check the number of articles based on changing min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • min_limit (int) – This is the least value we want in all search_words_object group names.

  • iterations (int) – This is the number of iterations for the underlying loop.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

Returns

This prints the values rather than returning the values.

Return type

None

systematic_review.filter_sort.return_finding_near_required_article_by_changing_min_limit_while_loop(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int)[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

Returns

This tuple consists of the following values, in order. Exact match values: min_limit, total_articles_rows; lower_info: min_limit, lower_total_articles_rows; upper_info: min_limit, upper_total_articles_rows.

Return type

tuple

systematic_review.filter_sort.return_finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0, upper_info=(None, None))[source]

This function increases the min_limit value to reach the required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper bounds of min_limit.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.

  • required_number_of_articles (int) – This is the number of articles you want after filtration process.

  • addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.

  • search (bool) – This signify the status of searching for best value of min_limit

  • prev_lower_total_articles_rows (int) – This is the previous lower total articles rows

  • upper_info (list) – This is list consists of [min_limit, upper_total_articles_rows]

Returns

This tuple consists of the following values, in order. Searching flag: True or False; exact match values: min_limit, total_articles_rows; lower_info: min_limit, lower_total_articles_rows; upper_info: min_limit, upper_total_articles_rows.

Return type

tuple

systematic_review.filter_sort.sort_citations_grouped_keywords_counts_df(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list) pandas.core.frame.DataFrame[source]

This function sorts the dataframe based on the sorting criterion list.

Parameters
  • citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.

  • sorting_keywords_criterion_list (list) – This is the sorting criterion list containing columns in the logical order we desire: total_keywords, group_keywords_counts, and keywords_counts last.

Returns

This is the sorted dataframe which contains columns in this sequential order: total_keywords, group_keywords_counts, and keywords_counts last.

Return type

pd.DataFrame

systematic_review.filter_sort.sort_dataframe_based_on_column(dataframe, column_name, ascending=True)[source]

Sort the dataframe based on column values.

Parameters
  • ascending (bool) – This decides increasing or decreasing sort order; defaults to ascending (a-z, 1-9).

  • dataframe (pd.DataFrame) – This is unsorted dataframe.

  • column_name (str) – This is the name of column which is used to sort the dataframe.

Returns

This is the sorted dataframe based on column_name.

Return type

pd.DataFrame

systematic_review.nlp module

Module: nlp (Natural language processing) This module contains functions related to removing stop words, lemmatization, and stemming approaches. Functions import the supporting AI model only when they are executed. For more examples and info visit: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

systematic_review.nlp.nltk_lancaster_stemmer(input_text: str) str[source]

This function returns stemmed text. Uses nltk.stem LancasterStemmer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.

Return type

str

systematic_review.nlp.nltk_porter_stemmer(input_text: str) str[source]

This function returns stemmed text. Uses nltk.stem PorterStemmer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.

Return type

str

systematic_review.nlp.nltk_remove_stopwords(text: str) str[source]

Remove unnecessary words such as she, are, of, which, and in.

Parameters

text (str) – This may contain any words in the dictionary.

Returns

This contains words other than stop words described in nltk english stop words.

Return type

str
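
A minimal sketch of stop-word removal; a small inline stop-word set stands in for nltk's full english list, which normally requires a one-time nltk.download("stopwords"):

```python
# Small inline stand-in for nltk's english stop-word list.
STOP_WORDS = {"she", "are", "of", "which", "and", "in", "the", "is"}

def remove_stopwords(text):
    # Keep only the words not in the stop-word set.
    return " ".join(word for word in text.split()
                    if word.lower() not in STOP_WORDS)

clean = remove_stopwords("risk is one of the factors which matter")
# clean == 'risk one factors matter'
```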

systematic_review.nlp.nltk_remove_stopwords_spacy_lemma(string_list_lower: str) List[str][source]

This function returns lemmatized text for a lowercase input string. Uses spacy en_core_web_sm

Parameters

string_list_lower (str) – This may contain any lowercase words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas.

Return type

List[str]

systematic_review.nlp.nltk_word_net_lemmatizer(input_text: str) str[source]

This function returns lemmatized text. Uses nltk.stem WordNetLemmatizer

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.nlp.pattern_lemma_or_lemmatize_text(input_text: str, lemma_info: bool = False) str[source]

This returns lemma information if lemma_info is True; otherwise it returns lemmatized text. Uses pattern.en lemma

Parameters
  • input_text (str) – This may contain any words in the dictionary.

  • lemma_info (bool) – This switch determines whether to return lemma information or lemmatized text.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.nlp.spacy_lemma(input_text: str) str[source]

This function returns lemmatized text. Uses spacy en_core_web_sm

Parameters

input_text (str) – This may contain any words in the dictionary.

Returns

This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.

Return type

str

systematic_review.os_utils module

Module: os_utils This module contains functions related to getting directories, files, and filenames from os paths.

systematic_review.os_utils.extract_files_path_from_directories_or_subdirectories(directory_path: str) list[source]

Get all file paths from the directory and its subdirectories.

Parameters

directory_path (str) – This is the directory path of files we require.

Returns

This list contains path of all the files contained in directory_path.

Return type

list
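
The directory walk can be sketched with os.walk, which visits every subdirectory and yields its filenames:

```python
import os
import tempfile

# Build a small directory tree with one file inside a subdirectory.
root = tempfile.mkdtemp()
sub = os.path.join(root, "sub")
os.mkdir(sub)
open(os.path.join(sub, "paper.pdf"), "w").close()

# os.walk visits every directory; join each filename to its dirpath.
file_paths = []
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        file_paths.append(os.path.join(dirpath, name))
# file_paths holds the full path of paper.pdf
```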

systematic_review.os_utils.extract_subdirectories_path_from_directory(directory_path: str) list[source]

Get all subdirectory paths from the directory.

Parameters

directory_path (str) – This is the directory path of sub directories we require.

Returns

This list contains path of all the sub directories contained in directory_path.

Return type

list

systematic_review.os_utils.get_all_filenames_in_dir(dir_path: str) List[str][source]

This provides all the names of files at dir_path.

Parameters

dir_path (str) – This is the path of folder we are searching files in.

Returns

This is the list of all the names of files at dir_path.

Return type

List[str]

systematic_review.os_utils.get_directory_file_name_and_path(dir_path: str) tuple[source]

Get file names and file paths from directory path.

Parameters

dir_path (str) – This is the path of the directory.

Returns

This tuple contains list of downloaded_articles_name_list and downloaded_articles_path_list.

Return type

tuple

systematic_review.os_utils.get_file_extension_from_path(file_path: str) str[source]

Returns the file extension from the filepath.

Parameters

file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)

Returns

A filename extension, file extension or file type is an identifier specified as a suffix to the name of a computer file. for more info visit- https://en.wikipedia.org/wiki/Filename_extension

Return type

str

systematic_review.os_utils.get_filename_from_path(file_path: str) str[source]

Returns the filename from the filepath.

Parameters

file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)

Returns

A filename or file name is a name used to uniquely identify a computer file in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Filename

Return type

str
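
Both the filename and the extension can be sketched with os.path (note that os.path.splitext keeps the leading dot; whether the package strips it is not shown here):

```python
import os

# Hypothetical pdf path.
file_path = "/home/user/articles/paper_one.pdf"

filename = os.path.basename(file_path)      # 'paper_one.pdf'
extension = os.path.splitext(file_path)[1]  # '.pdf'
```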

systematic_review.os_utils.get_path_leaf(file_path: str) str[source]
Extract file name from path. For more details visit:

https://stackoverflow.com/questions/8384737/extract-file-name-from-path-no-matter-what-the-os-path-format

Parameters

file_path (str) – This is the path of file.

Returns

This is name of file.

Return type

str

systematic_review.os_utils.get_sources_name_citations_mapping(dir_path: str) list[source]

This makes the list of {‘sources_name’: ‘all source articles citations’, …} from the dir path of ris files.

Parameters

dir_path (str) – This is the path of folder we are searching ris files in.

Returns

This is the list of all the source names and their citations at dir_path.

Return type

list

systematic_review.search_count module

Module: search_count This module contains all necessary functions for searching the citations and articles text and counting the number of search words present.

class systematic_review.search_count.SearchCount(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, *args, **kwargs)[source]

Bases: object

Used to search for search_words in citations and research papers. It can take and output both a records list and a pandas.DataFrame.

citation_text_column_name = 'citation_text'
count_search_words_in_citations_text(citations_records_list: List[Dict[str, Any]]) List[Dict[str, Any]][source]

Loop over each citation to count search words (SearchWords instance) in the citation data.

Parameters

citations_records_list (List[Dict[str, Any]]) – This list contains all the citations details, with a column named ‘full_text’ containing full text such as the article name, abstract, and keywords.

Returns

This is the list of all citation search results, which contains all our search word counts. Examples - [{‘title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]
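
A minimal sketch of the counting idea under the grouped-search-words format shown above; the helper and data here are hypothetical, not the package's actual implementation:

```python
# Hypothetical grouped search words, matching the Example format above.
search_words = {
    "keyword_group_1": ["management", "risk"],
    "keyword_group_2": ["corporate", "pricing"],
}

def count_in_text(text, search_words):
    """Count each word and each group's total in the (lowercased) text."""
    tokens = text.lower().split()
    result = {"total_keywords": 0}
    for group_name, words in search_words.items():
        group_count = 0
        for word in words:
            n = tokens.count(word)
            result[word] = n
            group_count += n
        result[group_name + "_count"] = group_count
        result["total_keywords"] += group_count
    return result

counts = count_in_text("risk management of corporate risk", search_words)
# counts["risk"] == 2, counts["keyword_group_1_count"] == 3
```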

count_search_words_in_research_paper_text(research_papers_records_list: List[Dict[str, Any]]) List[Dict[str, Any]][source]

Loop over validated research papers to count search words (SearchWords instance) in the research papers data.

Parameters

research_papers_records_list (List[Dict[str, Any]]) – This list contains data of all the research papers files contained in directory_path.

Returns

This is the list of all citation search results, which contains all our search word counts. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]

counts() List[Dict[str, Any]][source]

This takes a records list and returns search counts based on the type of data: citation data or research papers data.

Returns

Records list containing the search counts for the citation data or research papers data.

Return type

List[Dict[str, Any]]

download_flag_column_name = 'downloaded'
get_dataframe()[source]

Outputs the pandas.DataFrame containing the count results for the input data.

Returns

This is the dataframe of all citation search results, which contains all our search word counts. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

pandas.DataFrame

get_records_list() List[Dict[str, Any]][source]

Outputs the records list containing the count results for the input data.

Returns

This is the list of records containing all the search word counts from the input data. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]

Return type

List[Dict[str, Any]]

research_paper_file_location_column_name = 'file location'
to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .csv extension.

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of the output file, which should contain the .xlsx extension.

  • index (bool) – Define if index is needed in output excel file or not.

class systematic_review.search_count.SearchWords(search_words, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, default_search_words_group_name: str = 'search_words_group_', all_unique_keywords: bool = False, unique_keywords: bool = True, *args, **kwargs)[source]

Bases: object

This class contains all functionalities related to search words.

construct_search_words_from_list() dict[source]

This takes keywords_list, which contains search word strings such as [‘keyword1 keyword2 keyword3’, ‘keyword1 keyword2’], and constructs a dict such as {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}.

Returns

The dictionary pairing each group name with its search words string. Example - {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}

Return type

dict
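
As an illustration, the grouping behaviour described above can be sketched as a standalone function (a simplified reimplementation for clarity, not the package's actual source; the group_prefix parameter stands in for default_search_words_group_name):

```python
def construct_search_words_from_list(keywords_list, group_prefix="keyword_group_"):
    # Pair a numbered group name with each search-words string, in order.
    return {f"{group_prefix}{i}": words
            for i, words in enumerate(keywords_list, start=1)}

groups = construct_search_words_from_list(
    ["keyword1 keyword2 keyword3", "keyword1 keyword2"])
# {'keyword_group_1': 'keyword1 keyword2 keyword3',
#  'keyword_group_2': 'keyword1 keyword2'}
```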

creating_default_keyword_count_dict()[source]

Initialise keyword count dict with value 0 for every keyword.

Returns

This contains each keyword as a key, with 0 as its value.

Return type

dict

generate_keywords_count_dictionary(text)[source]
Parameters

text

get_sample_search_words_json(output_file_path: str = 'sample_search_words_template.json') None[source]

Outputs a sample search words json file template as an example, which the user can edit and upload.

Parameters

output_file_path (str) – This is the optional output file path for the json template.

Returns

The function creates the file in the root folder unless a path is specified in output_file_path.

Return type

None

get_sorting_keywords_criterion_list() List[str][source]

This sorting criterion list is based on the search words from the main input. It contains total_keywords, group_keywords_counts, and keywords_counts.

Returns

This is the sorting criterion list, with columns in the desired logical order: total_keywords first, then group_keywords_counts, with the individual search words last.

Return type

List[str]

preprocess_search_keywords_dictionary(grouped_keywords_dictionary: dict) dict[source]

This takes search words from the {keyword_group_name: search_words,…} dict and replaces symbols with spaces. It then converts them to lowercase and removes any duplicate keyword within each group. Outputs {keyword_group_name: [clean_keywords],…}.

Parameters

grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”,…}

Returns

This is the output dictionary, which contains the processed non-duplicate search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Return type

dict

preprocess_searched_keywords(grouped_keywords_dictionary: dict) dict[source]

Removes duplicate instances of search words that already appear in earlier search word groups.

Parameters

grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”,…}

Returns

This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Return type

dict

sample_dict = {'keywords_common_words': 'accuracy classification cross sectional cross-section expected metrics prediction predict expert system', 'keywords_finance': 'Management investing corporate pricing risk', 'keywords_machine_learning': 'neural fuzzy inference system artificial intelligence artificial computational neural networks'}
unique_keywords_in_preprocessed_clean_keywords_dict() set[source]

Returns the set of unique search words from the preprocessed_clean_keywords_dict.

Returns

This is the set of unique search words across all the search word groups.

Return type

set

systematic_review.search_count.adding_citation_details_with_keywords_count_in_pdf_full_text(filter_sorted_citations_df: pandas.core.frame.DataFrame, pdf_full_text_search_count: list, unique_preprocessed_clean_grouped_keywords_dict: dict, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') pandas.core.frame.DataFrame[source]

Combines the pdf search word counts with the citation details from the filtered and sorted citation full-text dataframe.

Parameters
  • second_column_name (str) – This is the name of the column which contains the pdf article title.

  • first_column_name (str) – This is the name of the column which contains the citation title.

  • filter_sorted_citations_df (pandas.DataFrame object) – This is the sorted dataframe whose columns are in this sequential order: the citation columns, then total_keywords, group_keywords_counts, and keywords_counts last.

  • pdf_full_text_search_count (list) – This is the list of all citation search results, which contains every search word count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Returns

This dataframe contains the citation details from the filtered and sorted citation full-text dataframe, along with the search word counts from searching the pdf file text.

Return type

pandas.DataFrame object

systematic_review.search_count.adding_dict_key_or_increasing_value(input_dict: dict, dict_key: str, step: int = 1, default_dict_value: int = 1)[source]

Increases the value stored at dict_key by step. If the key is not present, it is initialised with default_dict_value.

Parameters
  • input_dict (dict) – This is the dictionary which we want to modify.

  • dict_key (str) – This is the key of the dictionary.

  • step (int) – This is the amount by which the dictionary value is increased.

  • default_dict_value (int) – If the key is not in the dictionary, this default value is used when adding the new key.

Returns

This is the modified dictionary.

Return type

dict
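
A minimal sketch of this helper's documented behaviour (an illustrative reimplementation, not the package source):

```python
def adding_dict_key_or_increasing_value(input_dict, dict_key, step=1, default_dict_value=1):
    # Existing key: increase its value by step. Missing key: initialise with the default.
    if dict_key in input_dict:
        input_dict[dict_key] += step
    else:
        input_dict[dict_key] = default_dict_value
    return input_dict

counts = {}
adding_dict_key_or_increasing_value(counts, "risk")  # initialised to 1
adding_dict_key_or_increasing_value(counts, "risk")  # increased to 2
```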

systematic_review.search_count.citation_list_of_dict_search_count_to_df(citations_list: list, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over articles to calculate search word counts and returns a dataframe.

Parameters
  • title_column_name (str) – This is the name of the column which contains the citation title.

  • custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_list (list) – This is the citations list, with additional columns needed for the next steps of the systematic review and with duplicates removed.

  • keywords (dict) – This is the output dictionary, which contains the processed non-duplicate search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Returns

This is a pandas object of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.citation_search_count_dataframe(citations_df: pandas.core.frame.DataFrame, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over articles to calculate keyword counts and returns a dataframe.

Parameters
  • title_column_name (str) – This is the name of the column which contains the citation title.

  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_df (pandas.DataFrame object) – This is the DataFrame, with additional columns needed for the next steps of the systematic review and with duplicates removed.

  • keywords (dict) – This is the output dictionary, which contains the processed non-duplicate keywords. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”],…}

Returns

This is a pandas object of all citation search results, which contains every keyword count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.count_keywords_in_citations_full_text(dataframe_citations_with_fulltext: pandas.core.frame.DataFrame, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list[source]

Loops over articles to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • dataframe_citations_with_fulltext (pd.DataFrame) – This dataframe contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • title_column_name (str) – This is the name of the column which contains the citation title.

Returns

This is the list of all citation search results, which contains every keyword count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_keywords_in_citations_full_text_list(citations_with_fulltext_list: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list[source]

Loops over articles to calculate search word counts.

Parameters
  • custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_with_fulltext_list (list) – This list contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • title_column_name (str) – This is the name of the column which contains the citation title.

Returns

This is the list of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_keywords_in_pdf_full_text(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title_pdf', method: str = 'preprocess_string', custom=None) list[source]

Loops over article pdf files to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • title_column_name (str) – This is the name of the column which contains the citation title.

  • list_of_downloaded_articles_path (list) – This list contains the paths of all the pdf files contained in directory_path.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the list of all citation search results, which contains every keyword count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_search_words_in_citations_text(citations_with_fulltext_list: list, search_words_object: systematic_review.search_count.SearchWords, text_column_name: str = "'citation_text'", text_manipulation_method_name: str = 'preprocess_string', custom=None, custom_text_manipulation_function=None, *args, **kwargs) list[source]

Loops over articles to calculate search word counts.

Parameters
  • custom_text_manipulation_function (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • text_manipulation_method_name (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • citations_with_fulltext_list (list) – This list contains all the citation details, with a column named ‘full_text’ containing the full text such as the article name, abstract, and keywords.

  • search_words_object (SearchWords) – This is the SearchWords object, whose grouped search words look like {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

  • text_column_name (str) – This is the name of the column which contains the citation text.

Returns

This is the list of all citation search results, which contains every search word count. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

list

systematic_review.search_count.count_words_in_list_of_lists(list_of_lists: List[list]) dict[source]

Counts words in a list containing other lists of words.

Parameters

list_of_lists (List[list]) – Each element of this list is itself a list.

Returns

A dictionary with words as keys and counts as values.

Return type

dict
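
The tallying behaviour can be sketched as follows (an illustrative reimplementation using the standard library, not the package source):

```python
from collections import Counter

def count_words_in_list_of_lists(list_of_lists):
    # Flatten the inner lists and tally every word.
    counts = Counter()
    for inner_list in list_of_lists:
        counts.update(inner_list)
    return dict(counts)

word_counts = count_words_in_list_of_lists([["risk", "pricing"], ["risk"]])
# {'risk': 2, 'pricing': 1}
```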

systematic_review.search_count.creating_keyword_count_dict(unique_preprocessed_clean_grouped_keywords_dict: dict)[source]

Initialise keyword count dict with value 0 for every keyword.

Parameters

unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This contains each keyword as a key, with 0 as its value.

Return type

dict

systematic_review.search_count.get_sorting_keywords_criterion_list(unique_preprocessed_clean_grouped_keywords_dict: dict) list[source]

This sorting criterion list is based on the keywords from the main input keywords. It contains total_keywords, group_keywords_counts, keywords_counts.

Parameters

unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique keywords in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Returns

This is the sorting criterion list, with columns in the desired logical order: total_keywords first, then group_keywords_counts, with the individual keywords last.

Return type

list
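
The column ordering described above can be sketched as follows (an illustrative reimplementation; the ‘_count’ column-name suffix is an assumption inferred from the examples elsewhere in these docs, not confirmed package behaviour):

```python
def get_sorting_keywords_criterion_list(grouped_keywords):
    # total first, then one count column per group, then each keyword's own column.
    criterion_list = ["total_keywords"]
    criterion_list += [f"{group_name}_count" for group_name in grouped_keywords]
    for keywords in grouped_keywords.values():
        criterion_list += keywords
    return criterion_list

columns = get_sorting_keywords_criterion_list(
    {"keyword_group_1": ["management", "risk"],
     "keyword_group_2": ["corporate", "pricing"]})
# ['total_keywords', 'keyword_group_1_count', 'keyword_group_2_count',
#  'management', 'risk', 'corporate', 'pricing']
```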

systematic_review.search_count.pdf_full_text_search_count_dataframe(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame[source]

Loops over article pdf files to calculate keyword counts.

Parameters
  • custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • method (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase.

  • title_column_name (str) – This is the name of the column which contains the citation title.

  • list_of_downloaded_articles_path (list) – This list contains the paths of all the pdf files contained in directory_path.

  • unique_preprocessed_clean_grouped_keywords_dict (dict) – Looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}

Returns

This is the dataframe of all citation search results, which contains every keyword count. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, ‘management’: count, ‘investing’: count, ‘risk’: count, ‘keyword_group_2_count’: count, ‘corporate’: count, ‘pricing’: count,…}]

Return type

pandas.DataFrame object

systematic_review.search_count.remove_duplicates_keywords_from_next_groups(preprocessed_clean_grouped_keywords_dict: dict) dict[source]

Executes the deduplication step across keyword groups. It takes the processed {keyword_group_name: [clean_keywords],…} dict and removes a keyword from a group if it already appears in an earlier group.

Parameters

preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary of processed search words, deduplicated within each group but possibly repeated across groups. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”, “risk”],…}

Returns

This is the dictionary comprised of unique search words in each keyword group: a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}; ‘risk’ is removed from keyword_group_2.

Return type

dict
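
The cross-group deduplication described above can be sketched as follows (an illustrative reimplementation for clarity, not the package source):

```python
def remove_duplicates_keywords_from_next_groups(grouped_keywords):
    # A keyword kept in an earlier group is dropped from every later group.
    seen = set()
    unique_groups = {}
    for group_name, keywords in grouped_keywords.items():
        kept = [keyword for keyword in keywords if keyword not in seen]
        seen.update(kept)
        unique_groups[group_name] = kept
    return unique_groups

groups = remove_duplicates_keywords_from_next_groups(
    {"keyword_group_1": ["management", "investing", "risk"],
     "keyword_group_2": ["corporate", "pricing", "risk"]})
# 'risk' survives only in keyword_group_1
```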

systematic_review.string_manipulation module

Module: string_manipulation This module contains functions related to string case changes, preprocessing, and removing parts of strings.

systematic_review.string_manipulation.convert_string_to_lowercase(string: str) str[source]

Lowercase the given input string.

Parameters

string (str) – The string which might have uppercase characters in it.

Returns

This is the all-lowercase string.

Return type

str

systematic_review.string_manipulation.pdf_filename_from_filepath(article_path: str) str[source]

This takes the pdf path as input and cleans the pdf name by applying the preprocess function from the string_manipulation module.

Parameters

article_path (str) – This is the path of the pdf file.

Returns

This is the cleaned filename of the pdf.

Return type

str

systematic_review.string_manipulation.preprocess_string(string: str) str[source]

Replaces symbols in the string with spaces and lowercases the given input string. Example - ‘Df%$df’ -> ‘df df’

Parameters

string (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.

Returns

This is the cleaned string, free of symbols and containing only lowercase alpha characters and spaces.

Return type

str
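
The behaviour above can be sketched as follows (an illustrative reimplementation; collapsing each run of symbols into a single space is an assumption made to match the documented ‘Df%$df’ -> ‘df df’ example):

```python
import re

def preprocess_string(string):
    # Replace each run of non-letter characters with a single space, then lowercase.
    return re.sub(r"[^A-Za-z]+", " ", string).lower()

cleaned = preprocess_string("Df%$df")
# 'df df'
```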

systematic_review.string_manipulation.preprocess_string_to_space_separated_words(string: str) str[source]

Replaces symbols in the string with spaces, lowercases it, and collapses multiple spaces into single spaces. Example - ‘Df%$df’ -> ‘df df’

Parameters

string (str) – This can contain string words mixed with spaces and symbols.

Returns

The words arranged with single spaces, with the extra spaces and symbols removed.

Return type

str

systematic_review.string_manipulation.remove_non_ascii(string_list: list) list[source]

Removes non-ASCII characters from a list of tokenized words.

Parameters

string_list (list) – This list contains words which may contain non-ASCII characters.

Returns

This is the modified list after removing the non-ASCII characters.

Return type

list
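
One common way to implement this (a sketch of the described behaviour, not necessarily the package's exact approach) is to round-trip each word through an ASCII encode that silently drops anything outside the ASCII range:

```python
def remove_non_ascii(string_list):
    # encode(..., "ignore") drops any character outside the ASCII range.
    return [word.encode("ascii", "ignore").decode("ascii") for word in string_list]

words = remove_non_ascii(["café", "naïve", "plain"])
# ['caf', 'nave', 'plain']
```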

systematic_review.string_manipulation.replace_symbols_with_space(string: str) str[source]

Replaces symbols in the string with spaces. Example - ‘df%$df’ -> ‘df df’

Parameters

string (str) – This is input word string which contains unwanted symbols.

Returns

This is the string with symbols replaced by spaces; the remaining characters are otherwise unchanged.

Return type

str

systematic_review.string_manipulation.split_preprocess_string(text: str) list[source]

This splits the text into a list of words after applying the preprocess function from the string_manipulation module.

Parameters

text (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.

Returns

This is the cleaned list of word strings, free of symbols and containing only alpha characters.

Return type

list

systematic_review.string_manipulation.split_words_remove_duplicates(string_list: list) list[source]

This function takes a list of words or sentences and splits them into individual words. It also removes any repeated word from the list.

Parameters

string_list (list) – This is the input list, which contains words and groups of words. Example - [‘one’, ‘one two’]

Returns

This is the output list, which contains only unique individual words (via set()). Example - [‘one’, ‘two’]

Return type

list
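
A minimal sketch of this splitting-and-deduplication behaviour (an illustrative reimplementation; note that set() does not guarantee the order of the result, matching the docstring's use of set()):

```python
def split_words_remove_duplicates(string_list):
    # Split every entry into words and collect them in a set to drop repeats.
    unique_words = set()
    for entry in string_list:
        unique_words.update(entry.split())
    return list(unique_words)

words = split_words_remove_duplicates(["one", "one two"])
# ['one', 'two'] in some order -- set() does not preserve ordering
```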

systematic_review.string_manipulation.string_dict_to_lower(string_map: dict) dict[source]

This converts the dict values to lowercase. A similar function for lists is available as string_list_to_lower().

Parameters

string_map (dict) – These are the key: value pairs to be converted.

Returns

The output dict, with each value converted to lowercase.

Return type

dict

systematic_review.string_manipulation.string_list_to_lower(string_list: list) list[source]

This converts the list values to lowercase. A similar function for dicts is available as string_dict_to_lower().

Parameters

string_list (list) – This list contains the input strings to be converted to lowercase.

Returns

This is the output list, which contains the original input strings in lowercase.

Return type

list

systematic_review.string_manipulation.string_to_space_separated_words(text: str) str[source]

Takes a text string and outputs space-separated words.

Parameters

text (str) – This text may contain multiple spaces or trailing whitespace.

Returns

This is the space-separated word string with no trailing whitespace.

Return type

str
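
The whitespace normalisation described above is the classic split-and-join idiom (a sketch of the described behaviour):

```python
def string_to_space_separated_words(text):
    # str.split() with no argument collapses runs of whitespace and trims the ends.
    return " ".join(text.split())

result = string_to_space_separated_words("  df   df ")
# 'df df'
```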

systematic_review.string_manipulation.strip_string_from_right_side(string: str, value_to_be_stripped: str = '.pdf') str[source]

This function removes the given substring from the right side of the string.

Parameters
  • string (str) – This is the complete word or string. Example - ‘monster.pdf’

  • value_to_be_stripped (str) – This is the value to be removed from the right side. Example - ‘.pdf’

Returns

This is the trimmed string that contains the left part after some part removed from the right. Example - ‘monster’

Return type

str
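
A minimal sketch of this suffix removal (an illustrative reimplementation; note that str.rstrip would instead strip any trailing characters drawn from the set ‘.’, ‘p’, ‘d’, ‘f’, which is not what is wanted here):

```python
def strip_string_from_right_side(string, value_to_be_stripped=".pdf"):
    # Remove the suffix as a whole substring, only when it is actually present.
    if value_to_be_stripped and string.endswith(value_to_be_stripped):
        return string[:-len(value_to_be_stripped)]
    return string

name = strip_string_from_right_side("monster.pdf")
# 'monster'
```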

systematic_review.string_manipulation.text_manipulation_methods(text: str, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function: Optional[Callable[[str, Any, Any], str]] = None, *args, **kwargs) str[source]

This converts text using options such as the preprocess function or nlp module functions; for more information, see the respective methods. args and kwargs are passed through to custom_text_manipulation_function.

Parameters
  • kwargs (Dict[str, Any]) – These keyword arguments are passed through to custom_text_manipulation_function.

  • args (Tuple) – These positional arguments are passed through to custom_text_manipulation_function.

  • custom_text_manipulation_function (Callable[[str, Any, Any], str]) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it takes the text as its parameter, and the default preprocess_string operation is not applied.

  • text (str) – The string type text which is to be converted.

  • text_manipulation_method_name (str) – Provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own preprocessing function), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase, preprocess_string_to_space_separated_words.

Returns

This returns the converted text.

Return type

str

systematic_review.validation module

Module: validation This module contains functions related to validating that downloaded articles are the same as the ones we require. It also contains functions to get an article’s source name and to create lists of missed or duplicate articles.

class systematic_review.validation.ValidateWordsInText(words_string: str, text_string: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]

Bases: object

This checks words in given Text.

exact_words_checker_in_text() bool[source]

This checks for an exact match of the substring in the string and returns True or False based on success.

Returns

This returns True if the exact words_string is found in text_string, else False.

Return type

bool
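
The exact-match check is essentially a substring test; a standalone sketch of the idea (not the class's actual method):

```python
def exact_words_checker_in_text(words_string, text_string):
    # True only when the whole words_string appears verbatim in text_string.
    return words_string in text_string

found = exact_words_checker_in_text("deep learning", "a survey of deep learning methods")
# True
```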

jumbled_words_percentage_checker_in_text() tuple[source]

Starts calculating the match percentage once half of the words are found in sequence. This also takes into consideration words that got jumbled up during the pdf reading operation.

Returns

This returns True if the words_string is found in text_string (else False), along with the matched substring percentage.

Return type

tuple

multiple_methods() tuple[source]

This method uses different checks to validate the article_name (substring) in the text. Example - exact_words, words_percentage, jumbled_words_percentage.

Returns

A True or False value, with True depicting a validated article; the matched percentage; and lastly the method used - exact_words, words_percentage, jumbled_words_percentage, or all if every method was executed to validate.

Return type

tuple

words_percentage_checker_in_text() tuple[source]

This checks for a match of the substring in the string and returns True or False based on success, along with the matched word percentage. Note - words_percentage_checker_in_text_validation_limit does not work properly if words_string has duplicate words.

Returns

This returns True if the words_string is found in text_string (else False), along with the matched substring percentage.

Return type

tuple
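
The idea behind the percentage check can be approximated as follows (a rough sketch only: the real method matches words in sequence and handles duplicates differently, as noted above; the set-based matching here is an assumption):

```python
def words_percentage_checker_in_text(words_string, text_string, validation_limit=70.0):
    # Fraction of the title's words that appear anywhere in the text, as a percentage.
    words = words_string.split()
    text_words = set(text_string.split())
    matched = sum(1 for word in words if word in text_words)
    percentage = 100.0 * matched / len(words)
    return percentage >= validation_limit, percentage

ok, percentage = words_percentage_checker_in_text(
    "risk management survey", "a broad survey of risk management practice")
# (True, 100.0)
```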

class systematic_review.validation.Validation(citations_data: Union[List[dict], pandas.core.frame.DataFrame], parents_directory_of_research_papers_files: str, text_file_path_of_inaccessible_research_papers: Optional[str] = None, text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words', words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]

Bases: object

This is used to validate the downloaded files.

add_downloaded_flag_column_and_file_location_column()[source]

Adds empty columns named after research_paper_file_location_column_name and download_flag_column_name.

Returns

The data containing the new columns.

Return type

List[dict]

check()[source]

Executes the validation of research articles in the citation data by checking the research paper files and validating that the research articles are correct.

Returns

data contains columns with downloaded, validation method and file path columns

Return type

List[dict]

cleaned_article_column_name = 'cleaned_title'
download_flag_column_name = 'downloaded'
file_invalidated_flag_name = 'wrong'
file_manual_check_flag_name = 'unreadable'
file_name_and_path_dict()[source]

contains mapping of filename to file paths

Returns

key is filename and value is file paths.

Return type

dict

file_not_accessible_flag_name = 'no access'
file_not_downloaded_flag_name = 'no'
file_validated_flag_name = 'yes'
get_dataframe()[source]

Outputs the pandas.DataFrame containing validation results of input data.

Returns

This is the dataframe which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.

Return type

pandas.DataFrame

get_records_list() List[Dict[str, Any]][source]

Outputs the records list containing validation results of input data.

Returns

This is the list of records which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.

Return type

List[Dict[str, Any]]

info()[source]

Equivalent to pandas.DataFrame.value_counts(), It return list with count of unique element in column

Returns

unique download_flag_column_name elements with counts

Return type

object

research_paper_file_location_column_name = 'file location'
to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to csv file.

Parameters
  • output_filename (str) – This is the name of output file which should contains .csv extension

  • index (bool) – Define if index is needed in output csv file or not.

to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]

This function saves pandas.DataFrame to excel file.

Parameters
  • output_filename (str) – This is the name of output file which should contains .xlsx extension

  • index (bool) – Define if index is needed in output excel file or not.

validation_manual_method_name = 'manual'
validation_method_column_name = 'validation method'
systematic_review.validation.add_dict_element_with_count(dictionary: dict, key: str) dict[source]

It increase the value by checking the key of dictionary or initialise new key with value 1. Works as collections module default dict with value 1.

Parameters
  • dictionary (dict) – This is the dictionary where we want to add element.

  • key (str) – This is the key of dictionary {key: value}

Returns

This is the edited dict with new elements counts.

Return type

dict

systematic_review.validation.amount_by_percentage(number: float, percentage: float) float[source]

get the amount of number based on percentage. example- 5% (percentage) of 10 (number) is 0.5 (result).

Parameters
  • number (float) – This is the input number from which we want some percent amount

  • percentage (float) – This is equivalent to math percentage.

Returns

This is resultant number.

Return type

float

systematic_review.validation.calculate_percentage(value: float, total: float) float[source]

calculate percentage of value in total.

Parameters
  • value (float) – It is input number, normally smaller than total.

  • total (float) – It is the larger number from which we want to know percentage

Returns

This is calculated percentage. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.compare_two_dict_members_via_percent_similarity(first_dict: dict, second_dict: dict) float[source]

Compare elements in 2 dictionaries and return percentage similarity.

Parameters
  • first_dict (dict) – Example - first_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1}

  • second_dict (dict) – Example - second_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1, ‘algorithm’: 1}

Returns

This is percentage represented as decimal number. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.compare_two_list_members_via_percent_similarity(words_list: list, boolean_membership_list: list) float[source]

Compare elements in 2 lists and return percentage similarity.

Parameters
  • words_list (list) – This contains elements whose elements to be checked for similarity.

  • boolean_membership_list (list) – This list contains True and False values.

Returns

This is percentage represented as decimal number. Example 98.36065573770492 that means 98.35%

Return type

float

systematic_review.validation.deep_validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple[source]

It produce list of matched columns rows and unmatched column rows based on same column from both.

Parameters
  • second_column_name (str) – This is the name of column which contain pdf article title.

  • first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • first_column_name (str) – This is the name of column which contain citation title.

Returns

matched_list - It contains column’s row which are matched in both data object. unmatched_list - It contains column’s row which are unmatched in both data object.

Return type

tuple

systematic_review.validation.dict_from_list_with_element_count(input_list)[source]

Put input list elements into dictionary with count.

Parameters

input_list (list) – This is the list with elements with some duplicates present.

Returns

This is dictionary key as list elements and value as list each element count

Return type

dict

systematic_review.validation.exact_words_checker_in_text(words_string: str, text_string: str) bool[source]

This checks for exact match of substring in string and return True or False based on success.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

Returns

This returns True if exact words_string found in text_string else False.

Return type

bool

systematic_review.validation.finding_missed_articles_from_downloading(validated_pdf_list: list, original_articles_list: list) tuple[source]

Checks how many articles are not downloaded yet from original list of articles.

Parameters
  • validated_pdf_list (list) – Contains name of pdf files whose filename is in the pdf text.

  • original_articles_list (list) – This is original list from where we started downloading the articles.

Returns

Missing_articles - these are the articles which are missed from downloading. Validated_articles - This is list of validated downloaded articles list.

Return type

tuple

systematic_review.validation.get_dataframe_column_as_list(dataframe: pandas.core.frame.DataFrame, column_name: str = 'primary_title')[source]

Get pandas dataframe column values as list.

Parameters
  • dataframe (pd.DataFrame) – This is the dataframe which contains column whose details we want as list.

  • column_name (str) – This is the name of the column.

Returns

This is the list containing the dataframe one column values.

Return type

list

systematic_review.validation.get_missed_articles_dataframe(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title') list[source]

return list of missed articles from downloading by checking original list of articles from filter_sorted_citations_df using downloaded articles path.

Parameters
  • title_column_name (str) – contains name of column which contain the name of article.

  • filter_sorted_citations_df (pd.DataFrame) – This dataframe contains records of selected articles including name of articles.

  • downloaded_articles_path (str) – contains parent folder of all the downloaded articles files.

Returns

list of the missed articles from downloading.

Return type

list

systematic_review.validation.get_missed_original_articles_list(original_article_list: list, downloaded_article_list: list) list[source]

This check elemets of the original_article_list in downloaded_article_list and return missed articles list.

Parameters
  • original_article_list (list) – This list elements are checked if they are present in other list.

  • downloaded_article_list (list) – This list is checked if it consists elements of other list

Returns

This contains missing elements of original_article_list in downloaded_article_list.

Return type

list

systematic_review.validation.getting_article_paths_from_validation_detail(list_of_validation: list) list[source]

Getting the first element from list of lists.

Parameters

list_of_validation (list) – This list contain list of three values where the first is article path.

Returns

This output list contains the articles paths

Return type

list

systematic_review.validation.jumbled_words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70, wrong_word_limit: int = 2) tuple[source]

start calculating percentage if half of words are found in sequence. This also takes in consideration of words which got jumbled up due to pdf reading operation.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

  • validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • wrong_word_limit (int) – This is the limit unto which algorithm ignore the wrong word in sequence.

Returns

This returns True if exact words_string found in text_string else False. This also returns matched substring percentage.

Return type

tuple

systematic_review.validation.manual_validating_of_pdf(articles_path_list: list, manual_index: int) tuple[source]

This is mostly a manually used function to validate some pdfs at the end of validation process. It makes it easy to search and validate pdf and store in a list. Advice: convert these lists as text file using function in converter module to avoid data loss.

Parameters
  • articles_path_list (list) – These are the list of articles which skipped our automated screening and validation algorithms. mostly due to pdf to text conversions errors.

  • manual_index (list) – This is the index from where you will start checking in article_path_list. Normally in many tries.

Returns

external_validation_list - This is the list to be saved externally for validated articles. external_invalidated_list - This is the list to be saved externally for invalidated articles.

Return type

tuple

systematic_review.validation.multiple_methods_validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple[source]

This function checks name of file and find the name in the text of pdf file. if it become successful then pdf is validated as downloaded else not downloaded. Example - pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pdf_reader (str) – This is python pdf reader package which convert pdf to text.

  • pdf_file_path (str) – the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

Returns

True and False value depicting validated article with True value. This also shows percentage matched Last it shows the text_manipulation_method_name used. like exact_words, words_percentage, jumbled_words_percentage, all if every text_manipulation_method_name is executed to validate.

Return type

tuple

systematic_review.validation.multiple_methods_validating_words_string_in_text(article_name: str, text: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2) tuple[source]

This text_manipulation_method_name uses different methods to validate the article_name(substring) in text. Example - exact_words, words_percentage, jumbled_words_percentage.

Parameters
  • jumbled_words_percentage_checker_in_text_wrong_word_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • jumbled_words_percentage_checker_in_text_validation_limit (int) – This is the limit unto which algorithm ignore the wrong word in sequence.

  • words_percentage_checker_in_text_validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

  • article_name (str) – This is input string which we want to validate in text.

  • text (str) – This is query string or lengthy text.

Returns

True and False value depicting validated article with True value. This also shows percentage matched Last it shows the text_manipulation_method_name used. like exact_words, words_percentage, jumbled_words_percentage, all if every text_manipulation_method_name is executed to validate.

Return type

tuple

systematic_review.validation.similarity_sequence_matcher(string_a: str, string_b: str) float[source]

Shows the percentage similarity between two strings like 0.9836065573770492 that means 98.35%

Parameters
  • string_a (str) – This is first string

  • string_b (str) – This is second string

Returns

This is the result of SequenceMatcher Example 0.9836065573770492 that means 98.35%

Return type

float

systematic_review.validation.validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple[source]

It produce list of matched columns rows and unmatched column rows based on same column from first list of dict. Note- emphasis on first list as function check all records of first list of dict in second list of dict. title column of second_list_of_dict is kept by merging with first.

Parameters
  • second_column_name (str) – This is the name of column which contain pdf article title.

  • first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name

  • first_column_name (str) – This is the name of column which contain citation title.

Returns

matched_list - It contains column’s row which are matched in both data object. unmatched_list - It contains column’s row which are unmatched in both data object.

Return type

tuple

systematic_review.validation.validating_multiple_pdfs_via_filenames(list_of_pdf_files_path: list, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple[source]

This function checks pdf files in list_of_pdf_files_path and validate them with function named ‘validating_pdf_via_filename’. Example - multiple pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • pdf_reader (str) – This is python pdf reader package which convert pdf to text.

  • list_of_pdf_files_path (list) – the list of the path of the pdf file.

Returns

validated_pdf_list - contains name of pdf files whose filename is in the pdf text invalidated_pdf_list - list of name of files which can’t be included in validated_pdf_list manual_pdf_list - list of files which can’t be opened using python pdf reader or errors opening them.

Return type

tuple

systematic_review.validation.validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', method: str = 'exact_words') bool[source]

This function checks name of file and find the name in the text of pdf file. if it become successful then pdf is validated as downloaded else not downloaded. Example - pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.

Parameters
  • pdf_file_path (str) – the path of the pdf file.

  • pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.

  • method (str) – This is the switch option to select text_manipulation_method_name from exact_words, words_percentage, jumbled_words_percentage.

Returns

True and False value depicting validated article with True value.

Return type

bool

systematic_review.validation.validating_pdfs_using_multiple_pdf_reader(pdfs_parent_dir_path: str) tuple[source]

This function uses two python readers pdftotext and pymupdf for validating if the filename are present inside of pdf file text.

Parameters

pdfs_parent_dir_path (str) – This is the parent directory of all the downloaded pdfs.

Returns

validated_pdf_list - contains name of pdf files whose filename is in the pdf text invalidated_pdf_list - list of name of files which can’t be included in validated_pdf_list manual_pdf_list - list of files which can’t be opened using python pdf reader or errors opening them.

Return type

tuple

systematic_review.validation.words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70) tuple[source]

This checks for exact match of substring in string and return True or False based on success. It also returns matched word percentage. Limit: this doesn’t work properly if words_string have duplicate words.

Parameters
  • words_string (str) – This is the word we are searching for.

  • text_string (str) – This is query string or lengthy text.

  • validation_limit (float) – This is the limit on similarity of checked substring. Example - 0.5 will return true if half of word found same.

Returns

This returns True if exact words_string found in text_string else False. This also returns matched substring percentage.

Return type

tuple