systematic_review package¶
systematic_review.analysis module¶
Module: analysis This module contains code for generating info, diagrams and tables. It can be used to generate systematic review flow and citation information.
- class systematic_review.analysis.Annotate(figure_axes, start_coordinate, end_coordinate, arrow_style='<|-')[source]¶
Bases:
object
This class makes it easier to draw arrows onto a matplotlib.pyplot.axes figure
- class systematic_review.analysis.CitationAnalysis(dataframe)[source]¶
Bases:
object
This takes any pandas dataframe containing citation details and produces analyses on various columns.
- authors_analysis(authors_column_name='authors')[source]¶
generates details based on a pandas dataframe column of article authors. example - number of authors, articles with single authors, articles per author, authors per article
- Parameters
authors_column_name (str) – Name of column containing authors details.
- Returns
contains number of authors, articles with single authors, articles per author, authors per article
- Return type
tuple
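The aggregation `authors_analysis` reports can be sketched with stdlib Python. The input format (one list of author names per article) and the helper name `author_stats` are assumptions for illustration, not the package's actual implementation:

```python
from collections import Counter

def author_stats(author_lists):
    """Basic author statistics from rows of author-name lists.

    author_lists: one inner list of author names per article.
    Returns (number of authors, articles with single authors,
             articles per author, authors per article).
    """
    unique_authors = Counter(name for names in author_lists for name in names)
    n_articles = len(author_lists)
    single_author_articles = sum(1 for names in author_lists if len(names) == 1)
    articles_per_author = n_articles / len(unique_authors)
    authors_per_article = sum(len(names) for names in author_lists) / n_articles
    return (len(unique_authors), single_author_articles,
            articles_per_author, authors_per_article)
```

For example, `author_stats([["A"], ["A", "B"]])` returns `(2, 1, 1.0, 1.5)`.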
- extract_keywords(column_name: str = 'keywords')[source]¶
return dataframe with a keywords column containing a single keyword per row, drawn from the keywords used in the articles.
- Parameters
column_name (str) – column name of keywords detail in citation dataframe
- keyword_diagram(column_name: str = 'keywords', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]¶
generates chart showing how many times different keywords are used in the articles.
- Parameters
pandas_bar_kind (str) – pandas plot option for the kind of chart needed. defaults to ‘bar’ in this implementation
column_name (str) – column name of keywords detail in citation dataframe
theme_style (str) – name of the bar chart theme
xaxis_label_rotation (float) – rotation of the column labels shown on the x axis, in degrees.
top_result (int) – This limits the number of unique column elements to be shown
method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’
diagram_fname (str) – filename or path of the diagram image to be saved.
kwargs (dict) – kwargs are also passed to
matplotlib.pyplot.savefig(**kwargs)
- keywords_info(column_name: str = 'keywords')[source]¶
return keywords and the number of times they are used in the articles
- Parameters
column_name (str) – column name of keywords detail in citation dataframe
- publication_place_diagram(column_name: str = 'place_published', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]¶
generates chart showing how many articles are published from different places or countries.
- Parameters
pandas_bar_kind (str) – pandas plot option for the kind of chart needed. defaults to ‘bar’ in this implementation
column_name (str) – column name of publication place detail in citation dataframe
theme_style (str) – name of the bar chart theme
xaxis_label_rotation (float) – rotation of the column labels shown on the x axis, in degrees.
top_result (int) – This limits the number of unique column elements to be shown
method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’
diagram_fname (str) – filename or path of the diagram image to be saved.
kwargs (dict) – kwargs are also passed to
matplotlib.pyplot.savefig(**kwargs)
- publication_place_info(column_name: str = 'place_published')[source]¶
shows how many articles are published from different places or countries.
- Parameters
column_name (str) – column name of publication place detail in citation dataframe
- Returns
contains publication place and count of publications
- Return type
object
- publication_year_diagram(column_name: str = 'year', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]¶
generates chart showing how many articles are published each year.
- Parameters
pandas_bar_kind (str) – pandas plot option for the kind of chart needed. defaults to ‘bar’ in this implementation
column_name (str) – column name of publication year detail in citation dataframe
theme_style (str) – name of the bar chart theme
xaxis_label_rotation (float) – rotation of the column labels shown on the x axis, in degrees.
top_result (int) – This limits the number of unique column elements to be shown
method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’
diagram_fname (str) – filename or path of the diagram image to be saved.
kwargs (dict) – kwargs are also passed to
matplotlib.pyplot.savefig(**kwargs)
- publication_year_info(column_name: str = 'year')[source]¶
shows how many articles are published each year.
- Parameters
column_name (str) – column name of publication year detail in citation dataframe
- Returns
contains year and count of publications
- Return type
object
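Under the hood this is a value count. A stdlib sketch of the same tally (the helper name is illustrative, not the package's API):

```python
from collections import Counter

def publications_per_year(years):
    """Count how many articles were published in each year,
    most frequent year first."""
    return Counter(years).most_common()
```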
- publisher_diagram(column_name: str = 'publisher', top_result=None, method: str = 'seaborn', theme_style='darkgrid', xaxis_label_rotation=90, pandas_bar_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]¶
generates chart showing how many articles are published by different publishers.
- Parameters
pandas_bar_kind (str) – pandas plot option for the kind of chart needed. defaults to ‘bar’ in this implementation
column_name (str) – column name of publisher detail in citation dataframe
theme_style (str) – name of the bar chart theme
xaxis_label_rotation (float) – rotation of the column labels shown on the x axis, in degrees.
top_result (int) – This limits the number of unique column elements to be shown
method (str) – option to plot the chart using either ‘seaborn’ or ‘pandas’
diagram_fname (str) – filename or path of the diagram image to be saved.
kwargs (dict) – kwargs are also passed to
matplotlib.pyplot.savefig(**kwargs)
- class systematic_review.analysis.SystematicReviewInfo(citations_files_parent_folder_path: Optional[str] = None, filter_sorted_citations_df: Optional[pandas.core.frame.DataFrame] = None, validated_research_papers_df: Optional[pandas.core.frame.DataFrame] = None, selected_research_papers_df: Optional[pandas.core.frame.DataFrame] = None)[source]¶
Bases:
object
This analyses the whole systematic review process and takes all produced files to generate tables and figures.
- download_flag_column_name = 'downloaded'¶
- file_validated_flag_name = 'yes'¶
- get_text_list() List[str] [source]¶
This produces the list of all analysis done in this class.
- Returns
This contains systematic review information in sentences.
- Return type
List[str]
- systematic_review_diagram(fig_width=10, fig_height=10, diagram_fname: Optional[str] = None, color: bool = True, color_info: bool = True, auto_fig_size: bool = True, hide_border: bool = True, **kwargs)[source]¶
This outputs the systematic review diagram resembling PRISMA guidelines.
- Parameters
kwargs (dict) – kwargs are also given to
matplotlib.pyplot.savefig(**kwargs)
hide_border (bool) – hides the border line drawn around the diagram.
auto_fig_size (bool) – this sets the figure size automatically based on the given data.
color (bool) – colors the inside of the diagram boxes; turn this off by passing False.
color_info (bool) – shows the meaning of the colors in the diagram.
diagram_fname (str) – filename or path of diagram image to be saved.
fig_width (float) – This is width of figure in inches.
fig_height (float) – This is height of figure in inches.
- class systematic_review.analysis.TextInBox(figure_axes, x_coordinate, y_coordinate, text='')[source]¶
Bases:
object
This is a matplotlib text-in-box class, to make it easier to use text boxes.
- systematic_review.analysis.analysis_of_multiple_ris_citations_files(citations_files_parent_folder_path: str) dict [source]¶
This function loads all ris citation files from the folder and returns a dict of database names and the number of citations collected from each database.
- Parameters
citations_files_parent_folder_path (str) – this is the path of parent folder of where citations files exists.
- Returns
this is dict of databases name and number of records in ris files.
- Return type
dict
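In the RIS format every record ends with an `ER` tag, so counting records per database reduces to counting those tags per file. A hedged stdlib sketch (the helper name and "one .ris file per database" layout are assumptions):

```python
import os

def count_ris_records(citations_files_parent_folder_path):
    """Map each .ris file's name (taken as the database name) to the
    number of records it holds, by counting 'ER' end-of-record tags."""
    counts = {}
    for name in sorted(os.listdir(citations_files_parent_folder_path)):
        if not name.lower().endswith(".ris"):
            continue
        path = os.path.join(citations_files_parent_folder_path, name)
        with open(path, encoding="utf-8") as fh:
            counts[os.path.splitext(name)[0]] = sum(
                1 for line in fh if line.startswith("ER")
            )
    return counts
```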
- systematic_review.analysis.creating_sample_review_file(selected_citation_df)[source]¶
This function outputs the dataframe with added columns to make literature review easier.
- Parameters
selected_citation_df (pandas.DataFrame object) – This dataframe is the result of last step of systematic-reviewpy. This contains records for manual literature review.
- Returns
This is dataframe with additional columns for helping in adding details of literature review.
- Return type
pandas.DataFrame object
- systematic_review.analysis.custom_box(**kwargs) dict [source]¶
This builds the options dict for a matplotlib text box.
- Parameters
kwargs (dict) – Contains key word arguments
- Returns
contains options
- Return type
dict
- systematic_review.analysis.duplicate_count(dataframe: pandas.core.frame.DataFrame) int [source]¶
return count of the duplicate articles.
- Parameters
dataframe (pd.DataFrame) – Input pandas dataframe where we want to check numbers of duplicates.
- Returns
number of duplicate records.
- Return type
int
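The count equals total rows minus unique rows. Over plain records (rather than a DataFrame) the same logic can be sketched as follows; the record format is an assumption for illustration:

```python
def count_duplicates(records):
    """Number of records that repeat an earlier record exactly."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(sorted(record.items()))  # order-independent row identity
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```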
- systematic_review.analysis.missed_article_count(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title')[source]¶
return the count of articles missed during downloading, by checking the original list of articles from filter_sorted_citations_df against the downloaded articles path.
- Parameters
title_column_name (str) – name of the column which contains the name of the article.
filter_sorted_citations_df (pd.DataFrame) – This dataframe contains records of selected articles, including the names of articles.
downloaded_articles_path (str) – parent folder of all the downloaded article files.
- Returns
count of the missed articles from downloading.
- Return type
int
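The comparison boils down to a set difference between selected titles and downloaded filenames. A stdlib sketch, assuming filenames are the cleaned titles plus an extension (the real matching rules may differ):

```python
import os

def missed_articles(selected_titles, downloaded_articles_path):
    """Titles with no matching file (compared by filename stem,
    case-insensitively) in the downloads folder."""
    downloaded = {
        os.path.splitext(name)[0].lower()
        for name in os.listdir(downloaded_articles_path)
    }
    return [title for title in selected_titles
            if title.lower() not in downloaded]
```

`len(missed_articles(...))` then gives the missed-article count.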
- systematic_review.analysis.pandas_countplot_with_pandas_dataframe_column(dataframe, column_name, top_result, plot_kind: str = 'bar', diagram_fname: Optional[str] = None, **kwargs)[source]¶
generate pandas count chart using dataframe column.
- Parameters
dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.
column_name (str) – Name of the pandas column whose elements are to be counted.
top_result (int) – This limits the number of unique column elements to be shown
plot_kind (str) – pandas plot option for the kind of chart needed. defaults to ‘bar’ in this implementation
diagram_fname (str) – filename or path of diagram image to be saved.
kwargs (dict) – kwargs are also given to
matplotlib.pyplot.savefig(**kwargs)
- systematic_review.analysis.seaborn_countplot_with_pandas_dataframe_column(dataframe, column_name, theme_style='darkgrid', xaxis_label_rotation=90, top_result=None, diagram_fname: Optional[str] = None, **kwargs)[source]¶
generate seaborn count bar chart using dataframe column.
- Parameters
diagram_fname (str) – filename or path of diagram image to be saved.
dataframe (pd.DataFrame) – dataframe which contains the column whose value counts are to be shown.
column_name (str) – Name of the pandas column whose elements are to be counted.
theme_style (str) – name of the bar chart theme
xaxis_label_rotation (float) – rotation of the column labels shown on the x axis, in degrees.
top_result (int) – This limits the number of unique column elements to be shown
kwargs (dict) – kwargs are also given to
matplotlib.pyplot.savefig(**kwargs)
- Returns
show the bar chart
- Return type
object
- systematic_review.analysis.text_padding_for_visualise(text: str, front_padding_space_multiple: int = 4, top_bottom_line_padding_multiple: int = 1)[source]¶
This adds the required spaces on all four sides of the text for a better look.
- Parameters
text (str) – This is the input word.
front_padding_space_multiple (int) – This multiply the left and right side of spaces for increased padding.
top_bottom_line_padding_multiple (int) – This multiply the top and down side of spaces for increased padding.
- Returns
str - text with spaces on all four sides. int - height, i.e. the number of lines. int - width, i.e. the number of characters in the longest line.
- Return type
tuple
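A sketch of the padding logic, under the assumption that the multiples translate directly into space and blank-line counts (the function's exact multipliers may differ):

```python
def pad_text(text, side_spaces=4, blank_lines=1):
    """Pad text with spaces left/right and blank lines above/below.

    Returns (padded_text, height_in_lines, width_in_chars)."""
    lines = text.splitlines() or [""]
    width = max(len(line) for line in lines) + 2 * side_spaces
    body = [(" " * side_spaces + line).ljust(width) for line in lines]
    blank = [" " * width] * blank_lines
    padded = blank + body + blank
    return "\n".join(padded), len(padded), width
```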
- systematic_review.analysis.vertical_dict_view(dictionary: dict) str [source]¶
convert a dict to a string with each element on a new line.
- Parameters
dictionary (dict) – Contains key and value which we want to print vertically.
- Returns
This prints key1 : value1 and key2 : value2 … in vertical format
- Return type
str
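A minimal sketch of the conversion:

```python
def vertical_dict_view(dictionary):
    """Render a dict as 'key : value' pairs, one per line."""
    return "\n".join(f"{key} : {value}" for key, value in dictionary.items())
```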
systematic_review.citation module¶
Module: citation This module contains functions which change format or get details from citations. It also includes functions to fix some typos.
- class systematic_review.citation.Citations(citations_files_parent_folder_path, title_column_name: str = 'title', text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words')[source]¶
Bases:
object
- create_citations_dataframe() pandas.core.frame.DataFrame [source]¶
Executes the citation step. This function loads all the citations from the path, adds required columns for the next steps, and removes duplicates.
- Returns
DataFrame with additional columns needed for next steps of systematic review and duplicates are removed
- Return type
pandas.DataFrame object
- get_dataframe()[source]¶
executes the create_citations_dataframe function and outputs the pd.DataFrame
- Returns
outputs the citations data.
- Return type
pd.DataFrame
- get_records_list() List[Dict[str, Any]] [source]¶
Executes the citation step. This function loads all the citations from the path, adds required columns for the next steps, and removes duplicates.
- Returns
list with additional columns needed for next steps of systematic review and duplicates are removed
- Return type
List[Dict[str, Any]]
- to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to csv file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .csv extension
index (bool) – Defines whether the index is needed in the output csv file or not.
- to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to excel file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .xlsx extension
index (bool) – Defines whether the index is needed in the output excel file or not.
- systematic_review.citation.add_citation_text_column(dataframe_object: pandas.core.frame.DataFrame, title_column_name: str = 'title', abstract_column_name: str = 'abstract', keyword_column_name: str = 'keywords') pandas.core.frame.DataFrame [source]¶
This takes a dataframe of citations and returns the full text comprised of “title”, “abstract”, and “keywords”
- Parameters
dataframe_object (pandas.DataFrame object) – this is the object of the python library pandas. for more info: https://pandas.pydata.org/docs/
title_column_name (str) – This is the name of the column which contains the citation title
abstract_column_name (str) – This is the name of the column which contains the citation abstract
keyword_column_name (str) – This is the name of the column which contains the citation keywords
- Returns
this is dataframe_object comprises of full text column.
- Return type
pd.DataFrame
- systematic_review.citation.add_multiple_sources_column(citation_dataframe: pandas.core.frame.DataFrame, group_by: list = ['title', 'year']) pandas.core.frame.DataFrame [source]¶
This function checks if a citation or article title is available from more than one source and adds a column named ‘multiple_sources’ to the dataframe with the list of source names.
- Parameters
citation_dataframe (pandas.DataFrame object) – Input dataset which contains citations or article titles available from more than one source.
group_by (list) – column label or sequence of labels, optional. Only consider certain columns for identifying citations or article titles with more than one source; by default use all of the columns.
- Returns
DataFrame with additional column with list of sources names
- Return type
pandas.DataFrame object
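The grouping step can be sketched over plain records; the `"source"` field name is an assumption for illustration:

```python
from collections import defaultdict

def sources_per_article(records, group_by=("title", "year")):
    """Map each (title, year) group to the list of source names
    under which it appears."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in group_by)
        groups[key].append(record["source"])
    return dict(groups)
```

Groups whose list holds more than one name are the articles available from multiple sources.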
- systematic_review.citation.citations_to_ris_converter(input_file_path: str, output_filename: str = 'output_ris_file.ris', input_file_type: str = 'read_csv') None [source]¶
This asks for citations columns name from tabular data and then convert the data to ris format.
- Parameters
input_file_path (str) – this is the path of input file
output_filename (str) – this is the name of the output ris file with extension. output file path is also valid choice.
input_file_type (str) – this function defaults to csv, but other formats are also supported by passing ‘read_{file_type}’, such as input_file_type = ‘read_excel’. all file types supported by pandas can be used by passing pandas IO tools method names. for more info visit- https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
- Returns
- Return type
None
- systematic_review.citation.drop_columns_based_on_column_name_list(dataframe: pandas.core.frame.DataFrame, column_name_list: list) pandas.core.frame.DataFrame [source]¶
This function drops columns based on the column names in the list.
- Parameters
dataframe (pandas.DataFrame object) – This dataframe contains columns which we want to drop or remove.
column_name_list (list) – This is the list of dataframe column names to be removed
- Returns
DataFrame with columns mentioned in column_name_list removed.
- Return type
pandas.DataFrame object
- systematic_review.citation.drop_duplicates_citations(citation_dataframe: pandas.core.frame.DataFrame, subset: list = ['title', 'year'], keep: Literal[first, last, False] = 'first', index_reset: bool = True) pandas.core.frame.DataFrame [source]¶
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.
- Parameters
index_reset (bool) – If True, reset the index of the returned DataFrame.
citation_dataframe (pandas.DataFrame object) – Input dataset which contains duplicate rows
subset (list) – column label or sequence of labels, optional Only consider certain columns for identifying duplicates, by default use all of the columns.
keep (str) – options include {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep.
‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
- Returns
DataFrame with duplicates removed
- Return type
pandas.DataFrame object
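The keep semantics mirror pandas.DataFrame.drop_duplicates; over plain records they can be sketched as:

```python
from collections import Counter

def drop_duplicate_records(records, subset=("title", "year"), keep="first"):
    """Remove duplicates, comparing only the subset fields.

    keep='first' keeps the first occurrence, 'last' the last,
    False drops every duplicated record entirely."""
    keys = [tuple(record[field] for field in subset) for record in records]
    if keep is False:
        counts = Counter(keys)
        return [r for r, k in zip(records, keys) if counts[k] == 1]
    if keep == "last":
        # keeping the last occurrence == keeping the first in reverse order
        return drop_duplicate_records(records[::-1], subset, "first")[::-1]
    seen, result = set(), []
    for record, key in zip(records, keys):
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result
```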
- systematic_review.citation.drop_search_words_count_columns(dataframe, search_words_object: systematic_review.search_count.SearchWords) pandas.core.frame.DataFrame [source]¶
removes columns created based on the keywords.
- Parameters
dataframe (pandas.DataFrame object) – This dataframe contains keywords columns which we want to drop or remove.
search_words_object (SearchWords) – This is the object comprised of unique keywords in each keyword group. It means a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}
- Returns
DataFrame with keywords columns removed.
- Return type
pandas.DataFrame object
- systematic_review.citation.edit_ris_citation_paste_values_after_regex_pattern(input_file_path: str, output_filename: str = 'output_file.ris', edit_line_regex: str = '^DO ', paste_value: str = 'ER - ') None [source]¶
This is created to edit ris files which don’t specify ER for ‘end of citations’; it pastes ER after the end point of each citation. Replace ‘DO’ with other ris classifiers such as TY, JO etc. as needed.
- Parameters
input_file_path (str) – this is the path of input file
output_filename (str) – this is the name of the output ris file with extension.
edit_line_regex (str) – this is the regex to find ris classifiers lines such as DO, TY, JO etc.
paste_value (str) – this is value to be pasted, most helpful is ER ris classifier which signify citation end.
- Returns
- Return type
None
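The core of the edit is: after every line matching the regex, insert the paste value as a new line. A minimal sketch over a list of lines (the real function reads and writes files):

```python
import re

def paste_after_matching_lines(lines, edit_line_regex="^DO ", paste_value="ER - "):
    """Insert paste_value as a new line after each line matching the regex."""
    pattern = re.compile(edit_line_regex)
    result = []
    for line in lines:
        result.append(line)
        if pattern.search(line):
            result.append(paste_value)
    return result
```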
- systematic_review.citation.get_details_of_all_article_name_from_citations(filtered_list_of_dict: list, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') list [source]¶
This function searches source names, doi, and url for all articles in filtered_list_of_dict.
- Parameters
filtered_list_of_dict (list) – This is the list of article citation dicts after filtering using min_limit on grouped_keywords_count
sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}
doi_url (bool) – This signifies whether we want to get the values of url and doi from the citation
title_column_name (str) – This is the name of the column which contains the citation title
- Returns
This list contains all article names with source names. (optional url and doi)
- Return type
list
- systematic_review.citation.get_details_via_article_name_from_citations(article_name: str, sources_name_citations_path_list_of_dict: list, doi_url: bool = False, title_column_name: str = 'title') dict [source]¶
Iterate through citations to find article_name and record its source_name, with doi and url being optional
- Parameters
article_name (str) – This is the primary title of the citation or name of the article.
sources_name_citations_path_list_of_dict (list) – This is the list of all the source names and their citations at dir_path. Example - {‘sources_name’: ‘all source articles citations’, …}
doi_url (bool) – This signifies whether we want to get the values of url and doi from the citation
title_column_name (str) – This is the name of the column which contains the citation title
- Returns
This dict contains the article_name, source_name and optional url and doi
- Return type
dict
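The lookup amounts to scanning each source's citation list for a title match. A hedged sketch; the `"source_name"` / `"citations"` record layout is assumed for illustration only:

```python
def find_article_source(article_name, sources, title_column_name="title"):
    """Return the article name and the first source whose citations
    contain it, or None if no source has it."""
    for source in sources:
        for citation in source["citations"]:
            if citation.get(title_column_name) == article_name:
                return {"article_name": article_name,
                        "source_name": source["source_name"]}
    return None
```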
- systematic_review.citation.get_missed_articles_source_names(missed_articles_list: list, all_articles_title_source_name_list_of_dict: list, article_column_name: str = 'article_name', source_column_name: str = 'source_name') list [source]¶
- Parameters
missed_articles_list (list) – This contains the list of articles that got missed while downloading.
all_articles_title_source_name_list_of_dict (list) – This list contains all article names with source names. (optional url and doi)
article_column_name (str) – This is the name of article column in the all_articles_title_source_name_list_of_dict.
source_column_name (str) – This is the name of source column in the all_articles_title_source_name_list_of_dict.
- Returns
This list contains articles_name and sources name.
- Return type
list
systematic_review.converter module¶
Module: converter This module contains functions related to file and data type conversion, such as list to txt file, pandas df to list of dicts, and many more.
- class systematic_review.converter.ASReview(data: Union[List[dict], pandas.core.frame.DataFrame])[source]¶
Bases:
object
- class systematic_review.converter.Reader(file_path: str)[source]¶
Bases:
object
Contains functionality to read files.
- get_text(pages: str = 'all')[source]¶
It understands the type of the file and outputs the content of the file.
- Parameters
pages (str) – contain option to read ‘first’ or ‘all’ pages.
- Returns
This is text in readable file.
- Return type
str
- pandas_reader(input_file_type)[source]¶
Read file using pandas IO https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
- Parameters
input_file_type (str) – check pandas IO for examples like read_csv, read_excel etc.
- Returns
This is the required text from pandas IO.
- Return type
str
- pdf_pdftotext_reader(pages: str = 'all')[source]¶
Extract the text from a pdf file via pdftotext. for more info, visit: https://pypi.org/project/pdftotext/
- Parameters
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
This is the required text from pdf file.
- Return type
str
- pdf_pymupdf_reader(pages: str = 'all')[source]¶
Extract the text from a pdf file via fitz(PyMuPDF). for more info, visit: https://pypi.org/project/PyMuPDF/
- Parameters
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
This is the required text from pdf file.
- Return type
str
- systematic_review.converter.add_preprocess_column(dataframe_object: pandas.core.frame.DataFrame, column_name: str = 'title')[source]¶
Takes a dataframe and a column name, and applies the preprocess function from the string_manipulation module.
- Parameters
dataframe_object (pandas.DataFrame object) – This is the object with the column which needs to be preprocessed.
column_name (str) – This is the name of the column of dataframe.
- Returns
DataFrame with additional column with preprocessed column.
- Return type
pandas.DataFrame object
- systematic_review.converter.apply_custom_function_on_dataframe_column(dataframe: pandas.core.frame.DataFrame, column_name: str, custom_function, new_column_name: Optional[str] = None, *args, **kwargs) pandas.core.frame.DataFrame [source]¶
This applies the custom function to every element of the dataframe column.
- Parameters
new_column_name (str) – This is the new name for the modified column; the new column will be added to the dataframe without modifying the original column.
dataframe (pd.DataFrame) – This is the pandas dataframe consisting of the named column, with elements that can be transformed by the custom function.
column_name (str) – name of the dataframe column whose elements are to be transformed
custom_function – This is the custom function to be applied to each element of the pandas column.
- Returns
This is transformed dataframe.
- Return type
pd.DataFrame
- systematic_review.converter.dataframe_column_counts(dataframe, column_name)[source]¶
Equivalent to pandas.DataFrame.value_counts(); it returns a list with the count of each unique element in the column
- Parameters
dataframe (pd.DataFrame) – dataframe which contains the column to be counted
column_name (str) – Name of the pandas column whose elements are to be counted.
- Returns
unique column elements with counts
- Return type
object
- systematic_review.converter.dataframe_to_csv_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to csv file.
- Parameters
dataframe_object (pandas.DataFrame object) – this is the object of the python library pandas. for more info: https://pandas.pydata.org/docs/
output_filename (str) – This is the name of the output file, which should contain the .csv extension
index (bool) – Define if index is needed in output csv file or not.
- systematic_review.converter.dataframe_to_excel_file(dataframe_object: pandas.core.frame.DataFrame, output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to excel file.
- Parameters
dataframe_object (pandas.DataFrame object) – this is the object of the python library pandas. for more info: https://pandas.pydata.org/docs/
output_filename (str) – This is the name of the output file, which should contain the .xlsx extension
index (bool) – Define if index is needed in output excel file or not.
- systematic_review.converter.dataframe_to_records_list(dataframe: pandas.core.frame.DataFrame) List[Dict[str, Any]] [source]¶
converts pandas dataframe to the list of dictionaries (records).
- Parameters
dataframe (pd.DataFrame) – This is the pandas dataframe with all data from the dictionaries converted into respective rows.
- Returns
This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is first title”}, {‘primary_title’ : “this is second title”}, {‘primary_title’ : “this is third title”}]
- Return type
List[Dict[str, Any]]
- systematic_review.converter.dict_key_value_to_records(dictionary: dict, key_column_name: str, value_column_name: str)[source]¶
converts {‘key’: value, ‘key1’: value1, …} to records = [{key_column_name: key, value_column_name: value}, …], which can then be converted to a pd.DataFrame
- Parameters
dictionary (dict) – hash map or dictionary that contains key and value pairs.
key_column_name (str) – name of records column
value_column_name (str) – name of records column
- Returns
This list is in records format.
- Return type
list
- systematic_review.converter.dict_values_data_type(dictionary)[source]¶
This provides the data types of dictionary values by outputting a dictionary.
- Parameters
dictionary (dict) – This is the dictionary which contains different types of object in values. Example - {“first”: [2, 5], “sec”: 3}
- Returns
This will output {“<class ‘list’>”: [“first”], “<class ‘int’>”: [“sec”]}
- Return type
dict
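A sketch of the grouping, reproducing the example from the docstring:

```python
def values_data_type(dictionary):
    """Group dictionary keys by the type of their values."""
    result = {}
    for key, value in dictionary.items():
        result.setdefault(str(type(value)), []).append(key)
    return result

# values_data_type({"first": [2, 5], "sec": 3})
# → {"<class 'list'>": ["first"], "<class 'int'>": ["sec"]}
```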
- systematic_review.converter.extract_pandas_df_column1_row_values_based_on_column2_value(pandas_dataframe, column2_value, column2name='source_name', column1name='article_name')[source]¶
extract the values of pandas dataframe column1’s rows based on the value of column2
- Parameters
pandas_dataframe (pd.DataFrame) – This is the pandas dataframe containing at least two columns with values.
column2_value (object) – This should be str in normal cases but can be any object type supported in pandas for column value.
column2name (str) – This is the name of the column by which we are extracting the column1 values.
column1name (str) – This is the name of the column whose values we require.
- Returns
This is the list of the resultant values from column1 rows.
- Return type
list
- systematic_review.converter.get_pdf_object_from_pdf_path(pdf_file_path: str)[source]¶
Extract text as a pdf object from the pdf file; looping and indexing over it yields text per page.
- Parameters
pdf_file_path (str) – This is the path of pdf file.
- Returns
- Return type
This is the pdf object with extracted text.
- systematic_review.converter.get_text_from_multiple_pdf_reader(pdf_file_path: str, pages: str = 'all') Union[str, bool] [source]¶
This function gets text from pdf files using pdftotext. if that fails, the text comes from pymupdf.
- Parameters
pdf_file_path (str) – This is the path of pdf file.
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
This is the required text from pdf file.
- Return type
str
- systematic_review.converter.get_text_from_pdf(pdf_file_path: str, pages: str = 'all', pdf_reader: str = 'pdftotext') Union[str, bool] [source]¶
This function gets text from pdf files using either pdftotext or pymupdf.
- Parameters
pdf_reader (str) – This is python pdf reader package which convert pdf to text.
pdf_file_path (str) – This is the path of pdf file.
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
This is the required text from pdf file.
- Return type
str
- systematic_review.converter.get_text_from_pdf_pdftotext(pdf_file_path: str, pages: str = 'all') str [source]¶
Extract the text from a pdf file via pdftotext. for more info, visit: https://pypi.org/project/pdftotext/
- Parameters
pdf_file_path (str) – This is the path of the pdf file.
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
This is the required text from pdf file.
- Return type
str
- systematic_review.converter.get_text_from_pdf_pymupdf(pdf_file_path: str, pages: str = 'all') str [source]¶
Extract the text from a pdf file via fitz(PyMuPDF). for more info, visit: https://pypi.org/project/PyMuPDF/
- Parameters
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
pdf_file_path (str) – This is the path of pdf file.
- Returns
This is the required text from pdf file.
- Return type
str
- systematic_review.converter.json_file_to_dict(json_file_path: str) dict [source]¶
Read the json file from the path given. Convert json file data to the python dictionary.
- Parameters
json_file_path (str) – This is the json file path which is needed to be converted.
- Returns
This is the data in dict format converted from json file.
- Return type
dict
- systematic_review.converter.list_to_string(list_name)[source]¶
This converts a list to a text string, putting each element on a new line.
- Parameters
list_name (list) – This is the python data structure list which contains some data.
- Returns
This is the text string comprises of all data of list.
- Return type
str
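A minimal sketch of this behaviour, assuming non-string elements are converted with str() before joining:

```python
def list_to_string(list_name):
    # Join the elements with newlines so each one sits on its own line.
    return "\n".join(str(element) for element in list_name)

text = list_to_string(["first line", "second line"])
```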
- systematic_review.converter.list_to_text_file(filename: str, list_name: str, permission: str = 'w')[source]¶
This writes a list to a text file, putting each element on a new line.
- Parameters
filename (str) – This is the name to be given for text file.
list_name (list) – This is the python data structure list which contains some data.
permission (str) – These are the os permissions given for the file. Check more info on the python library ‘os’.
- Returns
- Return type
None
- systematic_review.converter.load_multiple_ris_citations_files(citations_files_parent_folder_path: str) List[dict] [source]¶
This function loads all ris citations files from a folder.
- Parameters
citations_files_parent_folder_path (str) – this is the path of the parent folder where the citations files exist.
- Returns
this is list of citations dicts inclusive of all citation files.
- Return type
List[dict]
- systematic_review.converter.load_multiple_ris_citations_files_to_dataframe(citations_files_parent_folder_path: str) pandas.core.frame.DataFrame [source]¶
This function loads all ris citations files from a folder.
- Parameters
citations_files_parent_folder_path (str) – this is the path of the parent folder where the citations files exist.
- Returns
this is dataframe of citations dicts inclusive of all citation files.
- Return type
pd.DataFrame
- systematic_review.converter.load_text_file(file_path: str, permission: str = 'r')[source]¶
This reads a text file, getting all its lines via a file object. For more info visit- https://docs.python.org/3/tutorial/inputoutput.html
- Parameters
file_path (str) – This is the path or name of text file.
permission (str) – These are the os permissions given for the file.
- Returns
This contains all lines loaded.
- Return type
file object
- systematic_review.converter.records_list_to_dataframe(list_of_dicts: List[Dict[str, Any]]) pandas.core.frame.DataFrame [source]¶
Converts the list of dictionaries to a pandas dataframe.
- Parameters
list_of_dicts (List[Dict[str, Any]]) – This list contains the dictionaries inside as elements. Example - [{‘primary_title’ : “this is the title”}]
- Returns
This is the pandas dataframe consisted of all data from dictionaries converted into respective rows.
- Return type
pd.DataFrame
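The records-to-dataframe conversion is what `pandas.DataFrame` does natively with a list of dicts; a sketch of the documented behaviour (the records shown are made-up sample data):

```python
import pandas as pd

records = [{"primary_title": "this is the title", "year": 2020},
           {"primary_title": "another title", "year": 2021}]

# Each dict becomes one row; the dict keys become the column names.
df = pd.DataFrame(records)
```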
- systematic_review.converter.remove_empty_lines(input_file_path: str, output_filename: str = 'output_file.ris') None [source]¶
This function removes the blank lines from the input file and writes a new output file.
- Parameters
input_file_path (str) – this is the path of input file
output_filename (str) – this is the name of the output ris file with extension.
- Returns
- Return type
None
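A self-contained sketch of the blank-line removal, assuming "blank" means whitespace-only (the demo paths are throwaway temp files, not package defaults):

```python
import os
import tempfile

def remove_empty_lines(input_file_path, output_filename="output_file.ris"):
    # Copy across only the lines that contain something besides whitespace.
    with open(input_file_path) as src, open(output_filename, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(line)

# Demonstrate on a throwaway ris-style file:
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.ris")
out_path = os.path.join(tmp_dir, "out.ris")
with open(src_path, "w") as f:
    f.write("TY  - JOUR\n\nTI  - Some title\n\n")
remove_empty_lines(src_path, out_path)
```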
- systematic_review.converter.ris_file_to_pandas_dataframe(ris_file_path: str) pandas.core.frame.DataFrame [source]¶
This uses ‘rispy’ to read the ris file into a list of dicts, then converts the list of dicts to a pandas.DataFrame.
- Parameters
ris_file_path (str) – This is the path of ris citations file
- Returns
dataframe object from pandas
- Return type
pd.DataFrame
- systematic_review.converter.ris_file_to_records_list(ris_file_path: str) List[Dict[str, Any]] [source]¶
Converts a .ris file to a list of dictionaries of citations using rispy (https://pypi.org/project/rispy/). For more info on the ris format, visit: https://en.wikipedia.org/wiki/RIS_(file_format)
- Parameters
ris_file_path (str) – This is the filepath of the ris file.
- Returns
This list contains dictionaries of citations in records format, same as in pandas.
- Return type
List[Dict[str, Any]]
- systematic_review.converter.text_file_to_list(file_path: str, permission: str = 'r')[source]¶
This converts a text file to a list, putting each line in the list as a single element. Get the first line of the text file with list[0].
- Parameters
file_path (str) – This is the path or name of the text file.
permission (str) – These are the os permissions given for the file. Check more info on the python library ‘os’.
- Returns
This contains all lines loaded into list with one line per list element. [first line, second line,…. ]
- Return type
list
- systematic_review.converter.try_convert_dataframe_column_elements_to_list(dataframe: pandas.core.frame.DataFrame, column_name: str) List[list] [source]¶
Uses a try statement to convert each element of a dataframe column to a list object.
- Parameters
dataframe (pd.DataFrame) – The dataframe with column to convert into list
column_name (str) – Name of column for conversion
- Returns
This is list with each element of type list.
- Return type
List[list]
- systematic_review.converter.unpack_list_of_lists(list_of_lists)[source]¶
Unpack a list containing other lists into an output list that includes all elements from the inner lists.
- Parameters
list_of_lists (list) – this is list consisting of elements and lists. example [“first_element”, [“second_element”]]
- Returns
This is the resultant list consisting of only elements. example [“first_element”, “second_element”]
- Return type
list
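The one-level flattening described above can be sketched as follows, assuming non-list elements are kept as-is:

```python
def unpack_list_of_lists(list_of_lists):
    # Flatten one level: inner lists are expanded, plain elements are kept.
    flat = []
    for item in list_of_lists:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

result = unpack_list_of_lists(["first_element", ["second_element"]])
```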
- systematic_review.converter.unpack_list_of_lists_with_optional_apply_custom_function(list_of_lists: List[list], custom_function=None) list [source]¶
Unpack lists inside a list into a new list containing all the elements from list_of_lists, with an optional custom_function applied to every element. Example - [[1,2,3], [3,4,5]] to [1,2,3,3,4,5]
- Parameters
list_of_lists (List[list]) – This list contains lists as elements, which might contain other elements.
custom_function – This is an optional function to be applied to each element of list_of_lists
- Returns
list containing all the elements with any optional transformation using custom_function.
- Return type
list
- systematic_review.converter.write_json_file_with_dict(output_file_path: str, input_dict: dict) None [source]¶
Write json file at output_file_path with the help of input dictionary.
- Parameters
output_file_path (str) – This is the path of the output file we want; if only a name is provided, the json is exported to the script path.
input_dict (dict) – This is the python dictionary which we want to be saved in json file format.
- Returns
The function doesn’t return anything but writes a json file at output_file_path.
- Return type
None
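The json helpers above form a simple round trip; a sketch with the standard library (the temp path and sample dict are illustrative only):

```python
import json
import os
import tempfile

def write_json_file_with_dict(output_file_path, input_dict):
    # Serialise the dictionary to a json file at the given path.
    with open(output_file_path, "w") as f:
        json.dump(input_dict, f)

def json_file_to_dict(json_file_path):
    # Read the json file back into a python dictionary.
    with open(json_file_path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "search_words.json")
write_json_file_with_dict(path, {"keywords_finance": "risk pricing"})
restored = json_file_to_dict(path)
```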
systematic_review.filter_sort module¶
Module: filter_sort Description for filter: each searched keyword group can be used to filter with conditions such as searched-keyword count >= some value, narrowing the results until you have the required number of articles that can be manually read and filtered.
Description for sort: This puts the data in sorted order so it is easier for humans to understand.
- class systematic_review.filter_sort.FilterSort(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, required_number: int)[source]¶
Bases:
object
This contains functionality to filter and sort the data.
- filter_and_sort() pandas.core.frame.DataFrame [source]¶
Executes the filter and sort step: creates the sorting criterion list and sorts the dataframe based on it.
- Returns
- This is the sorted dataframe which contains columns in this sequential manner: the citation df columns first, then total_keywords, group_keywords_counts, and keywords_counts at the end.
- Return type
pd.DataFrame
- get_dataframe()[source]¶
Executes the filter and sort function and outputs the pd.DataFrame.
- Returns
outputs the filtered and sorted data.
- Return type
pd.DataFrame
- get_records_list()[source]¶
Executes the filter and sort function and outputs the records list.
- Returns
outputs the filtered and sorted data.
- Return type
List[dict]
- to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to csv file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .csv extension
index (bool) – Define if index is needed in output csv file or not.
- to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to excel file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .xlsx extension
index (bool) – Define if index is needed in output excel file or not.
- systematic_review.filter_sort.dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list, reverse: bool = False)[source]¶
- Provide a sorting criterion list for dataframe columns: citations columns on the left and search-word counts on the right. Setting reverse to True puts the search-word counts on the left.
- Parameters
reverse (bool) – default to False to output citations columns to the left and keyword counts on the right. On True it does opposite.
citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.
sorting_keywords_criterion_list (list) – This is the sorting criterion list which contains columns in the logical manner we desire. It contains total_keywords, group_keywords_counts, and keywords_counts in the last.
- Returns
This is the dataframe sorting criterion list which contains column in logical manner we desire. It contains citations details in the left while total_keywords, group_keywords_counts, and keywords_counts in the right.
- Return type
list
- systematic_review.filter_sort.filter_and_sort(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, search_words_object: systematic_review.search_count.SearchWords, required_number: int) pandas.core.frame.DataFrame [source]¶
Executes the filter and sort step: creates the sorting criterion list and sorts the dataframe based on it.
- Parameters
required_number (int) – This is the least number of documents we want.
citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.
search_words_object (object) – search_words_object should contain a dictionary comprised of unique search words in each keyword group, meaning a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}
- Returns
- This is the sorted dataframe which contains columns in this sequential manner: the citation df columns first, then total_keywords, group_keywords_counts, and keywords_counts at the end.
- Return type
pd.DataFrame
- systematic_review.filter_sort.filter_dataframe_on_keywords_group_name_count(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, min_limit: int, common_word: str = '_count', method: str = 'suffix') List[dict] [source]¶
This function gets the column names from a pandas dataframe which contain the given prefix or suffix. It then filters the dataframe to the point where all those prefix/suffix columns have values greater than min_limit.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.
min_limit (int) – This is the least value we want in all search_words_object group names.
common_word (str) – This is the similar word string in many column names.
method (str) – This is to specify if we are looking for prefix or suffix in column names.
- Returns
This is the filtered citations list based on min_limit of grouped_keywords_counts.
- Return type
List[dict]
- systematic_review.filter_sort.finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0)[source]¶
This function increases the min_limit value to reach up to required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper limits of min_limit.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.
required_number_of_articles (int) – This is the number of articles you want after filtration process.
addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.
search (bool) – This signify the status of searching for best value of min_limit
prev_lower_total_articles_rows (int) – This is the previous lower total articles rows
- Returns
This prints the values rather than returning them. It returns search, which is of no use.
- Return type
bool
- systematic_review.filter_sort.get_dataframe_sorting_criterion_list(citations_grouped_keywords_counts_df, unique_preprocessed_clean_grouped_keywords_dict)[source]¶
This sorting criterion list is based on the search words from the main input search_words_object. It contains total_keywords, group_keywords_counts, keywords_counts.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.
unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique search words in each keyword group, meaning a keyword from the first keyword group cannot be found in any other keyword group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”],…}. ‘risk’ is removed from keyword_group_2.
- Returns
This is the sorting criterion list which contains column in logical manner we desire. It contains total_keywords, group_keywords_counts, and keywords_counts in the last.
- Return type
list
- systematic_review.filter_sort.get_pd_df_columns_names_with_prefix_suffix(input_pandas_dataframe: pandas.core.frame.DataFrame, common_word: str = '_count', method: str = 'suffix') List[str] [source]¶
Provide the columns name from pandas dataframe which contains given prefix or suffix.
- Parameters
input_pandas_dataframe (pd.DataFrame) – This dataframe contains many columns some of which contains the common word we are looking for.
common_word (str) – This is the similar word string in many column names.
method (str) – This is to specify if we are looking for prefix or suffix in column names.
- Returns
This list contains the name of columns which follow above criteria.
- Return type
List[str]
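The prefix/suffix column selection can be sketched on a plain list of names; `columns_with_common_word` is a hypothetical stand-in for the documented function, which takes a dataframe rather than a list:

```python
def columns_with_common_word(column_names, common_word="_count", method="suffix"):
    # Select the names that end (suffix) or start (prefix) with common_word.
    if method == "suffix":
        return [name for name in column_names if name.endswith(common_word)]
    return [name for name in column_names if name.startswith(common_word)]

cols = ["title", "keyword_group_1_count", "keyword_group_2_count"]
count_columns = columns_with_common_word(cols)
```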
- systematic_review.filter_sort.manually_check_filter_by_min_limit_changes(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, min_limit: int = 1, iterations: int = 20, addition: int = 20)[source]¶
Manual method to check the number of articles based on changing min_limit.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.
required_number_of_articles (int) – This is the number of articles you want after filtration process.
min_limit (int) – This is the least value we want in all search_words_object group names.
iterations (int) – This is the number of iterations for the underlying loop.
addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.
- Returns
This prints the values rather than returning the values.
- Return type
None
- systematic_review.filter_sort.return_finding_near_required_article_by_changing_min_limit_while_loop(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int)[source]¶
This function increases the min_limit value to reach up to required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper limits of min_limit.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.
required_number_of_articles (int) – This is the number of articles you want after filtration process.
- Returns
This tuple consists of the following values in the same order – exact match values: (min_limit, total_articles_rows); lower_info: (min_limit, lower_total_articles_rows); upper_info: (min_limit, upper_total_articles_rows).
- Return type
tuple
- systematic_review.filter_sort.return_finding_required_article_by_changing_min_limit_recursively(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, required_number_of_articles: int, addition: int = 0, search: bool = True, prev_lower_total_articles_rows: int = 0, upper_info=(None, None))[source]¶
This function increases the min_limit value to reach up to required_number_of_articles. It returns the min_limit value if exactly required_number_of_articles can be extracted from the dataframe; otherwise it provides the lower and upper limits of min_limit.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This is input dataframe which contains some columns which have prefix or suffix in names.
required_number_of_articles (int) – This is the number of articles you want after filtration process.
addition (int) – This is the number by which you want to increase the min_limit on grouped keyword count.
search (bool) – This signify the status of searching for best value of min_limit
prev_lower_total_articles_rows (int) – This is the previous lower total articles rows
upper_info (list) – This is list consists of [min_limit, upper_total_articles_rows]
- Returns
This tuple consists of the following values in the same order – searching flag: True or False; exact match values: (min_limit, total_articles_rows); lower_info: (min_limit, lower_total_articles_rows); upper_info: (min_limit, upper_total_articles_rows).
- Return type
tuple
- systematic_review.filter_sort.sort_citations_grouped_keywords_counts_df(citations_grouped_keywords_counts_df: pandas.core.frame.DataFrame, sorting_keywords_criterion_list: list) pandas.core.frame.DataFrame [source]¶
This function sorts the dataframe based on the sorting criterion list.
- Parameters
citations_grouped_keywords_counts_df (pd.DataFrame) – This dataframe contains all columns with counts of search_words_object.
sorting_keywords_criterion_list (list) – This is the sorting criterion list which contains columns in the logical manner we desire. It contains total_keywords, group_keywords_counts, and keywords_counts in the last.
- Returns
pd.DataFrame – This is the sorted dataframe which contains columns in this sequential manner, with total_keywords, group_keywords_counts, and keywords_counts at the end.
- systematic_review.filter_sort.sort_dataframe_based_on_column(dataframe, column_name, ascending=True)[source]¶
sort the dataframe based on column values.
- Parameters
ascending (bool) – This decides increasing or decreasing sort order. Defaults to ascending a-z, 1-9.
dataframe (pd.DataFrame) – This is unsorted dataframe.
column_name (str) – This is the name of column which is used to sort the dataframe.
- Returns
This is the sorted dataframe based on column_name.
- Return type
pd.DataFrame
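Sorting on a single column is a thin wrapper over `pandas.DataFrame.sort_values`; a sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({"title": ["b", "a", "c"],
                   "total_keywords": [2, 5, 1]})

# ascending=False puts the highest keyword counts first.
sorted_df = df.sort_values(by="total_keywords", ascending=False)
```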
systematic_review.nlp module¶
Module: nlp (Natural language processing) This module contains functions related to removing stop words, lemmatization, and stemming approaches. Functions import the supporting AI model only when they are executed. For more examples and info visit: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
- systematic_review.nlp.nltk_lancaster_stemmer(input_text: str) str [source]¶
This function returns stemmed text. Uses nltk.stem LancasterStemmer
- Parameters
input_text (str) – This may contain any words in the dictionary.
- Returns
This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.
- Return type
str
- systematic_review.nlp.nltk_porter_stemmer(input_text: str) str [source]¶
This function returns stemmed text. Uses nltk.stem PorterStemmer
- Parameters
input_text (str) – This may contain any words in the dictionary.
- Returns
This output text contains the stems of words. Example - “car” is matched with words like “cars” but not “automobile”.
- Return type
str
- systematic_review.nlp.nltk_remove_stopwords(text: str) str [source]¶
Remove unnecessary words such as she, are, of, which, and in.
- Parameters
text (str) – This may contain any words in the dictionary.
- Returns
This contains the words other than the stop words in nltk’s english stop-word list.
- Return type
str
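The same idea can be sketched without nltk, using a small hand-rolled stop-word set purely for illustration (the real function uses nltk’s english stop-word list, which is much larger):

```python
# Illustrative stop-word set; nltk's actual list is far more complete.
STOP_WORDS = {"she", "are", "of", "which", "and", "in", "the", "is"}

def remove_stopwords(text):
    # Keep only the tokens that are not stop words.
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)

cleaned = remove_stopwords("risk management in the corporate sector")
```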
- systematic_review.nlp.nltk_remove_stopwords_spacy_lemma(string_list_lower: str) List[str] [source]¶
This function returns the lemmatized text of a lowercase input string. Uses spacy en_core_web_sm
- Parameters
string_list_lower (str) – This may contain any lowercase words in the dictionary.
- Returns
This output text contains word-forms which are linguistically valid lemmas.
- Return type
List[str]
- systematic_review.nlp.nltk_word_net_lemmatizer(input_text: str) str [source]¶
This function returns lemmatized text. Uses nltk.stem WordNetLemmatizer
- Parameters
input_text (str) – This may contain any words in the dictionary.
- Returns
This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.
- Return type
str
- systematic_review.nlp.pattern_lemma_or_lemmatize_text(input_text: str, lemma_info: bool = False) str [source]¶
This returns the lemma if lemma_info is True; otherwise it returns lemmatized text. Uses pattern.en lemma
- Parameters
input_text (str) – This may contain any words in the dictionary.
lemma_info (bool) – This is the switch variable which defines whether to return lemma information or lemmatized text.
- Returns
This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.
- Return type
str
- systematic_review.nlp.spacy_lemma(input_text: str) str [source]¶
This function returns lemmatized text. Uses spacy en_core_web_sm
- Parameters
input_text (str) – This may contain any words in the dictionary.
- Returns
This output text contains word-forms which are linguistically valid lemmas. Example - “car” is matched with words like “cars” and “automobile”.
- Return type
str
systematic_review.os_utils module¶
Module: os_utils This module contains functions related to getting directories, files, and filenames from os paths.
- systematic_review.os_utils.extract_files_path_from_directories_or_subdirectories(directory_path: str) list [source]¶
Gets all file paths from the directory and its subdirectories.
- Parameters
directory_path (str) – This is the directory path of files we require.
- Returns
This list contains path of all the files contained in directory_path.
- Return type
list
- systematic_review.os_utils.extract_subdirectories_path_from_directory(directory_path: str) list [source]¶
Gets all subdirectory paths from the directory.
- Parameters
directory_path (str) – This is the directory path of sub directories we require.
- Returns
This list contains path of all the sub directories contained in directory_path.
- Return type
list
- systematic_review.os_utils.get_all_filenames_in_dir(dir_path: str) List[str] [source]¶
This provides all the names of files at dir_path.
- Parameters
dir_path (str) – This is the path of folder we are searching files in.
- Returns
This is the list of all the names of files at dir_path.
- Return type
List[str]
- systematic_review.os_utils.get_directory_file_name_and_path(dir_path: str) tuple [source]¶
Get file names and file paths from directory path.
- Parameters
dir_path (str) – This is the path of the directory.
- Returns
This tuple contains list of downloaded_articles_name_list and downloaded_articles_path_list.
- Return type
tuple
- systematic_review.os_utils.get_file_extension_from_path(file_path: str) str [source]¶
Returns the file extension from the filepath.
- Parameters
file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)
- Returns
A filename extension, file extension or file type is an identifier specified as a suffix to the name of a computer file. for more info visit- https://en.wikipedia.org/wiki/Filename_extension
- Return type
str
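Both this helper and the following filename helper map directly onto the standard library; a sketch (the example path is illustrative, and note that `os.path.splitext` includes the leading dot in the extension):

```python
import os

def get_file_extension_from_path(file_path):
    # os.path.splitext returns (root, extension); the extension keeps its dot.
    return os.path.splitext(file_path)[1]

def get_filename_from_path(file_path):
    # basename strips the directory components, leaving the file name.
    return os.path.basename(file_path)

extension = get_file_extension_from_path("papers/article.pdf")
```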
- systematic_review.os_utils.get_filename_from_path(file_path: str) str [source]¶
Returns the filename from the filepath.
- Parameters
file_path (str) – A path is a string of characters used to uniquely identify a location in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Path_(computing)
- Returns
A filename or file name is a name used to uniquely identify a computer file in a directory structure. for more info visit- https://en.wikipedia.org/wiki/Filename
- Return type
str
- systematic_review.os_utils.get_path_leaf(file_path: str) str [source]¶
- Extract file name from path.
- Parameters
file_path (str) – This is the path of file.
- Returns
This is name of file.
- Return type
str
- systematic_review.os_utils.get_sources_name_citations_mapping(dir_path: str) list [source]¶
This builds a list of {‘sources_name’: ‘all source articles citations’, …} from the dir path of ris files.
- Parameters
dir_path (str) – This is the path of folder we are searching ris files in.
- Returns
This is the list of all the source names and their citations at dir_path.
- Return type
list
systematic_review.search_count module¶
Module: search_count This module contains all necessary functions for searching the citations and articles text and counting the number of search words present.
- class systematic_review.search_count.SearchCount(data: Union[List[dict], pandas.core.frame.DataFrame], search_words_object: systematic_review.search_count.SearchWords, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, *args, **kwargs)[source]¶
Bases:
object
Used to search for search words in citations and research papers. This can output both a records list and a pandas.DataFrame, and can take both as inputs.
- citation_text_column_name = 'citation_text'¶
- count_search_words_in_citations_text(citations_records_list: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
Loops over each citation to count search words (SearchWords instance) in the citation data.
- Parameters
citations_records_list (List[Dict[str, Any]]) – This list contains all the citations details with column named ‘full_text’ containing full text like article name, abstract and keyword.
- Returns
This is the list of all citations search result which contains our all search_words_object count. Examples - [{‘title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing: count”, “risk: count”, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]
- Return type
List[Dict[str, Any]]
- count_search_words_in_research_paper_text(research_papers_records_list: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
Loops over validated research papers to count search words (SearchWords instance) in the research papers data.
- Parameters
research_papers_records_list (List[Dict[str, Any]]) – This list contains data of all the research papers files contained in directory_path.
- Returns
This is the list of all citations search result which contains our all search_words_object count. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing: count”, “risk: count”, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]
- Return type
List[Dict[str, Any]]
- counts() List[Dict[str, Any]] [source]¶
This takes a records list and returns search counts based on the type of data: citation data or research papers data.
- Returns
records list containing the citation data or research papers data.
- Return type
List[Dict[str, Any]]
- download_flag_column_name = 'downloaded'¶
- get_dataframe()[source]¶
Outputs the pandas.DataFrame containing counts results of input data.
- Returns
This is the dataframe of all citations search result which contains our all search_words_object count. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing: count”, “risk: count”, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]
- Return type
pandas.DataFrame
- get_records_list() List[Dict[str, Any]] [source]¶
Outputs the records list containing counts results of input data.
- Returns
This is the list of records which contains all search_words_object count from input data. Examples - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing: count”, “risk: count”, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count,…}]
- Return type
List[Dict[str, Any]]
- research_paper_file_location_column_name = 'file location'¶
- to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to csv file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .csv extension
index (bool) – Define if index is needed in output csv file or not.
- to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to excel file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .xlsx extension
index (bool) – Define if index is needed in output excel file or not.
- class systematic_review.search_count.SearchWords(search_words, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function=None, default_search_words_group_name: str = 'search_words_group_', all_unique_keywords: bool = False, unique_keywords: bool = True, *args, **kwargs)[source]¶
Bases:
object
This class contains all functionalities related to search words.
- construct_search_words_from_list() dict [source]¶
This takes keywords_list, which contains search words as [‘keyword1 keyword2 keyword3’, ‘keyword1 keyword2’], and constructs a dict such as {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}
- Returns
the dictionary contains the group name and search_words_object paired as value Examples - {‘keyword_group_1’: ‘keyword1 keyword2 keyword3’, ‘keyword_group_2’: ‘keyword1 keyword2’}
- Return type
dict
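The list-to-dict construction described above can be sketched with a numbered dict comprehension (the `group_name_prefix` parameter is added here for illustration; the class uses its `default_search_words_group_name` setting):

```python
def construct_search_words_from_list(keywords_list,
                                     group_name_prefix="keyword_group_"):
    # Number the groups from 1, pairing each name with its keywords string.
    return {group_name_prefix + str(i): words
            for i, words in enumerate(keywords_list, start=1)}

groups = construct_search_words_from_list(
    ["keyword1 keyword2 keyword3", "keyword1 keyword2"])
```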
- creating_default_keyword_count_dict()[source]¶
Initialise the keyword count dict with value 0 for every keyword.
- Returns
This contains key as keyword and value as 0.
- Return type
dict
- get_sample_search_words_json(output_file_path: str = 'sample_search_words_template.json') None [source]¶
Outputs a sample search-words json file template as an example, which can be edited by the user to upload search words.
- Parameters
output_file_path (str) – this is optional output file path for json template
- Returns
The function creates the file in the root folder unless output_file_path specifies otherwise.
- Return type
None
- get_sorting_keywords_criterion_list() List[str] [source]¶
This sorting criterion list is based on the search words from the main input search_words_object. It contains total_keywords, group_keywords_counts, keywords_counts.
- Returns
This is the sorting criterion list which contains column in logical manner we desire. It contains total_keywords, group_keywords_counts, and search_words_object in the last.
- Return type
List[str]
- preprocess_search_keywords_dictionary(grouped_keywords_dictionary: dict) dict [source]¶
This takes search words from a {keyword_group_name: search_words, …} dict and replaces symbols with spaces. It then converts them to lowercase and removes any duplicate keyword within each group. Outputs {keyword_group_name: [clean_keywords], …}
- Parameters
grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”, …}
- Returns
This is the output dictionary, which contains the processed, duplicate-free search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”], …}
- Return type
dict
- preprocess_searched_keywords(grouped_keywords_dictionary: dict) dict [source]¶
Remove duplicate instances of search words from other search word groups.
- Parameters
grouped_keywords_dictionary (dict) – This is the input dictionary of search words used for the systematic review. Example - {‘keyword_group_name’: “Management investing corporate pricing risk Risk Pre-process”, …}
- Returns
This is the dictionary comprised of unique search words in each keyword group; a keyword from one group cannot be found in any other group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}. ‘risk’ is removed from keyword_group_2.
- Return type
dict
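A minimal sketch of the cross-group deduplication described above (illustrative only, not the package’s actual implementation): each keyword is kept only in the first group where it appears.

```python
def remove_duplicates_across_groups(grouped_keywords):
    # Keep each keyword only in the first group that contains it.
    seen = set()
    result = {}
    for group_name, keywords in grouped_keywords.items():
        unique = [word for word in keywords if word not in seen]
        seen.update(unique)
        result[group_name] = unique
    return result

deduped = remove_duplicates_across_groups({
    "keyword_group_1": ["management", "investing", "risk", "pre", "process"],
    "keyword_group_2": ["corporate", "pricing", "risk"],
})
# 'risk' is removed from keyword_group_2
```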
- sample_dict = {'keywords_common_words': 'accuracy classification cross sectional cross-section expected metrics prediction predict expert system', 'keywords_finance': 'Management investing corporate pricing risk', 'keywords_machine_learning': 'neural fuzzy inference system artificial intelligence artificial computational neural networks'}¶
- systematic_review.search_count.adding_citation_details_with_keywords_count_in_pdf_full_text(filter_sorted_citations_df: pandas.core.frame.DataFrame, pdf_full_text_search_count: list, unique_preprocessed_clean_grouped_keywords_dict: dict, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') pandas.core.frame.DataFrame [source]¶
Combines the PDF search word counts with the citation details from the filtered and sorted citation full-text dataframe.
- Parameters
second_column_name (str) – This is the name of the column which contains the PDF article title.
first_column_name (str) – This is the name of the column which contains the citation title.
filter_sorted_citations_df (pandas.DataFrame object) – This is the sorted dataframe whose columns appear in this order: the citation details, then total_keywords, group_keywords_counts, and keywords_counts last.
pdf_full_text_search_count (list) – This is the list of all citation search results, containing all our search word counts. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique search words in each keyword group; a keyword from one group cannot be found in any other group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}. ‘risk’ is removed from keyword_group_2.
- Returns
This dataframe contains citation details from the filtered and sorted citation full-text dataframe, and search word counts from searching the PDF file text.
- Return type
pandas.DataFrame object
- systematic_review.search_count.adding_dict_key_or_increasing_value(input_dict: dict, dict_key: str, step: int = 1, default_dict_value: int = 1)[source]¶
Increase the value of a dict (key: value) entry by step, using the key. If the key is not present, it is initialised with default_dict_value.
- Parameters
input_dict (dict) – This is the dictionary which we want to modify.
dict_key (str) – This is the key of the dictionary.
step (int) – This is the number by which the dictionary value is increased.
default_dict_value (int) – If the key is not available in the dictionary, this default value is used to add the new key.
- Returns
This is the modified dictionary
- Return type
dict
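The documented semantics can be sketched as follows (an illustrative re-implementation, not the package’s actual code):

```python
def adding_dict_key_or_increasing_value(input_dict, dict_key, step=1, default_dict_value=1):
    # Increment an existing key by step, or initialise it with the default value.
    if dict_key in input_dict:
        input_dict[dict_key] += step
    else:
        input_dict[dict_key] = default_dict_value
    return input_dict

counts = {}
adding_dict_key_or_increasing_value(counts, "risk")  # key missing: initialised to 1
adding_dict_key_or_increasing_value(counts, "risk")  # key present: increased to 2
```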
- systematic_review.search_count.citation_list_of_dict_search_count_to_df(citations_list: list, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame [source]¶
Loops over articles to calculate search word counts and returns a dataframe.
- Parameters
title_column_name (str) – This is the name of the column which contains the citation title
custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
citations_list (list) – list with the additional columns needed for the next steps of the systematic review, with duplicates removed
keywords (dict) – This is the output dictionary, which contains the processed, duplicate-free search words. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”], …}
- Returns
This is a pandas object of all citation search results, containing all our search word counts. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
pandas.DataFrame object
- systematic_review.search_count.citation_search_count_dataframe(citations_df: pandas.core.frame.DataFrame, keywords: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame [source]¶
Loops over articles to calculate keyword counts and returns a dataframe.
- Parameters
title_column_name (str) – This is the name of the column which contains the citation title
custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
citations_df (pandas.DataFrame object) – DataFrame with the additional columns needed for the next steps of the systematic review, with duplicates removed
keywords (dict) – This is the output dictionary, which contains the processed, duplicate-free keywords. Example - {‘keyword_group_name’: [“management”, “investing”, “corporate”, “pricing”, “risk”, “pre”, “process”], …}
- Returns
This is a pandas object of all citation search results, containing all our keyword counts. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
pandas.DataFrame object
- systematic_review.search_count.count_keywords_in_citations_full_text(dataframe_citations_with_fulltext: pandas.core.frame.DataFrame, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list [source]¶
Loops over articles to calculate keyword counts.
- Parameters
custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
dataframe_citations_with_fulltext (pd.DataFrame) – This dataframe contains all the citation details, with a column named ‘full_text’ containing the full text, i.e. the article name, abstract, and keywords.
unique_preprocessed_clean_grouped_keywords_dict (dict) – looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
title_column_name (str) – This is the name of the column which contains the citation title
- Returns
This is the list of all citation search results, containing all our keyword counts. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
list
- systematic_review.search_count.count_keywords_in_citations_full_text_list(citations_with_fulltext_list: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'title', method: str = 'preprocess_string', custom=None) list [source]¶
Loops over articles to calculate search word counts.
- Parameters
custom (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
citations_with_fulltext_list (list) – This list contains all the citation details, with a field named ‘full_text’ containing the full text, i.e. the article name, abstract, and keywords.
unique_preprocessed_clean_grouped_keywords_dict (dict) – looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
title_column_name (str) – This is the name of the column which contains the citation title
- Returns
This is the list of all citation search results, containing all our search word counts. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
list
- systematic_review.search_count.count_keywords_in_pdf_full_text(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title_pdf', method: str = 'preprocess_string', custom=None) list [source]¶
Loops over article PDF files to calculate keyword counts.
- Parameters
custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
title_column_name (str) – This is the name of the column which contains the citation title
list_of_downloaded_articles_path (list) – This list contains the paths of all the PDF files contained in directory_path.
unique_preprocessed_clean_grouped_keywords_dict (dict) – looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
- Returns
This is the list of all citation search results, containing all our keyword counts. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
list
- systematic_review.search_count.count_search_words_in_citations_text(citations_with_fulltext_list: list, search_words_object: systematic_review.search_count.SearchWords, text_column_name: str = "'citation_text'", text_manipulation_method_name: str = 'preprocess_string', custom=None, custom_text_manipulation_function=None, *args, **kwargs) list [source]¶
Loops over articles to calculate search word counts.
- Parameters
custom_text_manipulation_function (function) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it will take the text as its parameter, with no default preprocess_string operation.
text_manipulation_method_name (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
citations_with_fulltext_list (list) – This list contains all the citation details, with a field named ‘full_text’ containing the full text, i.e. the article name, abstract, and keywords.
search_words_object (SearchWords) – instance of the SearchWords class whose grouped search words look like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
text_column_name (str) – This is the name of the column which contains the citation text
- Returns
This is the list of all citation search results, containing all our search word counts. Example - [{‘primary_title’: ‘name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
list
- systematic_review.search_count.count_words_in_list_of_lists(list_of_lists: List[list]) dict [source]¶
Count the words in a list containing other lists of words.
- Parameters
list_of_lists (List[list]) – This list contains each element of type list.
- Returns
dictionary with words as keys and counts as values
- Return type
dict
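A minimal sketch of this word tally (illustrative only, not the package’s actual implementation):

```python
def count_words_in_list_of_lists(list_of_lists):
    # Flatten the nested lists and tally each word.
    counts = {}
    for inner_list in list_of_lists:
        for word in inner_list:
            counts[word] = counts.get(word, 0) + 1
    return counts

word_counts = count_words_in_list_of_lists([["risk", "pricing"], ["risk", "corporate"]])
```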
- systematic_review.search_count.creating_keyword_count_dict(unique_preprocessed_clean_grouped_keywords_dict: dict)[source]¶
Initialise keyword count dict with value 0 for every keyword.
- Parameters
unique_preprocessed_clean_grouped_keywords_dict (dict) – looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
- Returns
This contains key as keyword and value as 0.
- Return type
dict
- systematic_review.search_count.get_sorting_keywords_criterion_list(unique_preprocessed_clean_grouped_keywords_dict: dict) list [source]¶
This sorting criteria list is based on the keywords from the main input keywords. It contains total_keywords, group_keywords_counts, and keywords_counts.
- Parameters
unique_preprocessed_clean_grouped_keywords_dict (dict) – This is the dictionary comprised of unique keywords in each keyword group; a keyword from one group cannot be found in any other group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}. ‘risk’ is removed from keyword_group_2.
- Returns
This is the sorting criterion list, which contains the columns in the desired logical order: total_keywords, then group_keywords_counts, and the keywords last.
- Return type
list
- systematic_review.search_count.pdf_full_text_search_count_dataframe(list_of_downloaded_articles_path: list, unique_preprocessed_clean_grouped_keywords_dict: dict, title_column_name: str = 'cleaned_title', method: str = 'preprocess_string', custom=None) pandas.core.frame.DataFrame [source]¶
Loops over article PDF files to calculate keyword counts.
- Parameters
custom (function) – This is an optional custom function if you want to implement the preprocessing yourself. Pass it as custom=function_name; it will take the text as its parameter, with no default preprocess_string operation.
method (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase
title_column_name (str) – This is the name of the column which contains the citation title
list_of_downloaded_articles_path (list) – This list contains the paths of all the PDF files contained in directory_path.
unique_preprocessed_clean_grouped_keywords_dict (dict) – looks like this: {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}
- Returns
This is the dataframe of all citation search results, containing all our keyword counts. Example - [{‘article’: ‘article_name’, ‘total_keywords’: count, ‘keyword_group_1_count’: count, “management”: count, “investing”: count, “risk”: count, ‘keyword_group_2_count’: count, “corporate”: count, “pricing”: count, …}]
- Return type
pandas.DataFrame object
- systematic_review.search_count.remove_duplicates_keywords_from_next_groups(preprocessed_clean_grouped_keywords_dict: dict) dict [source]¶
Executes the search words step: takes search words from a {keyword_group_name: search_words, …} dict, replaces symbols with spaces, converts them to lowercase, removes any duplicate keyword within each group, outputs {keyword_group_name: [clean_keywords], …}, and then removes duplicate instances of search words from the other search word groups.
- Parameters
preprocessed_clean_grouped_keywords_dict (dict) – This is the output dictionary, which contains the processed search words (duplicate-free within each group). Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”, “risk”], …}
- Returns
This is the dictionary comprised of unique search words in each keyword group; a keyword from one group cannot be found in any other group. Example - {‘keyword_group_1’: [“management”, “investing”, “risk”, “pre”, “process”], ‘keyword_group_2’: [“corporate”, “pricing”], …}. ‘risk’ is removed from keyword_group_2.
- Return type
dict
systematic_review.string_manipulation module¶
Module: string_manipulation This module contains functions related to string case changes, preprocessing, and removing parts of strings.
- systematic_review.string_manipulation.convert_string_to_lowercase(string: str) str [source]¶
Lowercase the given input string.
- Parameters
string (str) – The string which might have uppercase characters in it.
- Returns
This is all lowercase character string.
- Return type
str
- systematic_review.string_manipulation.pdf_filename_from_filepath(article_path: str) str [source]¶
This takes the PDF path as input and cleans the PDF’s name by applying the preprocess function from the string_manipulation module.
- Parameters
article_path (str) – This is the path of the pdf file.
- Returns
This is the cleaned filename of the pdf.
- Return type
str
- systematic_review.string_manipulation.preprocess_string(string: str) str [source]¶
Replaces symbols in the string with spaces and lowercases the given input string. Example - ‘Df%$df’ -> ‘df df’
- Parameters
string (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.
- Returns
This is cleaned string from symbols and contains only alpha characters.
- Return type
str
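The behavior described above can be sketched with a regular expression; this is an illustrative re-implementation, and the exact set of characters treated as "symbols" (here, everything non-alphanumeric) is an assumption.

```python
import re

def preprocess_string(string):
    # Replace each run of non-alphanumeric characters with a space, then lowercase.
    return re.sub(r"[^A-Za-z0-9]+", " ", string).lower()

cleaned = preprocess_string("Df%$df")  # -> 'df df'
```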
- systematic_review.string_manipulation.preprocess_string_to_space_separated_words(string: str) str [source]¶
Replaces symbols in the string with spaces, lowercases the given input string (Example - ‘Df%$df’ -> ‘df df’), and collapses any repeated spaces so that the words are single-spaced.
- Parameters
string (str) – This can contain string words mixed with spaces and symbols.
- Returns
the words arranged as a single-space-separated string, with symbols and extra spaces removed.
- Return type
str
- systematic_review.string_manipulation.remove_non_ascii(string_list: list) list [source]¶
Remove non-ASCII characters from a list of tokenized words
- Parameters
string_list (list) – this list contains the words which contain non-ASCII characters
- Returns
this is modified list after removing the non-ASCII characters
- Return type
list
- systematic_review.string_manipulation.replace_symbols_with_space(string: str) str [source]¶
replace symbols in string with spaces. Example - ‘df%$df’ -> ‘df df’
- Parameters
string (str) – This is input word string which contains unwanted symbols.
- Returns
This is the string with symbols replaced by spaces.
- Return type
str
- systematic_review.string_manipulation.split_preprocess_string(text: str) list [source]¶
This splits the words into list after applying preprocess function from string_manipulation module.
- Parameters
text (str) – This is input word string which contains unwanted symbols and might have uppercase characters in it.
- Returns
This is cleaned list of strings from symbols and contains only alpha characters.
- Return type
list
- systematic_review.string_manipulation.split_words_remove_duplicates(string_list: list) list [source]¶
This function takes a list of words or sentences and splits them into individual words. It also removes any repeated word from the list.
- Parameters
string_list (list) – this is the input list which contains words and group of words inside. Example - [‘one’, ‘one two’]
- Returns
this is the output list which contains only unique individual words using set(). Example - [‘one’, ‘two’]
- Return type
list
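A minimal sketch of the described split-and-deduplicate behavior (illustrative only; note that using set(), as the docstring mentions, makes the output order unspecified):

```python
def split_words_remove_duplicates(string_list):
    # Split every entry into words and deduplicate with a set.
    unique_words = set()
    for entry in string_list:
        unique_words.update(entry.split())
    return list(unique_words)

words = split_words_remove_duplicates(["one", "one two"])
```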
- systematic_review.string_manipulation.string_dict_to_lower(string_map: dict) dict [source]¶
This converts the dict values to lowercase. A similar function for lists is available as string_list_to_lower().
- Parameters
string_map (dict) – these are key:values pairs needed to be converted.
- Returns
output by converting input to key: lowercase values.
- Return type
dict
- systematic_review.string_manipulation.string_list_to_lower(string_list: list) list [source]¶
This converts the list values to lowercase. A similar function for dicts is available as string_dict_to_lower().
- Parameters
string_list (list) – this list contains input string need to be converted to lowercase.
- Returns
this is the output list which contains original input strings but in lowercase
- Return type
list
- systematic_review.string_manipulation.string_to_space_separated_words(text: str) str [source]¶
takes text string and outputs space separated words.
- Parameters
text (str) – This text contains multiple spaces or trailing whitespaces
- Returns
This is space separated word string with no trailing whitespaces.
- Return type
str
- systematic_review.string_manipulation.strip_string_from_right_side(string: str, value_to_be_stripped: str = '.pdf') str [source]¶
Function removes the substring from the right of string.
- Parameters
string (str) – This is the complete word or string. Example - ‘monster.pdf’
value_to_be_stripped (str) – This is the value which is needed to be removed from right side. Example - ‘.pdf’
- Returns
This is the trimmed string that contains the left part after some part removed from the right. Example - ‘monster’
- Return type
str
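The right-side strip described above differs from str.rstrip (which strips a character set, not a substring); a minimal sketch of the documented behavior:

```python
def strip_string_from_right_side(string, value_to_be_stripped=".pdf"):
    # Remove a trailing substring; str.rstrip would strip a character set instead.
    if string.endswith(value_to_be_stripped):
        return string[: -len(value_to_be_stripped)]
    return string

name = strip_string_from_right_side("monster.pdf")  # -> 'monster'
```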
- systematic_review.string_manipulation.text_manipulation_methods(text: str, text_manipulation_method_name: str = 'preprocess_string', custom_text_manipulation_function: Optional[Callable[[str, Any, Any], str]] = None, *args, **kwargs) str [source]¶
This converts text using options like preprocess or the NLP module functions; for more info, see each respective method’s implementation. args and kwargs are passed through to custom_text_manipulation_function.
- Parameters
kwargs (Dict[str, Any]) – These keyword arguments are passed to custom_text_manipulation_function
args (Tuple) – These positional arguments are passed to custom_text_manipulation_function
custom_text_manipulation_function (Callable[[str, Any, Any], str]) – This is an optional custom text manipulation function if you want to implement the preprocessing yourself. Pass it as custom_text_manipulation_function=function_name; it will take the text as its parameter, with no default preprocess_string operation.
text (str) – string type text which needs to be converted.
text_manipulation_method_name (str) – provides the option to use any text manipulation function: preprocess_string (default, applied before all other implemented functions), custom_text_manipulation_function (for supplying your own function to preprocess the text), nltk_remove_stopwords, pattern_lemma_or_lemmatize_text, nltk_word_net_lemmatizer, nltk_porter_stemmer, nltk_lancaster_stemmer, spacy_lemma, nltk_remove_stopwords_spacy_lemma, convert_string_to_lowercase, preprocess_string_to_space_separated_words
- Returns
this return the converted text
- Return type
str
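The dispatch pattern such a method selector typically uses can be sketched as follows. This is not the package’s actual implementation: the default method here is convert_string_to_lowercase rather than preprocess_string, and the two table entries stand in for the full method list above.

```python
def text_manipulation_methods(text,
                              text_manipulation_method_name="convert_string_to_lowercase",
                              custom_text_manipulation_function=None,
                              *args, **kwargs):
    # A custom function, if given, takes precedence; args/kwargs are forwarded to it.
    if custom_text_manipulation_function is not None:
        return custom_text_manipulation_function(text, *args, **kwargs)
    # Dispatch table mapping method names to callables (stand-ins for the real list).
    methods = {
        "convert_string_to_lowercase": str.lower,
        "strip_whitespace": str.strip,  # hypothetical stand-in entry
    }
    return methods[text_manipulation_method_name](text)

lowered = text_manipulation_methods(" Hello World ")
custom = text_manipulation_methods("abc",
                                   custom_text_manipulation_function=lambda t: t.upper())
```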
systematic_review.validation module¶
Module: validation This module contains functions related to validating that our downloaded articles are the same as the ones we require. It also contains functions to get an article’s source name and to create lists of missed or duplicate articles.
- class systematic_review.validation.ValidateWordsInText(words_string: str, text_string: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]¶
Bases:
object
This checks words in given Text.
- exact_words_checker_in_text() bool [source]¶
This checks for exact match of substring in string and return True or False based on success.
- Returns
This returns True if exact words_string found in text_string else False.
- Return type
bool
- jumbled_words_percentage_checker_in_text() tuple [source]¶
Starts calculating the percentage once half of the words are found in sequence. This also takes into consideration words which got jumbled up during the PDF reading operation.
- Returns
This returns True if enough of words_string is found in text_string, else False. This also returns the matched substring percentage.
- Return type
tuple
- multiple_methods() tuple [source]¶
This method uses different methods to validate the article_name (substring) in the text. Example - exact_words, words_percentage, jumbled_words_percentage.
- Returns
True or False value, with True depicting a validated article. This also shows the percentage matched. Last, it shows the method used: exact_words, words_percentage, jumbled_words_percentage, or all if every method was executed to validate.
- Return type
tuple
- words_percentage_checker_in_text() tuple [source]¶
This checks the words of the substring against the string and returns True or False based on success, along with the matched word percentage. Note: words_percentage_checker_in_text_validation_limit doesn’t work properly if words_string has duplicate words.
- Returns
This returns True if exact words_string found in text_string else False. This also returns matched substring percentage.
- Return type
tuple
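A simplified sketch of the word-percentage check (illustrative only; the package’s actual logic, e.g. its handling of word order and jumbling, is more involved):

```python
def words_percentage_checker(words_string, text_string, validation_limit=70):
    # Percentage of the title's words that also occur in the text's word set.
    words = words_string.split()
    text_words = set(text_string.split())
    matched = sum(1 for word in words if word in text_words)
    percentage = matched * 100 / len(words)
    # Tuple mirrors the documented return: validated flag plus matched percentage.
    return percentage >= validation_limit, percentage

ok, percent = words_percentage_checker("deep learning risk model",
                                       "a deep learning model for credit risk")
```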
- class systematic_review.validation.Validation(citations_data: Union[List[dict], pandas.core.frame.DataFrame], parents_directory_of_research_papers_files: str, text_file_path_of_inaccessible_research_papers: Optional[str] = None, text_manipulation_method_name: str = 'preprocess_string_to_space_separated_words', words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2)[source]¶
Bases:
object
This is used to validate the downloaded files.
- add_downloaded_flag_column_and_file_location_column()[source]¶
add empty columns based on research_paper_file_location_column_name and download_flag_column_name
- Returns
data contains new columns.
- Return type
List[dict]
- check()[source]¶
Executes the validation of research articles in citation data by checking the research paper files and validating if the research articles are correct.
- Returns
data contains columns with downloaded, validation method and file path columns
- Return type
List[dict]
- cleaned_article_column_name = 'cleaned_title'¶
- download_flag_column_name = 'downloaded'¶
- file_invalidated_flag_name = 'wrong'¶
- file_manual_check_flag_name = 'unreadable'¶
- file_name_and_path_dict()[source]¶
contains the mapping of filenames to file paths
- Returns
keys are filenames and values are file paths.
- Return type
dict
- file_not_accessible_flag_name = 'no access'¶
- file_not_downloaded_flag_name = 'no'¶
- file_validated_flag_name = 'yes'¶
- get_dataframe()[source]¶
Outputs the pandas.DataFrame containing validation results of input data.
- Returns
This is the dataframe which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.
- Return type
pandas.DataFrame
- get_records_list() List[Dict[str, Any]] [source]¶
Outputs the records list containing validation results of input data.
- Returns
This is the list of records which contains validation Flags column downloaded with values- “yes”, “no”, “wrong”, “no access”, “unreadable” and file location column if downloaded column contains “yes”.
- Return type
List[Dict[str, Any]]
- info()[source]¶
Equivalent to pandas.DataFrame.value_counts(); it returns a list with the count of each unique element in the column
- Returns
unique download_flag_column_name elements with counts
- Return type
object
- research_paper_file_location_column_name = 'file location'¶
- to_csv(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to csv file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .csv extension
index (bool) – Define if index is needed in output csv file or not.
- to_excel(output_filename: Optional[str] = 'output.csv', index: bool = True)[source]¶
This function saves pandas.DataFrame to excel file.
- Parameters
output_filename (str) – This is the name of the output file, which should contain the .xlsx extension
index (bool) – Define if index is needed in output excel file or not.
- validation_manual_method_name = 'manual'¶
- validation_method_column_name = 'validation method'¶
- systematic_review.validation.add_dict_element_with_count(dictionary: dict, key: str) dict [source]¶
Increments the value stored under the given key of the dictionary, or initialises a new key with value 1 — the same behaviour as updating a collections.Counter by one key.
- Parameters
dictionary (dict) – This is the dictionary where we want to add element.
key (str) – This is the key of dictionary {key: value}
- Returns
This is the updated dict with the new element counts.
- Return type
dict
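The single-key counting behaviour described above can be sketched as follows (an illustrative reimplementation, not the package's own code):

```python
def add_dict_element_with_count(dictionary: dict, key: str) -> dict:
    """Increment the count stored under key, starting new keys at 1."""
    dictionary[key] = dictionary.get(key, 0) + 1
    return dictionary

counts: dict = {}
for word in ["fruit", "fly", "fruit"]:
    counts = add_dict_element_with_count(counts, word)
# counts == {"fruit": 2, "fly": 1}
```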
- systematic_review.validation.amount_by_percentage(number: float, percentage: float) float [source]¶
Returns the given percentage of a number. Example: 5% (percentage) of 10 (number) is 0.5 (result).
- Parameters
number (float) – This is the input number from which we want some percent amount
percentage (float) – This is equivalent to math percentage.
- Returns
This is resultant number.
- Return type
float
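The arithmetic matching the docstring example (5% of 10 is 0.5) is just a percentage scaling — a minimal sketch for illustration:

```python
def amount_by_percentage(number: float, percentage: float) -> float:
    """Return `percentage` percent of `number`."""
    return number * percentage / 100

result = amount_by_percentage(10, 5)
# result == 0.5, matching the docstring example
```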
- systematic_review.validation.calculate_percentage(value: float, total: float) float [source]¶
Calculates what percentage value is of total.
- Parameters
value (float) – It is input number, normally smaller than total.
total (float) – It is the larger number from which we want to know percentage
- Returns
This is the calculated percentage, e.g. 98.36065573770492, i.e. about 98.36%.
- Return type
float
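The inverse of the previous helper — the formula this function describes, as a self-contained sketch:

```python
def calculate_percentage(value: float, total: float) -> float:
    """Return what percentage `value` is of `total`."""
    return value / total * 100

pct = calculate_percentage(60, 61)
# pct is approximately 98.36065573770492, the value used
# in the docstring example (about 98.36%)
```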
- systematic_review.validation.compare_two_dict_members_via_percent_similarity(first_dict: dict, second_dict: dict) float [source]¶
Compare elements in 2 dictionaries and return percentage similarity.
- Parameters
first_dict (dict) – Example - first_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1}
second_dict (dict) – Example - second_dict = {‘mixed’:1, ‘modified’:1, ‘fruit’:1, ‘fly’:1, ‘optimization’:1, ‘algorithm’: 1}
- Returns
This is a percentage represented as a decimal number, e.g. 98.36065573770492, i.e. about 98.36%.
- Return type
float
- systematic_review.validation.compare_two_list_members_via_percent_similarity(words_list: list, boolean_membership_list: list) float [source]¶
Compare elements in 2 lists and return percentage similarity.
- Parameters
words_list (list) – This contains elements whose elements to be checked for similarity.
boolean_membership_list (list) – This list contains True and False values.
- Returns
This is a percentage represented as a decimal number, e.g. 98.36065573770492, i.e. about 98.36%.
- Return type
float
- systematic_review.validation.deep_validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple [source]¶
Produces a list of matched rows and a list of unmatched rows by comparing the given column from both record lists.
- Parameters
second_column_name (str) – This is the name of column which contain pdf article title.
first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name
second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains second_column_name
first_column_name (str) – This is the name of column which contain citation title.
- Returns
matched_list - contains the rows whose column values matched in both data objects. unmatched_list - contains the rows whose column values did not match.
- Return type
tuple
- systematic_review.validation.dict_from_list_with_element_count(input_list)[source]¶
Puts the input list elements into a dictionary with their counts.
- Parameters
input_list (list) – This is the list with elements with some duplicates present.
- Returns
This is a dictionary with the list elements as keys and each element's occurrence count as the value.
- Return type
dict
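This element-to-count mapping is what the standard library's collections.Counter produces directly — a sketch for illustration, not the package's own implementation:

```python
from collections import Counter

def dict_from_list_with_element_count(input_list: list) -> dict:
    """Map each list element to the number of times it occurs."""
    return dict(Counter(input_list))

dict_from_list_with_element_count(["fly", "fruit", "fly"])
# {"fly": 2, "fruit": 1}
```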
- systematic_review.validation.exact_words_checker_in_text(words_string: str, text_string: str) bool [source]¶
Checks for an exact match of the substring in the string and returns True or False accordingly.
- Parameters
words_string (str) – This is the word we are searching for.
text_string (str) – This is query string or lengthy text.
- Returns
This returns True if exact words_string found in text_string else False.
- Return type
bool
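An exact-substring check like this can be sketched with plain containment (an illustrative reimplementation; how the package normalises case or whitespace is not documented here, so none is assumed):

```python
def exact_words_checker_in_text(words_string: str, text_string: str) -> bool:
    """True when words_string appears verbatim inside text_string."""
    return words_string in text_string

exact_words_checker_in_text("fruit fly", "the modified fruit fly algorithm")  # True
exact_words_checker_in_text("fruit bat", "the modified fruit fly algorithm")  # False
```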
- systematic_review.validation.finding_missed_articles_from_downloading(validated_pdf_list: list, original_articles_list: list) tuple [source]¶
Checks how many articles are not downloaded yet from original list of articles.
- Parameters
validated_pdf_list (list) – Contains name of pdf files whose filename is in the pdf text.
original_articles_list (list) – This is original list from where we started downloading the articles.
- Returns
missing_articles - articles that were missed during downloading. validated_articles - the list of validated downloaded articles.
- Return type
tuple
- systematic_review.validation.get_dataframe_column_as_list(dataframe: pandas.core.frame.DataFrame, column_name: str = 'primary_title')[source]¶
Get pandas dataframe column values as list.
- Parameters
dataframe (pd.DataFrame) – This is the dataframe which contains column whose details we want as list.
column_name (str) – This is the name of the column.
- Returns
This is the list containing the dataframe one column values.
- Return type
list
- systematic_review.validation.get_missed_articles_dataframe(filter_sorted_citations_df: pandas.core.frame.DataFrame, downloaded_articles_path: str, title_column_name: str = 'cleaned_title') list [source]¶
Returns the list of articles missed during downloading, found by checking the original article list from filter_sorted_citations_df against the downloaded articles path.
- Parameters
title_column_name (str) – The name of the column that contains the article name.
filter_sorted_citations_df (pd.DataFrame) – This dataframe contains records of selected articles including name of articles.
downloaded_articles_path (str) – contains parent folder of all the downloaded articles files.
- Returns
list of the missed articles from downloading.
- Return type
list
- systematic_review.validation.get_missed_original_articles_list(original_article_list: list, downloaded_article_list: list) list [source]¶
Checks the elements of original_article_list against downloaded_article_list and returns the list of missed articles.
- Parameters
original_article_list (list) – This list elements are checked if they are present in other list.
downloaded_article_list (list) – This list is checked for whether it contains the elements of the other list.
- Returns
This contains missing elements of original_article_list in downloaded_article_list.
- Return type
list
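The set-difference logic described above, preserving the original list's order — a minimal sketch, not the package's own code:

```python
def get_missed_original_articles_list(original_article_list: list,
                                      downloaded_article_list: list) -> list:
    """Articles from the original list that are absent from the downloaded list."""
    downloaded = set(downloaded_article_list)
    return [article for article in original_article_list
            if article not in downloaded]

get_missed_original_articles_list(["a1", "a2", "a3"], ["a2"])
# ["a1", "a3"]
```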
- systematic_review.validation.getting_article_paths_from_validation_detail(list_of_validation: list) list [source]¶
Getting the first element from list of lists.
- Parameters
list_of_validation (list) – This list contains lists of three values, where the first is the article path.
- Returns
This output list contains the articles paths
- Return type
list
- systematic_review.validation.jumbled_words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70, wrong_word_limit: int = 2) tuple [source]¶
Starts calculating the match percentage once half of the words are found in sequence. It also accounts for words that got jumbled during the pdf-to-text reading operation.
- Parameters
words_string (str) – This is the word we are searching for.
text_string (str) – This is query string or lengthy text.
validation_limit (float) – This is the limit on the similarity of the checked substring, expressed as a percentage. Example: 50 will return True if half of the words match.
wrong_word_limit (int) – This is the limit unto which algorithm ignore the wrong word in sequence.
- Returns
Returns True if words_string is found in text_string within the limits, else False, along with the matched-substring percentage.
- Return type
tuple
- systematic_review.validation.manual_validating_of_pdf(articles_path_list: list, manual_index: int) tuple [source]¶
This is mostly a manually used function to validate the remaining pdfs at the end of the validation process. It makes it easy to search for and validate each pdf and store the result in a list. Advice: save these lists as text files using a function from the converter module to avoid data loss.
- Parameters
articles_path_list (list) – The list of articles that slipped past the automated screening and validation algorithms, mostly due to pdf-to-text conversion errors.
manual_index (int) – The index in articles_path_list from which to start checking. Useful when validating across multiple sessions.
- Returns
external_validation_list - This is the list to be saved externally for validated articles. external_invalidated_list - This is the list to be saved externally for invalidated articles.
- Return type
tuple
- systematic_review.validation.multiple_methods_validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple [source]¶
This function takes the name of the file and searches for it in the text of the pdf file. If found, the pdf is validated as downloaded; otherwise it is marked as not downloaded. Example - pdf file name -> check in -> text of pdf file. pdf_reader options are pdftotext or pymupdf.
- Parameters
pdf_reader (str) – This is python pdf reader package which convert pdf to text.
pdf_file_path (str) – the path of the pdf file.
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
- Returns
A True/False value (True means the article is validated), the percentage matched, and the text_manipulation_method_name used: exact_words, words_percentage, jumbled_words_percentage, or all if every method was executed during validation.
- Return type
tuple
- systematic_review.validation.multiple_methods_validating_words_string_in_text(article_name: str, text: str, words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_validation_limit: float = 70, jumbled_words_percentage_checker_in_text_wrong_word_limit: int = 2) tuple [source]¶
This function tries different text-manipulation methods to validate article_name (a substring) in text. Example - exact_words, words_percentage, jumbled_words_percentage.
- Parameters
jumbled_words_percentage_checker_in_text_wrong_word_limit (int) – This is the limit up to which the algorithm ignores wrong words in the sequence.
jumbled_words_percentage_checker_in_text_validation_limit (float) – This is the limit on the similarity of the checked substring, expressed as a percentage. Example: 50 will return True if half of the words match.
words_percentage_checker_in_text_validation_limit (float) – This is the limit on the similarity of the checked substring, expressed as a percentage. Example: 50 will return True if half of the words match.
article_name (str) – This is input string which we want to validate in text.
text (str) – This is query string or lengthy text.
- Returns
A True/False value (True means the article is validated), the percentage matched, and the text_manipulation_method_name used: exact_words, words_percentage, jumbled_words_percentage, or all if every method was executed during validation.
- Return type
tuple
- systematic_review.validation.similarity_sequence_matcher(string_a: str, string_b: str) float [source]¶
Returns the similarity ratio between two strings, e.g. 0.9836065573770492, i.e. about 98.36%.
- Parameters
string_a (str) – This is first string
string_b (str) – This is second string
- Returns
This is the result of SequenceMatcher, e.g. 0.9836065573770492, i.e. about 98.36%.
- Return type
float
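The name and return range point at difflib's SequenceMatcher ratio — a self-contained sketch of that behaviour for illustration:

```python
from difflib import SequenceMatcher

def similarity_sequence_matcher(string_a: str, string_b: str) -> float:
    """Similarity ratio in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, string_a, string_b).ratio()

similarity_sequence_matcher("fruit fly", "fruit fly")  # 1.0
```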
- systematic_review.validation.validate_column_details_between_two_record_list(first_list_of_dict: list, second_list_of_dict: list, first_column_name: str = 'cleaned_title', second_column_name: str = 'cleaned_title_pdf') tuple [source]¶
Produces a list of matched rows and a list of unmatched rows based on the given column from the first list of dicts. Note the emphasis on the first list: the function checks every record of the first list of dicts against the second list of dicts. The title column of second_list_of_dict is kept by merging it with the first.
- Parameters
second_column_name (str) – This is the name of column which contain pdf article title.
first_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains first_column_name
second_list_of_dict (list) – Iterable object pandas.DataFrame or list which contains second_column_name
first_column_name (str) – This is the name of column which contain citation title.
- Returns
matched_list - It contains column’s row which are matched in both data object. unmatched_list - It contains column’s row which are unmatched in both data object.
- Return type
tuple
- systematic_review.validation.validating_multiple_pdfs_via_filenames(list_of_pdf_files_path: list, pages: str = 'first', pdf_reader: str = 'pdftotext') tuple [source]¶
This function checks the pdf files in list_of_pdf_files_path and validates each one with the function ‘validating_pdf_via_filename’. Example - multiple pdf file names -> check in -> text of pdf files. pdf_reader options are pdftotext or pymupdf.
- Parameters
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
pdf_reader (str) – This is python pdf reader package which convert pdf to text.
list_of_pdf_files_path (list) – the list of the path of the pdf file.
- Returns
validated_pdf_list - contains the names of pdf files whose filename is in the pdf text. invalidated_pdf_list - names of files which could not be included in validated_pdf_list. manual_pdf_list - files which could not be opened using a python pdf reader, or raised errors while opening.
- Return type
tuple
- systematic_review.validation.validating_pdf_via_filename(pdf_file_path: str, pages: str = 'first', method: str = 'exact_words') bool [source]¶
This function takes the name of the file and searches for it in the text of the pdf file. If found, the pdf is validated as downloaded; otherwise it is marked as not downloaded. Example - pdf file name -> check in -> text of pdf file. method options are exact_words, words_percentage, or jumbled_words_percentage.
- Parameters
pdf_file_path (str) – the path of the pdf file.
pages (str) – This could be ‘all’ to get full text of pdf and ‘first’ for first page of pdf.
method (str) – This is the switch option to select text_manipulation_method_name from exact_words, words_percentage, jumbled_words_percentage.
- Returns
True and False value depicting validated article with True value.
- Return type
bool
- systematic_review.validation.validating_pdfs_using_multiple_pdf_reader(pdfs_parent_dir_path: str) tuple [source]¶
This function uses two python pdf readers, pdftotext and pymupdf, to validate whether each filename is present in the pdf file's text.
- Parameters
pdfs_parent_dir_path (str) – This is the parent directory of all the downloaded pdfs.
- Returns
validated_pdf_list - contains the names of pdf files whose filename is in the pdf text. invalidated_pdf_list - names of files which could not be included in validated_pdf_list. manual_pdf_list - files which could not be opened using a python pdf reader, or raised errors while opening.
- Return type
tuple
- systematic_review.validation.words_percentage_checker_in_text(words_string: str, text_string: str, validation_limit: float = 70) tuple [source]¶
Checks for a match of words_string in text_string and returns True or False, along with the matched-word percentage. Limitation: this does not work properly if words_string has duplicate words.
- Parameters
words_string (str) – This is the word we are searching for.
text_string (str) – This is query string or lengthy text.
validation_limit (float) – This is the limit on the similarity of the checked substring, expressed as a percentage. Example: 50 will return True if half of the words match.
- Returns
Returns True if words_string is found in text_string within the limit, else False, along with the matched-substring percentage.
- Return type
tuple
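The (is_valid, percentage) contract above can be sketched as follows. This is an illustrative reimplementation, not the package's code: it counts the words of words_string found anywhere in the text using a set, which also exhibits the documented limitation around duplicate words.

```python
def words_percentage_checker_in_text(words_string: str, text_string: str,
                                     validation_limit: float = 70) -> tuple:
    """Return (is_valid, percent) where percent is the share of words found."""
    words = words_string.split()
    text_words = set(text_string.split())
    matched = sum(1 for word in words if word in text_words)
    percentage = matched / len(words) * 100
    return percentage >= validation_limit, percentage

words_percentage_checker_in_text("mixed fruit fly", "a mixed fruit optimization")
# two of three words match, i.e. about 66.7%, below the default 70% limit
```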