KWICer: Producing an annotated bibliography from a set of PDFs by quantifying keywords
This code rapidly creates an annotated bibliography that helps users navigate and synthesize a large body of literature. Users supply a set of PDF or DOCX files that they have identified as relevant to their question, along with lists of relevant search terms. The code converts the input documents into TXT files, trims them to exclude extraneous text such as the references section of a scientific paper, and transforms the body of each document into tokens that are easily searchable. It then counts the number of times each supplied search term occurs in each source, identifies occurrences of North American states and provinces, and prints the results to a CSV file. Sorting this CSV file by the frequency of relevant search terms identifies the documents most likely to address specific aspects of a research question. The code also generates a set of figures that characterize the nature and content of the sources: 1) a map showing the number of sources that referenced each North American state or province, 2) a graph showing the number of sources that referenced different ecosystem types, and 3) a heatmap showing the number of sources that mention intersecting search terms. Users can customize their search term lists to apply the code to their own research question.
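The counting step described above can be illustrated with a minimal Python sketch. This is not KWICer's own code, and the release's implementation language is not stated here. The function names (`tokenize`, `count_terms`, `build_csv`), the sample file names, and the sample text are all hypothetical, and the tokenizer is a simple assumption (lowercase alphabetic words only).

```python
import csv
import io
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed scheme)."""
    return re.findall(r"[a-z]+", text.lower())


def count_terms(tokens, terms):
    """Count how often each search term, single or multi-word, occurs in a token list."""
    counts = {}
    for term in terms:
        term_tokens = tokenize(term)
        n = len(term_tokens)
        if n == 0:
            counts[term] = 0
            continue
        # Slide a window of the term's length across the document tokens.
        counts[term] = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == term_tokens
        )
    return counts


def build_csv(documents, terms):
    """Return CSV text with one row per source and one column per search term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source"] + list(terms))
    for name, text in documents.items():
        counts = count_terms(tokenize(text), terms)
        writer.writerow([name] + [counts[term] for term in terms])
    return buffer.getvalue()


# Hypothetical stand-ins for already-converted and trimmed TXT files.
documents = {
    "paper_a.txt": "Wetland restoration in Colorado. Wetland birds use restored wetland habitat.",
    "paper_b.txt": "Grassland fire in Nebraska and North Dakota grassland.",
}
terms = ["wetland", "grassland", "north dakota"]

table = build_csv(documents, terms)
print(table)
```

Sorting the resulting rows by a term's column would surface the sources that mention that term most often, which is the use the description suggests for the CSV file.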
Citation Information
| Publication Year | 2026 |
|---|---|
| Title | KWICer: Producing an annotated bibliography from a set of PDFs by quantifying keywords |
| DOI | 10.5066/P1476GUY |
| Authors | Lydia N Bailey, Dana M Varner, Sarah E Whipple |
| Product Type | Software Release |
| Record Source | USGS Asset Identifier Service (AIS) |
| USGS Organization | Fort Collins Science Center |
| Rights | This work is licensed under CC BY 4.0 |