Prompt Detail:
To modify the script to compress the text extracted from various file types in a given folder, you can use the following approach:
- Use libraries such as `docx2txt`, `PyPDF2`, and other parsers to read the contents of different file types.
- Create a function that takes a file path as input, reads the file contents, and returns the extracted text.
- Traverse the folder structure and process each file based on its extension, calling the appropriate reader function.
- For each file, call the `compress_text_gpt4` function and store the compressed content.
Here's a modified version of the script to handle various file types in a given folder:
```python
import openai
import os
import docx2txt
import PyPDF2
# Add other parsers for different file types as needed

# compress_text_gpt4() is not defined here; it is assumed to come from the
# original script that this version modifies.


def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()


def read_docx_file(file_path):
    return docx2txt.process(file_path)


def read_pdf_file(file_path):
    # PyPDF2 3.x API: PdfReader, .pages, and extract_text() replace the
    # removed PdfFileReader, numPages, getPage(), and extractText().
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        text = ''
        for page in reader.pages:
            text += page.extract_text() or ''
    return text

# Add other reader functions for different file types as needed


def process_files(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)

            # Determine file type and read contents
            _, file_extension = os.path.splitext(file)
            file_extension = file_extension.lower()

            if file_extension == ".txt":
                content = read_txt_file(file_path)
            elif file_extension == ".docx":
                content = read_docx_file(file_path)
            elif file_extension == ".pdf":
                content = read_pdf_file(file_path)
            # Add more file types as needed
            else:
                continue  # Skip unsupported file types

            compressed_content = compress_text_gpt4(text=content)

            print(f"Original Text ({file}):")
            print(content)
            print(f"\nCompressed Text ({file}):")
            print(compressed_content)
            print("\n" + "-" * 80 + "\n")


# Example usage
if __name__ == "__main__":
    target_directory = "path/to/your/folder"
    process_files(target_directory)
```
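As more file types are added, the chain of `elif` branches in `process_files` keeps growing. One possible alternative, sketched here on the assumption that every reader takes a file path and returns a string, is a dictionary that maps extensions to reader functions. The names `READERS`, `read_supported_file`, and `iter_supported_files` are hypothetical and not part of the script above; only the plain-text reader is included so the sketch runs with the standard library alone.

```python
import os


def read_txt_file(file_path):
    """Same plain-text reader as in the script above."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


# Hypothetical registry: lowercase extension -> reader function.
# The .docx and .pdf readers from the script would be registered the same way.
READERS = {
    ".txt": read_txt_file,
    # ".docx": read_docx_file,
    # ".pdf": read_pdf_file,
}


def read_supported_file(file_path):
    """Return the extracted text, or None if the extension is not registered."""
    _, extension = os.path.splitext(file_path)
    reader = READERS.get(extension.lower())
    if reader is None:
        return None
    return reader(file_path)


def iter_supported_files(directory):
    """Yield (path, text) for every supported file under directory."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            text = read_supported_file(path)
            if text is not None:
                yield path, text
```

With this layout, supporting a new file type means adding one entry to `READERS`, and the traversal code does not change.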
Remember to install the required libraries using pip:
```bash
pip install docx2txt PyPDF2
```
This script processes every supported file in the specified folder and its subfolders, and prints the original and compressed content for each one. You can modify the script to store the compressed content in files or perform other tasks as needed. Note that each file triggers its own call to `compress_text_gpt4`, so processing a large folder can use a large number of API requests and take a long time to complete.
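As a rough illustration of storing the compressed content in files, the sketch below writes one output file per input and skips inputs whose output already exists, so re-running over the same folder does not repeat the work. The function name `save_compressed`, the `output_dir` parameter, and the `.compressed.txt` suffix are assumptions made for this example. The compression step is passed in as a callable, so the sketch itself makes no API calls.

```python
import os


def save_compressed(items, compress, output_dir):
    """Write one compressed file per input, skipping ones already on disk.

    items: iterable of (source_path, text) pairs
    compress: any callable taking text and returning a string, for example
              lambda t: compress_text_gpt4(text=t)
    Returns the number of times compress was actually called.
    """
    os.makedirs(output_dir, exist_ok=True)
    calls = 0
    for source_path, text in items:
        # Hypothetical naming scheme: report.pdf -> report.pdf.compressed.txt
        out_name = os.path.basename(source_path) + ".compressed.txt"
        out_path = os.path.join(output_dir, out_name)
        if os.path.exists(out_path):
            continue  # already written on a previous run; nothing to compress
        compressed = compress(text)
        calls += 1
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(compressed)
    return calls
```

Because outputs are named after the source file's base name only, two inputs with the same name in different subfolders would collide; a real version would probably need to mirror the folder structure or include part of the path in the output name.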