Quickstart
This page documents a Google Colab script that processes files, generates embeddings, and uploads them to a FaithCopilot server, with support for two vector storage providers: Pinecone and Upstash.
1. Setup and Installation
The script starts by mounting Google Drive, where the source files are stored. It then installs the necessary Python packages: python-docx for DOCX handling, PyMuPDF for PDF parsing, and Whisper for audio transcription.
```python
from google.colab import drive
drive.mount('/content/drive')
```

```python
%%capture
%pip install -q python-docx
%pip install -q PyMuPDF
%pip install -q git+https://github.com/openai/whisper.git
```

(`%%capture` must be the first line of its cell, so the installs run in a separate cell from the Drive mount.)
2. Imports and Configurations
The script imports necessary Python libraries for various tasks such as file handling, HTTP requests, document processing, and model loading.
2.1 Imported Libraries
- `os`, `shutil`, `re`, `json`, `time` - general utilities for file and string operations.
- `requests` - HTTP library for API requests.
- `fitz` (PyMuPDF) - PDF processing library.
- `whisper` - automatic speech recognition (ASR) model by OpenAI.
- `subprocess` - used for running system commands.
- `tqdm` - progress bars for loops.
- `Document` from `python-docx` - processing DOCX files.
2.2 Configuration Variables
The configuration includes metadata for vector storage providers, API keys, file paths, and batch processing settings.
```python
VECTOR_STORE_PROVIDER = 'pinecone'  # Can be 'pinecone' or 'upstash'
INDEX_NAME = 'my-index'
FC_NAME = 'My Sermon Notes'
DIRECTORY_PATH = '/content/drive/MyDrive/Datastore Test/My Sermon Notes'
FC_TAGS = ['sermons', 'notes', 'english']
BATCH_SIZE = 50
index_metadata = {'language': 'en'}
NAMESPACE = FC_NAME.lower().replace(' ', '-')  # e.g. 'my-sermon-notes'
```
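The simple lower/replace works for names like `'My Sermon Notes'`, but a name containing punctuation would produce an untidy namespace. A sketch of a more defensive variant (the `to_namespace` helper is illustrative, not part of the script):

```python
import re

def to_namespace(name):
    # Lowercase the name, collapse every run of characters that is not
    # a letter or digit into a single hyphen, and trim stray hyphens.
    return re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')

print(to_namespace('My Sermon Notes'))   # my-sermon-notes
print(to_namespace("Pastor's Q&A #3"))   # pastor-s-q-a-3
```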
3. Temporary Folder Creation
The script creates a temporary folder to store intermediate files for processing. If the folder already exists, it will be removed and re-created.
```python
current_dir = os.getcwd()
TEMP_FOLDER_PATH = os.path.join(current_dir, 'temp')
if os.path.exists(TEMP_FOLDER_PATH):
    shutil.rmtree(TEMP_FOLDER_PATH)
os.makedirs(TEMP_FOLDER_PATH, exist_ok=True)
```
4. API Tokens and Keys
The script uses multiple external APIs (FaithCopilot, Cloudflare, Pinecone, and Upstash) which require API keys for authentication. These keys are stored in variables and masked in the documentation for security reasons.
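One way to keep these keys out of the notebook text is to read them from environment variables (or Colab's Secrets panel). The variable names below mirror those used later in this document; the `'***'` fallbacks are placeholders, not working credentials:

```python
import os

# Placeholder fallbacks ('***') stand in for real keys; in practice,
# set these as environment variables or Colab secrets.
FC_TOKEN = os.environ.get('FC_TOKEN', '***')
CF_TOKEN = os.environ.get('CF_TOKEN', '***')
CF_ACCOUNT_ID = os.environ.get('CF_ACCOUNT_ID', '***')
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', '***')
UPSTASH_API_KEY = os.environ.get('UPSTASH_API_KEY', '***')
```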
5. Helper Functions
5.1 Pinecone Functions
Functions to interact with Pinecone API, including retrieving the index list, creating indexes, and upserting vectors to Pinecone.
```python
def get_pinecone_index_list():
    headers = {'Api-Key': PINECONE_API_KEY, 'X-Pinecone-API-Version': '2024-07'}
    response = requests.get('https://api.pinecone.io/indexes', headers=headers)
    return response.json()
```
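The upsert side is not shown above. A minimal sketch, assuming Pinecone's REST data plane where vectors are POSTed to the index host; `build_upsert_payload` and `upsert_to_pinecone` are illustrative names, not functions from the script:

```python
PINECONE_API_KEY = '***'  # defined earlier in the script

def build_upsert_payload(vectors, namespace):
    # `vectors` is a list of {'id': ..., 'values': [...], 'metadata': {...}}
    return {'vectors': vectors, 'namespace': namespace}

def upsert_to_pinecone(index_host, vectors, namespace):
    import requests  # local import so the payload helper runs without it
    headers = {'Api-Key': PINECONE_API_KEY,
               'X-Pinecone-API-Version': '2024-07'}
    response = requests.post(f'https://{index_host}/vectors/upsert',
                             headers=headers,
                             json=build_upsert_payload(vectors, namespace))
    return response.json()
```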
5.2 Upstash Functions
Functions to interact with Upstash API, similar to the Pinecone functions, handling index creation and vector upserting.
```python
def get_upstash_index_list():
    response = requests.get('https://api.upstash.com/v2/vector/index/',
                            auth=(UPSTASH_EMAIL_OR_ID, UPSTASH_API_KEY))
    return response.json()
```
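A sketch of the upserting half, assuming Upstash Vector's REST data API, where each index exposes its own endpoint and Bearer token (the URL, token, and function names below are placeholders, not values from the script):

```python
UPSTASH_VECTOR_REST_URL = 'https://example-vector.upstash.io'  # placeholder
UPSTASH_VECTOR_REST_TOKEN = '***'                              # placeholder

def build_upstash_vectors(vectors):
    # Upstash uses a 'vector' (singular) field for the values.
    return [{'id': v['id'], 'vector': v['values'],
             'metadata': v.get('metadata', {})} for v in vectors]

def upsert_to_upstash(vectors):
    import requests  # local import so the helper above runs without it
    headers = {'Authorization': f'Bearer {UPSTASH_VECTOR_REST_TOKEN}'}
    response = requests.post(f'{UPSTASH_VECTOR_REST_URL}/upsert',
                             headers=headers,
                             json=build_upstash_vectors(vectors))
    return response.json()
```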
5.3 Metadata Helper Functions
These functions extract metadata from different file types, such as blog posts, sermons, and documents.
```python
def get_metadata(content):
    lines = content.split('\n')
    title = lines[0]
    url = lines[1]
    metadata = {'title': title, 'url': url}
    return metadata
```
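For example, for a file whose first two lines are its title and source URL (the sample values here are made up, and the function is restated so the example runs on its own):

```python
def get_metadata(content):
    # First line is the title, second line is the URL.
    lines = content.split('\n')
    return {'title': lines[0], 'url': lines[1]}

sample = ("Walking in Faith\n"
          "https://example.com/sermons/walking-in-faith\n"
          "Sermon body text...")
print(get_metadata(sample))
# {'title': 'Walking in Faith', 'url': 'https://example.com/sermons/walking-in-faith'}
```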
5.4 Miscellaneous Helper Functions
These include batch processing, namespace generation, and file embedding extraction.
```python
def get_batches(object_to_batch, batch_size=100):
    if isinstance(object_to_batch, dict):
        keys = list(object_to_batch.keys())
        return [{key: object_to_batch[key] for key in keys[i:i + batch_size]}
                for i in range(0, len(keys), batch_size)]
    elif isinstance(object_to_batch, list):
        return [object_to_batch[i:i + batch_size]
                for i in range(0, len(object_to_batch), batch_size)]
```
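With `BATCH_SIZE = 50` from the configuration above, a list of 120 items becomes three batches of 50, 50, and 20 (function restated so the example is self-contained):

```python
def get_batches(object_to_batch, batch_size=100):
    # Split a dict or list into consecutive chunks of at most batch_size.
    if isinstance(object_to_batch, dict):
        keys = list(object_to_batch.keys())
        return [{k: object_to_batch[k] for k in keys[i:i + batch_size]}
                for i in range(0, len(keys), batch_size)]
    elif isinstance(object_to_batch, list):
        return [object_to_batch[i:i + batch_size]
                for i in range(0, len(object_to_batch), batch_size)]

batches = get_batches(list(range(120)), batch_size=50)
print([len(b) for b in batches])  # [50, 50, 20]
```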
6. File Processing and Transcription
The script supports different file types such as TXT, PDF, DOCX, and audio files (MP3/WAV). For audio files, Whisper transcribes the speech to text.
```python
def voice_to_plaintext(audio_file_path):
    model = whisper.load_model("base")
    result = model.transcribe(audio_file_path)
    return result['text'] if result else False
```
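The dispatch by file extension can be sketched as a lookup table. The handler names below are illustrative; only `voice_to_plaintext` appears in the script itself:

```python
import os

# Hypothetical dispatch table: extension -> name of the handler used.
HANDLERS = {
    '.txt': 'read_text',
    '.pdf': 'pdf_to_plaintext',
    '.docx': 'docx_to_plaintext',
    '.mp3': 'voice_to_plaintext',
    '.wav': 'voice_to_plaintext',
}

def handler_for(path):
    # Case-insensitive lookup; returns None for unsupported types.
    ext = os.path.splitext(path)[1].lower()
    return HANDLERS.get(ext)

print(handler_for('sermon-01.MP3'))  # voice_to_plaintext
```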
7. Vector Store Selection
The script checks whether Pinecone or Upstash is being used for vector storage, and fetches the index host accordingly. If no index is found, it creates one.
```python
if VECTOR_STORE_PROVIDER == 'pinecone':
    INDEX_HOST = get_pinecone_index_host(INDEX_NAME)
elif VECTOR_STORE_PROVIDER == 'upstash':
    INDEX_HOST = get_upstash_index_host(INDEX_NAME)
else:
    raise ValueError(f"Unknown vector store provider: {VECTOR_STORE_PROVIDER}")
```
8. Cloudflare Embeddings
For each file, embeddings are generated with the BAAI `bge-base-en-v1.5` model via Cloudflare Workers AI. If the API call fails, the script retries the request.
```python
def get_cloudflare_embeddings(text):
    headers = {"Authorization": f"Bearer {CF_TOKEN}"}
    url = (f"https://api.cloudflare.com/client/v4/accounts/{CF_ACCOUNT_ID}"
           "/ai/run/@cf/baai/bge-base-en-v1.5")
    response = requests.post(url, headers=headers, json={"text": text})
    return response.json()['result']['data']
```
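The retry behavior mentioned above is not shown in the snippet. One common shape is a small wrapper that re-calls the function after a pause; `with_retries` is a sketch, not the script's actual implementation:

```python
import time

def with_retries(fn, attempts=3, delay=2):
    # Call fn(); on exception, wait `delay` seconds and try again,
    # re-raising the error after the final attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Usage: embeddings = with_retries(lambda: get_cloudflare_embeddings(text))
```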
9. Upload to FaithCopilot
The processed files, along with their metadata and embeddings, are uploaded to the FaithCopilot server via an API call.
```python
def upload_to_faithcopilot(files, index_metadata):
    faithcopilot_data = {'name': FC_NAME,
                         'tags': json.dumps(FC_TAGS),
                         'metadata': json.dumps(index_metadata)}
    headers = {"Authorization": f"Bearer {FC_TOKEN}"}
    response = requests.post(FC_ENDPOINT, headers=headers,
                             files=files, data=faithcopilot_data)
    return response
```
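Note that `tags` and `metadata` travel as JSON-encoded strings inside the multipart form, not as nested JSON. A self-contained sketch of that encoding (`build_upload_data` is an illustrative name, and the config values are restated from above):

```python
import json

FC_NAME = 'My Sermon Notes'                # from the configuration above
FC_TAGS = ['sermons', 'notes', 'english']  # from the configuration above

def build_upload_data(index_metadata):
    # Each list/dict field is serialized to a JSON string for the form.
    return {'name': FC_NAME,
            'tags': json.dumps(FC_TAGS),
            'metadata': json.dumps(index_metadata)}

data = build_upload_data({'language': 'en'})
print(data['tags'])  # ["sermons", "notes", "english"]
```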
10. Cleanup and Script Completion
After processing, the temporary folder is deleted and the script execution time is printed.
```python
shutil.rmtree(TEMP_FOLDER_PATH)
script_end_time = time.time()
# script_start_time is assumed to have been recorded with time.time()
# at the top of the script.
script_total_time = script_end_time - script_start_time
print(f"Script ran fully in {script_total_time / 60:.2f} minutes.")
```
11. Conclusion
This script automates the process of extracting text, generating embeddings, and uploading files to FaithCopilot, utilizing cloud services like Pinecone, Upstash, and Cloudflare.