If you have been following along with our DIY AI project, you should have your scheduling script ready to go. Now we are going to use it to start collecting data that we can later use to train the AI we are creating. For this task, we’ll tackle scanning directories for file metadata, a foundational feature of many AI applications.
Why scan directories for metadata?
Metadata about files, such as size, modification date, and file type, is often the starting point for advanced functionalities like data analysis, machine learning preprocessing, or system monitoring. This information can help the AI find files, sort files, and more.
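As a quick illustration (separate from the project script), Python’s standard library exposes this metadata directly through `Path.stat()`. This sketch creates a throwaway file in a temporary folder so it runs anywhere:

```python
from pathlib import Path
from datetime import datetime
import tempfile

# Create a throwaway file so the example is self-contained
tmp = Path(tempfile.mkdtemp()) / "example.txt"
tmp.write_text("hello")

info = tmp.stat()
print("size in bytes:", info.st_size)                         # 5
print("last modified:", datetime.fromtimestamp(info.st_mtime))
print("file type (extension):", tmp.suffix)                   # .txt
```

These are exactly the three pieces of metadata (size, modification date, file type) the scanner below collects for every file it finds.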
At the start, the script will only scan your desktop, because that’s where much of the action happens on many people’s computers, and scanning it can often help with organization. However, it’s easy to point the script at any folder you want, and it will even scan the entire hard drive, though that is resource-intensive. Each scan creates a log, and once you have enough logs, you can move on to the next step, which we’ll discuss a little later.
Breaking down the script
Let’s start by breaking down the script and looking at how it works. Then, I’ll paste the entire code so you can copy and paste it. If you haven’t read the earlier guides, it’s a good idea to review them first so you’re up to speed and have everything you need to continue.
Import essential libraries
You should have all of the libraries you need for this script installed, but you will still need to import them.
- os lets you walk through the directories and get file information.
- pandas organizes the data into a DataFrame for easy handling.
- pathlib simplifies file and path operations.
- datetime generates human-readable timestamps.
import os
import pandas as pd
from pathlib import Path
from datetime import datetime
Create the scan_directory function
This function performs the core task of scanning the directories and gathering the metadata. It walks through the directory and any subdirectories, retrieves each file’s size and when it was last modified, and converts the timestamps to readable dates.
def scan_directory(directory: Path) -> pd.DataFrame:
    paths, names, extensions, sizes, modified_times = [], [], [], [], []
    for root, dirs, files in os.walk(directory):
        for filename in files:
            file_path = Path(root) / filename
            paths.append(str(file_path))
            names.append(file_path.stem)
            extensions.append(file_path.suffix)
            sizes.append(file_path.stat().st_size)
            mod_time = file_path.stat().st_mtime
            modified_times.append(datetime.fromtimestamp(mod_time))
    return pd.DataFrame({
        "path": paths,
        "name": names,
        "extension": extensions,
        "size_bytes": sizes,
        "last_modified": modified_times
    })
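Because the function returns an ordinary pandas DataFrame, you can sort and filter the results with standard pandas calls. Here is a small sketch using hand-made rows in the same column layout (the paths and values are made up for illustration):

```python
import pandas as pd

# Hand-made rows mimicking the columns scan_directory returns
file_df = pd.DataFrame({
    "path": ["/home/u/Desktop/a.txt", "/home/u/Desktop/b.py", "/home/u/Desktop/c.txt"],
    "name": ["a", "b", "c"],
    "extension": [".txt", ".py", ".txt"],
    "size_bytes": [120, 4096, 300],
    "last_modified": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})

# Keep only the .txt files, largest first
txt_files = file_df[file_df["extension"] == ".txt"].sort_values(
    "size_bytes", ascending=False
)
print(txt_files["name"].tolist())  # ['c', 'a']
```

This is the kind of slicing the AI steps later in the series will build on.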
Define the script’s main workflow
This next part of the code tells the script what directory it should scan and prints the first few lines of what it finds so you know it’s working, as it can take some time to finish if your desktop has a lot of files and folders. You can change this to scan any directory.
if __name__ == "__main__":
    directory_to_scan = Path.home() / "Desktop"  # Scans the user's Desktop
    file_df = scan_directory(directory_to_scan)  # Calls the function
    print(file_df.head())  # Displays the first few rows for confirmation
Save the file with a timestamp in the correct folder
This bit of code saves our data to a CSV file with a timestamp in the name, so each time you run it, you’ll get a new file that your AI will be able to compare with the others. It places the file in a folder named Directory_Mapper_Data inside our data folder.
    project_root = Path(__file__).resolve().parent.parent
    data_dir = project_root / "data"
    directory_mapper_dir = data_dir / "Directory_Mapper_Data"
    directory_mapper_dir.mkdir(parents=True, exist_ok=True)  # Creates folders if missing

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # Generates a unique timestamp
    output_filename = f"desktop_files_metadata_{timestamp}.csv"
    output_path = directory_mapper_dir / output_filename

    file_df.to_csv(output_path, index=False)  # Saves the DataFrame to CSV
    print(f"File metadata saved to {output_path}")
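Later steps will need to read these logs back in. One detail worth knowing: CSV stores timestamps as plain text, so they should be parsed on the way back in. This sketch writes a tiny stand-in log to a temporary folder (the filename is just an example of the pattern the script produces) and reloads it:

```python
import pandas as pd
from pathlib import Path
import tempfile

# Write a tiny stand-in log, then reload it the way a later step might
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "desktop_files_metadata_20240101_120000.csv"  # example name

df = pd.DataFrame({
    "name": ["notes"],
    "size_bytes": [1024],
    "last_modified": [pd.Timestamp("2024-01-01 12:00:00")],
})
df.to_csv(csv_path, index=False)

# Without parse_dates, last_modified would come back as a plain string
loaded = pd.read_csv(csv_path, parse_dates=["last_modified"])
print(loaded["last_modified"].iloc[0])  # 2024-01-01 12:00:00
```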
Complete script for directory_scanner.py
import os
import pandas as pd
from pathlib import Path
from datetime import datetime

def scan_directory(directory: Path) -> pd.DataFrame:
    paths, names, extensions, sizes, modified_times = [], [], [], [], []
    for root, dirs, files in os.walk(directory):
        for filename in files:
            file_path = Path(root) / filename
            paths.append(str(file_path))
            names.append(file_path.stem)
            extensions.append(file_path.suffix)
            sizes.append(file_path.stat().st_size)
            mod_time = file_path.stat().st_mtime
            modified_times.append(datetime.fromtimestamp(mod_time))
    return pd.DataFrame({
        "path": paths,
        "name": names,
        "extension": extensions,
        "size_bytes": sizes,
        "last_modified": modified_times
    })

if __name__ == "__main__":
    directory_to_scan = Path.home() / "Desktop"
    file_df = scan_directory(directory_to_scan)
    print(file_df.head())

    project_root = Path(__file__).resolve().parent.parent
    data_dir = project_root / "data"
    directory_mapper_dir = data_dir / "Directory_Mapper_Data"
    directory_mapper_dir.mkdir(parents=True, exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_filename = f"desktop_files_metadata_{timestamp}.csv"
    output_path = directory_mapper_dir / output_filename

    file_df.to_csv(output_path, index=False)
    print(f"File metadata saved to {output_path}")
What’s next?
With the script added to your project, run it once to confirm it scans the directory and creates a saved file in the right folder. Then, add it to the scheduler we made last time so it runs once per day. You can run it more or less often, depending on how frequently your files change.
We’ll need our new script to collect data for at least several days. Once we have some good data, we can start preparing it for machine learning. More on that in the next guide.
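When the time comes to prepare the data, the daily logs can be combined into one DataFrame. This is a sketch of that idea; it fakes two daily logs in a temporary folder (the real version would point at your Directory_Mapper_Data folder instead):

```python
import pandas as pd
from pathlib import Path
import tempfile

# Stand-in for the Directory_Mapper_Data folder, with two fake daily logs
log_dir = Path(tempfile.mkdtemp())
for day, size in [("20240101", 100), ("20240102", 150)]:
    pd.DataFrame({"name": ["report"], "size_bytes": [size]}).to_csv(
        log_dir / f"desktop_files_metadata_{day}_000000.csv", index=False
    )

# Combine every log into one DataFrame, tagging each row with its source file
frames = [
    pd.read_csv(p).assign(source=p.name)
    for p in sorted(log_dir.glob("desktop_files_metadata_*.csv"))
]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 2
```

Tagging each row with its source filename keeps track of which day's scan it came from, which is what lets the later steps compare snapshots over time.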
Follow GeekSided to learn more about AI and Python programming.