Saturday, 12 November 2016

pyspark - Finding files in a directory using Python wildcards but avoiding certain text

Sorry for that train wreck of a title...not sure how else to word it.



I'm ingesting files from a certain directory one category at a time. The category is part of the filename following a very specific format, but there are a few issues throwing my process off.



Example filename:



.../Bike.txt



If there's an overabundance of source data for a particular category, the system will create numbered files to handle the overflow. In that case, the files may look like this:




.../Bike_1.txt



.../Bike_2.txt



I need to grab the files for a particular category regardless of whether it's "Bike.txt" or "Bike_1.txt". I figured I could use a wildcard to find files matching "Bike*.txt". The problem with this is that I may also have a file called something like "Bike_Helmet.txt", and I do not want to ingest that file if I'm currently looking at the bike category.



This is being done using PySpark in Databricks. I've used the glob library up until now to handle this, but I'm not sure it can do what I need here.



To summarize, after specifying a category, I want to find files that match the following formats:




.../[category].txt



.../[category]_[a number].txt



But I do not want to retrieve files that are of the format .../[category]_[non-numeric string].txt.



Is there a way to do this in a single pass, or will I have to ingest based on .../[category].txt first and then .../[category]_[0-9]*.txt a second time?
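One way to do it in a single pass (a sketch, not necessarily the only approach): since glob wildcards alone can't express "underscore followed by digits only", use glob for a coarse first pass and then filter the candidates with a regex that accepts only `[category].txt` or `[category]_[number].txt`. The function name and directory layout below are assumptions for illustration; the resulting list of paths could then be handed to a Spark reader.

```python
import glob
import os
import re


def find_category_files(directory, category):
    """Return paths matching '<category>.txt' or '<category>_<digits>.txt'.

    glob narrows the candidates to '<category>*.txt'; the regex then
    rejects non-numeric suffixes such as 'Bike_Helmet.txt'.
    """
    # (_\d+)? optionally allows an underscore followed by one or more digits.
    pattern = re.compile(re.escape(category) + r"(_\d+)?\.txt")
    candidates = glob.glob(os.path.join(directory, category + "*.txt"))
    return [f for f in candidates if pattern.fullmatch(os.path.basename(f))]
```

For a directory containing `Bike.txt`, `Bike_1.txt`, and `Bike_Helmet.txt`, calling `find_category_files(d, "Bike")` would keep the first two and drop the helmet file, so both formats are picked up in one call rather than two separate ingest passes.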
