Processing huge CSV files in Laravel

I regularly work with CSV files made up of millions of lines and dozens of fields; these files can weigh two to three gigabytes. Such volumes need to be handled carefully and efficiently, and I want to share the strategy I have developed for this over the past couple of years.
Use a CSV library.
My first and most important recommendation may seem obvious, but surely we have all had the thought: "It's just a CSV, how hard can it be? I'll parse it myself!"
The correct answer: it's hard!
If you don't believe me, read the list "Falsehoods Programmers Believe About CSVs" and ask yourself: are you really ready to handle all of those subtleties? I didn't even know about most of them!
So just use a proper CSV parser. I recommend League/CSV: it is fantastically user-friendly, well tested, and actively supported.
Never trust a data provider.
The CSV files we load into our system are mostly exported from other systems, including government ones. "Never trust a data provider" may sound harsh, but rest assured they will constantly change something about those files, and it's better to be ready for it.
Without notification, they can:
- change the order of fields
- rename a field
- add or remove a field
- start escaping certain characters
We download the data automatically on a schedule outside of business hours. The system receives the file, opens it, and imports the data. This means we need to make sure the file is in the expected format before we start importing millions of lines.
Consistency of headers.
If you are using the League/CSV package, the order of the fields in the source does not matter, because each record comes back as an associative array keyed by field name. But it does matter if a field is renamed or removed altogether: the data you rely on is suddenly missing, which leads to errors.
It is important to catch these problems right away instead of silently importing empty data.
We do this in the following way: we set up our “expectations” for the field headers, and if they do not match, then we throw an exception and stop the import process.
We usually get them directly from the file:
protected function validateHeaders($path, $expectedHeaders)
{
    $headersFromFile = trim(shell_exec("head -1 $path"));

    if ($expectedHeaders !== $headersFromFile) {
        Log::error('Expected: ' . $expectedHeaders);
        Log::error('Actual: ' . $headersFromFile);

        throw new Exception('Headers do not match. Check the logs.');
    }
}
You can also check the first record returned by the Reader object:
protected function validateHeaders($reader, $expectedHeaders)
{
    $headersFromFile = array_keys($reader->fetchOne());

    if ($expectedHeaders !== $headersFromFile) {
        Log::error('Expected: ' . implode(',', $expectedHeaders));
        Log::error('Actual: ' . implode(',', $headersFromFile));

        throw new Exception('Headers do not match. Check the logs.');
    }
}
I recommend checking the headers to make sure you are working with the correct data.
No headers?
It’s hard to check headers if they don’t exist at all!
This happens quite often. In that case, we treat the first row of data as if it were a header and track changes to it. Most of the data we receive is relatively stable, so although this occasionally produces false positives, we rarely need to update this "pinned" first row.
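Here is a minimal sketch of that idea, assuming a hypothetical validateFirstRow() helper and a stored $expectedFirstRow snapshot; it reuses the same "compare and bail out" approach as the header check above.

protected function validateFirstRow($path, $expectedFirstRow)
{
    // $expectedFirstRow is a snapshot of the first data row taken the last
    // time the import ran successfully (hypothetical).
    $firstRowFromFile = trim(shell_exec("head -1 $path"));

    if ($expectedFirstRow !== $firstRowFromFile) {
        Log::error('Expected first row: ' . $expectedFirstRow);
        Log::error('Actual first row: ' . $firstRowFromFile);

        throw new Exception('First row changed. Check the logs.');
    }
}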
Proper escaping.
The League/CSV package does most of the work for you, but there are times when the data provider has escaped the data in a non-standard way. You should always verify the data rather than blindly load it into the system.
We recently ran into an issue where the vendor was not escaping a character properly, causing two records to merge into one. Naturally, the process did not stop with an error: the library has no way of knowing about this, it just follows the CSV specification. We only discovered the problem when we tried to insert the resulting value into the database and got a "value too long" error. Only then did it become clear that the parsing had gone wrong.
The way we are catching this now is a bit tricky, but I’ll cover it briefly.
The first thing we do when a new CSV file arrives is check how many records it contains. Previously we simply counted the number of lines, but some fields may contain line breaks, so we had to come up with another way. Every line in this particular file contains an easily identifiable pattern, for example |043,2020, and if we count how many times it appears, we know how many records there should be.
public function linesInFile($path)
{
    $pattern = "\|043,2020,";

    // 'E' - enables extended regex
    // 'a' - treats the file as text (which it is)
    // 'o' - prints only the matches, not the whole line
    return (int) trim(exec("grep -oaE '$pattern' $path | wc -l"));
}
Knowing how many records there should be, we compare that with how many records we actually got:
public function ensureCsvParsed($path)
{
    $reader = $this->getReader($path);
    $lines = $this->linesInFile($path);

    if ($lines == $reader->count()) {
        // The CSV document contains as many records
        // as there are lines in the file.
        return;
    }

    // Find out which line contains the error...
    throw new Exception('Line [X] contains an error.');
}
If these numbers do not match, then an error has occurred somewhere.
We figure out which line contains the error by repeatedly splitting the file in half and checking again; this binary search eventually gives us the line number with the bad escaping.
I will not post all of the code that does this here, but I will explain the general idea. We start at the very end of the file and fetch a single line:
$lineFromFile = exec("sed -n '{$line},{$line}p' $path");
Then we fetch the record at index $line - 2 from the reader (subtracting one for the header row and one to compensate for zero-based indexing):
$lineFromReader = $reader->fetchOne($line - 2);
We compare these two lines and check whether they contain the same values. If they do not match, the error is somewhere BEFORE this line, so we jump back to the middle of the remaining range and check again. If they do match, we have gone back too far; we remember this as the last "good" line and jump forward.
We repeat this until we land on the "bad" line, which is usually littered with stray backslashes and quotes. Congratulations, you've found it!
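A minimal sketch of this binary search might look like the following; the boundary arithmetic and the naive line comparison (which ignores quoting) are assumptions made purely for illustration.

public function findBrokenLine($reader, $path)
{
    // A sketch only: search for the first line where the raw file and the
    // parsed records diverge.
    $low = 2;                              // first data line in the file
    $high = $this->linesInFile($path) + 1; // last line in the file

    while ($low < $high) {
        $middle = intdiv($low + $high, 2);

        $fromFile = trim(exec("sed -n '{$middle},{$middle}p' $path"));
        $fromReader = implode(',', $reader->fetchOne($middle - 2));

        if ($fromFile === $fromReader) {
            // Still matching: the error is somewhere after this line.
            $low = $middle + 1;
        } else {
            // Mismatch: the error is at or before this line.
            $high = $middle;
        }
    }

    return $low; // the first line that no longer matches
}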
Streaming to save memory.
If the file is more than ten megabytes in size, then you need to make sure that you are using the streaming mode, and not loading everything at once into memory.
Streaming a file means it will be loaded into memory in small chunks. This is fairly easy to do and will greatly increase your throughput.
// Load everything into memory (don't do this!)
$reader = Reader::createFromString(file_get_contents('/path/to/my/file.csv'));

// Open as a stream (do this!)
$reader = Reader::createFromPath('/path/to/my/file.csv', 'r');
Fortunately, the PHP League package solves all the problems for us. You just need to make sure that the correct method is called.
Even if your files are not located locally, you can still use streaming mode.
For example, if they are on S3, then you can use the AWS stream wrapper. Here’s how you can get a stream from S3 to Laravel:
public function getStream($disk, $path)
{
    // Get the S3 adapter.
    $adapter = $disk->getDriver()->getAdapter();

    // Register the s3:// stream wrapper.
    $adapter->getClient()->registerStreamWrapper();

    // Now we can use the s3:// protocol.
    $path = 's3://' . $adapter->getBucket() . '/' . $path;

    // Return the stream.
    return fopen($path, 'r', false, stream_context_create([
        's3' => ['seekable' => true]
    ]));
}
// ...
$stream = $this->getStream($disk, $path);
$reader = Reader::createFromStream($stream);
Now you can work with an S3 file chunk by chunk instead of downloading the whole thing first.
Please note that several of the techniques in this article (head, grep, sed) only work with local files, so when those are needed you will have to download the file from S3 and work with it locally.
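A minimal sketch of that download step, assuming Laravel's Storage facade with an 's3' disk and a 'local' disk; the disk names and target path are assumptions:

use Illuminate\Support\Facades\Storage;

// A sketch only: copy the remote CSV to local storage so that shell tools
// like head, grep, and sed can operate on it.
public function downloadToLocal($path)
{
    // Stream the file to the local disk so it never sits fully in memory.
    $stream = Storage::disk('s3')->readStream($path);
    Storage::disk('local')->put($path, $stream);

    // Absolute path that can be handed to shell commands.
    return Storage::disk('local')->path($path);
}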
Page-by-page processing.
If you are trying to process a couple of million lines in a reasonable amount of time, you will quickly hit a wall reading the entire file line by line in a single process.
Instead, take a large page of 50,000-100,000 records and hand it off to a separate process. You can spawn several such processes and handle multiple pages in parallel rather than sequentially.
This is done with a single command responsible for creating “child” commands for importing individual pages:
public function importFile($path)
{
    // Completely depends on you and your requirements
    $maxProcesses = 10;
    $perPage = 50000;

    $pages = ceil($this->linesInFile($path) / $perPage);
    $page = 1;
    $processes = [];

    while ($page <= $pages || count($processes)) {
        $processes = $this->getRunningProcesses($processes);

        // If there is room to start another process
        // and we still have pages left to import,
        // then start a new process
        if ($page <= $pages && count($processes) < $maxProcesses) {
            $command = "php artisan import:page $page $path";

            // Execute the command in the background
            exec($command . ' > /dev/null 2>&1 & echo $!', $processes);

            $page++;
        }

        sleep(10);
    }
}
And then, in the child process, you can use a League/CSV Statement to retrieve the records of a particular page:
public function records($reader)
{
    return (new Statement)
        ->offset($this->page * $this->perPage)
        ->limit($this->perPage)
        ->process($reader);
}
With a paging scheme like this, you control speed and throughput based on the system resources you have.
Processing via Jobs, Commands, and Daemons
Where best to process CSV files is still an open question, but I have a few tips.
JOBS
Processing on the queue is the easiest way: fire and forget. It is ideal for small imports initiated by a discrete action, such as a user uploading a file.
But there are drawbacks. The first is the job timeout. If your imports are huge, you will quickly blow past the default 90-second timeout for queued jobs. You can of course raise it for longer imports, but most likely you will run into the limit again; some of our imports run for several hours.
You may be able to get around this by splitting a long job into many small ones, but then you have to coordinate them: how do you know when an import has finished? Perhaps the new job batching in Laravel 8 can help here, but I have no experience with it yet.
COMMANDS
If the imported data is available programmatically, for example through an API or on an FTP server, you can schedule a command that fetches it and starts processing. This is convenient because you don't have to worry about timeouts. Be sure to call ->runInBackground()->withoutOverlapping() if the scheduled run time is measured in hours.
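In the scheduler, that might look roughly like this (the command name and the run time are assumptions):

use Illuminate\Console\Scheduling\Schedule;

// app/Console/Kernel.php -- a sketch only
protected function schedule(Schedule $schedule)
{
    $schedule->command('import:nightly')
        ->dailyAt('01:00')        // run outside business hours
        ->runInBackground()       // don't block other scheduled tasks
        ->withoutOverlapping();   // skip this run if the previous one is still going
}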
We use this method to import the largest files at night. During the day we upload files to S3, and at night the command wakes up and checks whether there is work for it.
Everything works fine: there is no need to worry about timeouts, and we can spawn several child processes from one "parent" command and get parallel imports.
DAEMONS
As you know, a daemon is simply a process that runs in the background rather than under the direct control of a user. The horizon:work command in Laravel is usually run as a daemon.
Using a daemon for imports is convenient for several reasons. First, like commands, daemons have no timeout. Second, and this is what commands lack, a daemon can start working immediately.
If your imports are based on user-uploaded files, you probably don't want to wait until midnight to start processing them. In fact, even checking every minute is not fast enough: users expect an immediate reaction.
For user-initiated imports, we keep a daemon watching the database for new raw data. As soon as it sees any, it starts processing.
If you implement a strategy like this, make sure to use a lock, otherwise two daemons might start importing the same document and you will end up with duplicate data.
public function handleNextImport()
{
    // Make sure we are the only process looking for new imports
    $handler = Cache::lock('selectNextImport')->get(function () {
        $import = Import::processable()->first();

        if (!$import) {
            return;
        }

        $import->markAsProcessing();

        return $import->makeHandler();
    });

    optional($handler)->handle();
}
In truth, we are not using a real daemon, because setting one up is a hassle. We run a pseudo-daemon instead, which I wrote about in the article "Laravel Pseudo-Daemons".
Data cleansing.
All of your data will come in as strings, because that is how CSV works. Most likely you don't want to store it that way, but rather cast it to the appropriate primitive types. I'll cover some of the problems I ran into while doing this.
REMOVING EXTRA SPACES
The first thing I always do is run the value through utf8_encode and trim off extra whitespace. The UTF-8 encoding step removes some non-standard values that may be present in the text, for example \x00.
At this point I also convert any empty strings to null, which is what they essentially are.
public function clean($record)
{
    foreach ($record as $key => $value) {
        $value = utf8_encode($value);
        $value = trim($value);
        $value = $value === '' ? null : $value;

        $record[$key] = $value;
    }

    return $record;
}
Now that the data is more or less reliable, we can proceed to convert them to the correct primitives.
BOOLEANS
If your field is a true/false value, watch out for all of the following variants (a sketch of a cast that handles them follows the list):
- yes / y / Y / YES
- no / n / N / NO
- true / TRUE
- false / FALSE
- 1
- 0
- (empty)
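A minimal sketch of such a cast, using a hypothetical cast_boolean() helper; anything not in the list above falls back to null:

// A sketch only: normalize the common truthy/falsy spellings to a real bool.
function cast_boolean($value)
{
    $value = is_string($value) ? strtolower(trim($value)) : $value;

    if (in_array($value, ['yes', 'y', 'true', '1', 1, true], true)) {
        return true;
    }

    if (in_array($value, ['no', 'n', 'false', '0', 0, false], true)) {
        return false;
    }

    return null; // empty or unrecognized values
}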
NUMBERS
Casting here depends on whether your number should be an integer or a float, but in either case you need to strip out commas (,) and currency symbols ($).
If you only want whole numbers, I drop the decimal part and then run the value through a digits_only function:
function digits_only($val, $replacement = '')
{
    return preg_replace('/\D/', $replacement, $val);
}

$value = explode('.', $value);
$value = head($value);
$value = digits_only($value);
If you need to handle both integers and floating point, then you can remove the commas and currency indicators, and then use a simple PHP cast.
$value = 0 + $value;
If $value is a float-like string ("123.45"), we get a float, and if it is an integer-like string ("123"), we get an int.
DATES
Hopefully the date format is consistent throughout the file. If so, we use Carbon's createFromFormat method:
public function date($record, $key, $format)
{
    if (Arr::has($record, $key)) {
        $record[$key] = Carbon::createFromFormat($format, $record[$key]);
    }

    return $record;
}
If Carbon cannot create a date from the given string, it throws an exception, which is exactly what we want: better an explicit error than bad data.
EOF
CSV is a fantastic format for transferring data, and almost every system can export to it. If you know a few of its subtleties, importing that data will go smoothly too.