RE: [VuFind-General] Structural marc problems


Demian Katz

Jun 26, 2012, 3:29:10 PM
to kevin smith, hori...@mailman.xmission.com, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com

I'm copying this to solrmarc-tech in case anyone there has some ideas.  Are you able to open your MARC records with a tool like MarcEdit, or convert them with a tool like yaz-marcdump?  It might be interesting to do some cross-checking.  Maybe you can fix problems with a different tool prior to importing… or perhaps another tool will give you a better idea of exactly what is wrong.
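
For instance, just running the exported file through yaz-marcdump with no options prints a human-readable dump, and it will usually complain at the point where the structure breaks down (the exact messages vary by YAZ version):

 yaz-marcdump records.mrc > records.txt   # structural complaints typically go to stderr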

 

- Demian

 

From: kevin smith [mailto:ash...@gmail.com]
Sent: Tuesday, June 26, 2012 3:01 PM
To: hori...@mailman.xmission.com; vufind-...@lists.sourceforge.net
Subject: [VuFind-General] Structural marc problems

 

Hello,

 

I have run into some issues importing records into VuFind from Horizon.  It had been going along fine for weeks, but now I am getting all sorts of errors.  Something like this:

 

ERROR [main] (MarcImporter.java:257) - Error reading record: For input string: "ocm6"

ERROR [main] (MarcImporter.java:257) - Error reading record: unable to parse record length

ERROR [main] (MarcImporter.java:257) - Error reading record: null

ERROR [main] (MarcImporter.java:257) - Error reading record: Directory length is not a multiple of 12 bytes long. Unable to continue.

ERROR [main] (MarcImporter.java:257) - Error reading record: unable to parse record length

ERROR [main] (MarcImporter.java:257) - Error reading record: null

ERROR [main] (MarcImporter.java:257) - Error reading record: Directory length is not a multiple of 12 bytes long. Unable to continue.

ERROR [main] (MarcImporter.java:257) - Error reading record: unable to parse record length

ERROR [main] (MarcImporter.java:257) - Error reading record: null

ERROR [main] (MarcImporter.java:257) - Error reading record: Directory length is not a multiple of 12 bytes long. Unable to continue.

ERROR [main] (MarcImporter.java:257) - Error reading record: unable to parse record length

ERROR [main] (MarcImporter.java:257) - Error reading record: null

 

I am not sure if this has to do with a record that is too long for MARC, or some other leader problem.  I have been looking at this blog post about some Horizon-specific issues: http://bibwild.wordpress.com/2010/02/02/structural-marc-problems-you-may-encounter/

 

I am just not sure where to go from here.  Are there queries I can run against the Horizon database to check, and fix, the integrity of the MARC records?

 

Thanks,

 

--
Kevin Smith
Digital Library Manager
Wake County Public Libraries

Demian Katz

Jun 26, 2012, 3:49:29 PM
to kevin smith, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com

Are you able to export in MARC-XML?  The XML format doesn’t have the same size restrictions as binary MARC, so that might help you extract workable records.
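
If Horizon will only give you binary MARC, yaz-marcdump can also do the conversion (though it may of course trip over the same structural problems; it has -f/-t options if you need to translate MARC-8 to UTF-8 along the way):

 yaz-marcdump -i marc -o marcxml records.mrc > records.xml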

 

As for pinpointing the problem record, can you export specific ranges of record numbers?  If nothing else, you can try extracting different chunks until you hit the problem.  If the issue cropped up in the past couple of weeks, chances are the bad record was modified or added recently, which might help you track it down.
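
If it helps with the chunking, I believe newer yaz-marcdump builds can also split a file into pieces for you -- something like this (check yaz-marcdump -h for whether your version has the -s and -C options):

 yaz-marcdump -i marc -o marc -s chunk -C 1000 records.mrc

That should write the records back out as binary MARC in numbered files whose names start with "chunk", 1000 records per file.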

 

- Demian

 

From: kevin smith [mailto:ash...@gmail.com]
Sent: Tuesday, June 26, 2012 3:47 PM
To: Demian Katz
Cc: hori...@mailman.xmission.com; vufind-...@lists.sourceforge.net; solrma...@googlegroups.com
Subject: Re: [VuFind-General] Structural marc problems

 

When I try to open the file with MarcEdit, I get the error "Record too large (larger than 99,999 bytes). Error Number: -7".

So it looks like I have a record that is too large.  Anyone know how I can identify which record or records are causing the problem?

Simon Spero

Jun 26, 2012, 5:25:54 PM
to solrma...@googlegroups.com, kevin smith, vufind-...@lists.sourceforge.net
The record length in a valid MARC leader is stored as five ASCII digits.  Since none of the valid characters in the sixth position (the record status) are digits, you can use this to spot the naughty records.  There are two cases: at the start of the file, and after other records.

1) If the oversize record is the first record in the file, the first six or more characters of the file will be ASCII digits.

2) If an oversize record is not the first record in the file, it will show up as a sequence of six or more ASCII digits immediately following an ASCII GS character (the record terminator, or RT, in MARC-speak).  The code point for GS is hex 0x1D / octal 035 / decimal 29.
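
A quick, untested way to flag the oversize records (POSIX awk, treating the GS byte as the record separator):

 LC_ALL=C awk 'BEGIN { RS = "\035" }          # \035 = record terminator
   length($0) + 1 > 99999 {                   # +1 for the stripped terminator
     printf "record %d: %d bytes, leader: %s\n", NR, length($0) + 1, substr($0, 1, 24)
   }' Old-File.mrc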

This might work at stripping out the bad records (untested):

 RT=$'\035'   # the MARC record terminator (ASCII GS, 0x1D)
 # the first expression drops an oversize record at the start of the file;
 # the second drops one that follows another record's terminator
 sed -E -e "s/^[0-9]{6,}[^$RT]*$RT//" -e "s/$RT[0-9]{6,}[^$RT]*$RT/$RT/" <Old-File.mrc >New-File.mrc
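
As a sanity check afterwards, you can count the record terminators in each file to see how many records were dropped:

 tr -cd $'\035' <Old-File.mrc | wc -c
 tr -cd $'\035' <New-File.mrc | wc -c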

Simon


Robert Haschart

Jun 27, 2012, 2:50:47 PM
to kevin smith, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com
Kevin,

How big is the file that contains the bad record?  If the file is not so large that emailing it would be prohibitive, you could send that file of records to me so that I could see what is going on.

SolrMarc (and the Marc4j library that it relies on) ought to be able to handle records that are too large.  I have encountered records, and processed them successfully, where not only was the record greater than 99999 bytes long, but the directory portion of the MARC record was greater than 99999 bytes.  So the record ought to be handled (unless, for instance, the record creation software does something like writing out the record length using however many bytes are needed to represent it, which I think is not one of the error modes that the MarcPermissiveStreamReader is designed to handle).

Actually, now that I think of it, it would be worth making sure that the property marc.permissive = true is set in your import.properties file.

-Bob Haschart

On 6/26/2012 3:47 PM, kevin smith wrote:
When I try to open the file with MarcEdit, I get the error "Record too large (larger than 99,999 bytes). Error Number: -7".
So it looks like I have a record that is too large.  Anyone know how I can identify which record or records are causing the problem?


Simon Spero

Jun 27, 2012, 3:36:21 PM
to solrma...@googlegroups.com


On Jun 27, 2012 2:50 PM, "Robert Haschart" <rh...@virginia.edu> wrote:
> So the record ought to be handled (unless, for instance, the record creation software does something like writing out the record length using however many bytes are needed to represent it, which I think is not one of the error modes that the MarcPermissiveStreamReader is designed to handle).

I think this is what happens; I'm not 100% sure about the directory offsets, but I think this will be the case.

The record can be recovered iff no data field has a length that overflows its directory slot, since a running count can be made on the assumption that there is no unallocated space.  All that is needed from the directory is the tag; the rest can be inferred from the presence or absence of field terminators, combined with the assumption that offsets are monotonically increasing (there might be stray field terminators in the middle of a field, e.g. from cut and paste).
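
A rough, untested sketch of that recovery (POSIX awk; assumes every field really does end with a field terminator, and Bad-Record.mrc stands in for a file holding the problem record; subfield delimiters will come out as raw 0x1F bytes):

 LC_ALL=C awk 'BEGIN { RS = "\035"; FS = "\036" }   # record / field terminators
 {
   head = $1                               # 24-byte leader plus the directory
   n = NF - 2                              # field bodies are $2 .. $(NF-1)
   for (i = 0; i < n; i++) {
     tag = substr(head, 25 + 12 * i, 3)    # trust only the tag in each 12-byte entry
     printf "%s %s\n", tag, $(i + 2)       # pair it with the i-th field body
   }
 }' Bad-Record.mrc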

Jonathan Rochkind

Jun 27, 2012, 3:51:01 PM
to solrma...@googlegroups.com
Yeah, I've had SolrMarc successfully handling illegal "too large" MARC
records too.  I write em out again as MarcXML, but read em in as
'illegal' too-large MARC.

The key for my records was that initially my ILS exporter was writing
these 'too large' records out using _six_ ASCII bytes to express the
length -- this made the byte offsets for the rest of the leader (and the
rest of the record) off by one, and SolrMarc/Marc4J could _not_ handle
that.

However, once I fixed things so these too-large records were written out
with a not-correct but only-five-byte length (either 99999 or 00000
actually worked, I think), SolrMarc was reading em in fine for me.

I am not sure if I use marc.permissive=true (or if there are any other
settings that might affect that?).  Actually, I can look and see... here
are the settings I can find in my config file that may or may not be
relevant:

marc.to_utf_8 = true
marc.permissive = true
marc.default_encoding = MARC8
marc.include_errors = false



Alan Rykhus

Jun 27, 2012, 4:03:45 PM
to solrma...@googlegroups.com
Hello,

Making the inference that you can look at the tags and decode the record
might make sense to you, but it is not correct.  If the record is longer
than 99999 bytes, it cannot be encoded in MARC and you should not expect
to be able to decode it with anything.  A good suggestion was to export
the record in MARC-XML.

Why can't it be encoded in MARC?

1. It is larger than 99999 bytes. This limit was set a long time ago.
There is really no reason to have records larger than that.

2. The directory has a 12-character entry for each field.  Characters 0-2
are the field tag.  Characters 3-6 are the length of the field.
Characters 7-11 are the starting position of the field, counted from the
end of the directory.  That starting position is limited to 5 digits, so
your extremely large record breaks the directory here: you would need 6
digits to state where a field begins.  (For example, the well-formed
entry 245004500210 means field 245, 45 bytes long, starting 210 bytes
past the end of the directory.)

While you can say that you can look for the separator characters (decimal
29, 30, and 31 -- hex 0x1D, 0x1E, and 0x1F) and decipher the records that
way, the record will still not meet the MARC standard.  If that were the
correct way to decipher a record, the directory would not be needed.

There is a ton of software out there that is written to the MARC
standard.  Please don't suggest that there are other ways to do it.

al
> --
> You received this message because you are subscribed to the Google
> Groups "solrmarc-tech" group.
> To post to this group, send email to solrma...@googlegroups.com.
> To unsubscribe from this group, send email to solrmarc-tech
> +unsub...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.

--
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities
(507)389-1975
alan....@mnsu.edu
"It's hard to lead a cavalry charge if you think you look funny on a
horse" ~ Adlai Stevenson

Jonathan Rochkind

Jun 27, 2012, 4:08:18 PM
to solrma...@googlegroups.com, Alan Rykhus
Most of us have more experience with software that violates the MARC
standard than software that follows it, I'm afraid.  Our task is working
around it.

As for using SolrMarc's own indexing stream to translate from illegal
MARC (over 99999 bytes) to legal MarcXML: I don't see why there's
anything wrong with that.

Obviously you don't have to do it if you don't want to.

I have an ILS that outputs illegal MARC longer than 99999 bytes.  I could
tell (and have told) my vendor to fix it, but they have not, so I've got
to deal with it.

(The only way they could really fix it would be by providing a tool to
output MarcXML or marc-in-json instead of ISO 2709 MARC, because
there REALLY IS more than 99999 bytes of data in the record.  I'm not
sure what you mean by "There is really no reason to have records larger
than that" -- what if I do anyway?)

What would you suggest I do -- simply avoid ever doing a MARC export from
my ILS until my vendor changes their software, or until my institution
replaces the ILS with another product?  That is not very realistic; I
prefer to get things done by working around problems.

Alan Rykhus

Jun 27, 2012, 4:34:45 PM
to solrma...@googlegroups.com
Hello Jonathan,

I'm just saying: don't expect things to work with illegal MARC records.
These are illegal, malformed, whatever-you-want-to-call-them MARC
records.

As for the size, MARC records are supposed to describe the material, not
be it.  I have librarians in my consortium who create these monstrosities
too, and then complain that they don't work in the system.

If SolrMarc can be used as a tool to fix your problems, so be it.  Just
don't cause a problem for me, because I'm using SolrMarc too.

al

Simon Spero

Jun 27, 2012, 6:25:46 PM
to solrma...@googlegroups.com
The preferred term is "undocumented MARC records".  There are tens of millions of them.

You are free to configure your installation to bail out on hitting these records by setting marc.permissive=false; intolerance is tolerated.

Simon  
