		| PDF Overview - Peering into the Internals of PDF | 
	
		| Author:
		Ayush Anand | 
	
		|  
		
			|   | Portable Document Format (PDF) is a file format for representing 
			documents in a manner independent of the application software, 
			hardware, and operating system used to create them and of the output 
			device on which they are to be displayed or printed. |  | 
	
		|  | 
	
		| In this introductory article I will explain the internals of a PDF 
			document, its structure and its components, with examples and 
			screenshots.  It should help you understand the inner workings of PDF 
			documents and will be especially useful if you are into PDF malware analysis. | 
	
		|  | 
	
		|  | 
	
		|  | 
	
		
	
		
		 |  | 
	
	
		| PDF syntax consists of four main 
			components: | 
		 | 
				Objects
				File Structure
				Document Structure
				Content Stream | 
	
		 |  | 
	
	
	
		 |  | 
	
	 
		 	
	
 |  | 
 | A PDF file consists primarily of objects, of which there are eight 
		types: | 
 | 
				Boolean values, representing true or false
				Numbers, both integer and real
				Strings
				Names
				Arrays, ordered collections of objects
				Dictionaries, collections of objects indexed by names
				Streams, usually containing large amounts of data
				The null object, denoted by the keyword null | 
	
	
		 | I will explain these objects in more 
			detail in the following sections. | 
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	
		 
			  
	
	
		 |  | 
	
	
		 | String objects can be represented in two 
			ways: | 
	
	
		 | 
				Literal Strings
				Hexadecimal Strings | 
	
	
		 | A literal string consists of any number of 
			characters between opening and closing parentheses. | 
	
	
		 |  | 
	
	
		 
			| Example:
 (This is a string object)
 If a string is too long to fit on one line, it can be split across lines 
			by ending each line with a backslash, as shown below. The backslash and the line break that follows it are not part of the string.
 (This is a very long\
 String.)
 A hexadecimal string consists of hexadecimal digits 
			enclosed in angle brackets.
 Example:
 <A0C1D2E3F1>
 | 
	
	
		 |  | 
	
	
		 | Here each pair of hexadecimal digits defines one 
			byte of the string. | 
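The two string forms can be decoded with a few lines of Python. This is a minimal sketch, not a full parser: the function names `decode_hex_string` and `join_continued_literal` are hypothetical, and the literal-string helper only handles the backslash line continuation described above, not the other escape sequences the PDF specification defines.

```python
import re


def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<A0C1D2E3F1>' into bytes."""
    # Drop the angle brackets and any whitespace between the digits.
    digits = re.sub(r"\s+", "", token.strip()[1:-1])
    # With an odd number of digits, the missing final digit is taken as 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


def join_continued_literal(token: str) -> str:
    """Return the body of a literal string with backslash-newline pairs removed."""
    body = token.strip()[1:-1]  # drop the outer parentheses
    return re.sub(r"\\(\r\n|\r|\n)", "", body)
```

For the example above, `decode_hex_string("<A0C1D2E3F1>")` yields the five bytes A0 C1 D2 E3 F1.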
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	
	
		 
			  
	
	
		 |  | 
	
	
		 | A name object is uniquely defined by a 
			sequence of characters. A slash character (/) introduces a name. | 
	
	
		 |  | 
	
	
		 
			| Example:
 /secsavvy
 /SecSavvy
 These are two different names.
 /Sec#20Savvy 
			means "Sec Savvy"; 20 is the hexadecimal code for the space character.
 Note: PDF 
			is case-sensitive.
 | 
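The #xx notation can be resolved with a short helper. The sketch below assumes the name has already been isolated as a token; `decode_name` is a hypothetical function name, not part of any library.

```python
import re


def decode_name(token: str) -> str:
    """Resolve #xx hexadecimal escapes in a PDF name token such as '/Sec#20Savvy'."""
    if not token.startswith("/"):
        raise ValueError("a PDF name token must start with '/'")
    # Replace each '#' followed by two hex digits with the character it encodes.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )
```

Normalising names this way is handy in malware analysis, because a name written as /J#61vaScript decodes to the same name as /JavaScript.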
	
	
		 |  | 
	
	
		 |  | 
		 |  | 
			
			 
			  
	
	
		 |  | 
	
	
		 | An array object is an ordered collection of objects. 
			A PDF array can be heterogeneous, that is, its elements may be of 
			different types. It is written as a sequence of objects enclosed in square 
			brackets. | 
	
	
		 |  | 
	
	
		 
			| Example:
 [1 (string) /Name 3.14]
 | 
	
	
		 |  | 
	
	
		 |  | 
		 |  | 
		
	 
			  
	
	
		 |  | 
	
	
		 | A dictionary object consists of pairs of 
			objects. The first element of each pair is the key and the second is the value. 
 The key must be a name. A dictionary is written as a sequence of 
			key-value pairs enclosed in double angle brackets (<< … >>).
 | 
	
	
		 |  | 
	
	
		 
			| Example:
 << /Type /Pages
 /Kids [ 4 0 R ]
 /Count 1
 >>
 Here /Count is a 
			key and 1 is its value.
 | 
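To see how the object types nest, it can help to write them out from ordinary Python values. The following is an illustrative sketch only: `Name`, `Ref` and `to_pdf` are hypothetical helpers, `Ref` stands for a token such as 4 0 R in the example above, and floats are assumed to have a short plain decimal form.

```python
class Name(str):
    """Marks a Python string as a PDF name rather than a PDF string."""


class Ref:
    """An 'N G R' token pointing at another object in the file."""

    def __init__(self, num: int, gen: int = 0):
        self.num, self.gen = num, gen


def to_pdf(value) -> str:
    """Serialise a Python value using the PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, Name):  # must come before str: Name is a str subclass
        return "/" + value
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Escape the backslash first, then the parentheses.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        pairs = " ".join(f"{to_pdf(Name(k))} {to_pdf(v)}" for k, v in value.items())
        return "<< " + pairs + " >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


pages = {"Type": Name("Pages"), "Kids": [Ref(4)], "Count": 1}
print(to_pdf(pages))
```

The dictionary keys are converted to names automatically, which mirrors the rule that a key must be a name.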
	
	
			 |  | 
	
	
		 |  | 
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | A stream object, like a string object, is 
			a sequence of bytes. A stream can be of unlimited length, whereas a 
			string is subject to an implementation limit. For this reason, 
			objects with potentially large amounts of data, such as images and 
			page descriptions, are represented as streams. 
 A stream 
			consists of a dictionary followed by zero or more bytes bracketed 
			between the keywords stream and endstream:
 | 
	
	
		 |  | 
	
	
		 
			| dictionary 
 stream
 ... Zero or more bytes ...
 endstream
 | 
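The layout above can be sketched in Python as follows. The helper names `make_stream` and `read_stream` are hypothetical, and the reader assumes that the /Length entry of the stream dictionary (the same entry that appears in the filter example in this article) is written as a direct integer; in real files it may instead be an indirect reference.

```python
import re


def make_stream(data: bytes) -> bytes:
    """Wrap raw bytes in a stream dictionary and the stream/endstream keywords."""
    dictionary = f"<< /Length {len(data)} >>".encode("ascii")
    return dictionary + b"\nstream\n" + data + b"\nendstream"


def read_stream(obj: bytes) -> bytes:
    """Return the stream bytes, trusting /Length rather than searching for endstream."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = re.search(rb"\bstream\r?\n", obj).end()
    return obj[start:start + length]
```

Reading exactly /Length bytes matters because the data itself may happen to contain the bytes "endstream". When analysing suspicious files it is still worth comparing the declared length with the actual position of the endstream keyword.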
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | Objects may be labeled so that they can be 
			referred to by other objects. A labeled object is called an indirect 
			object. | 
	
	
		 |  | 
	
	
		 
			| Example: consider this object, in which
 obj and endobj are keywords.
 
 10 0 obj
 (SecSavvy String)
 endobj
 
 This defines a string object 
			with object number 10 and generation number 0.
 The object can be referred to elsewhere in the file by 
			the indirect reference
 10 0 R
 | 
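A quick way to get an overview of a file is to index its indirect objects and list the references each one makes. The sketch below uses regular expressions over the raw bytes, which is an assumption of convenience rather than a real parser: it can be fooled by stream data that happens to contain these keywords. `index_objects` and `find_references` are hypothetical names, and the second object in the sample is made up for illustration.

```python
import re

# "N G obj ... endobj" and "N G R", matched on raw bytes.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the body of each indirect object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(data)
    }


def find_references(body: bytes) -> list:
    """List the (object number, generation number) pairs referenced in a body."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]


sample = (
    b"10 0 obj\n(SecSavvy String)\nendobj\n"
    b"11 0 obj\n<< /Title 10 0 R >>\nendobj\n"
)
objects = index_objects(sample)
```

Here `objects[(10, 0)]` holds the string from the example, and `find_references(objects[(11, 0)])` reports that object 11 points at object 10.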
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | A filter is an optional part of the 
			specification of a stream, indicating how the data in the stream 
			must be decoded before it is used. For example, if a stream has an 
			ASCIIHexDecode filter, an application reading the data in that 
			stream will transform the ASCII hexadecimal-encoded data in the 
			stream into binary data. 
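 Data encoded using LZW and then ASCII 
			base-85 encoding (in that order) can be decoded using the following 
			entry in the stream dictionary. The filters are applied in the order 
			listed, so the base-85 decoding runs first and the LZW decoding second:
 
 /Filter [ /ASCII85Decode /LZWDecode ]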
 | 
	
	
		 |  | 
	
	
		 
			| Example:
 1 0 obj
 << /Length 534 /Filter [ /ASCII85Decode /LZWDecode ]>>
 
 stream
 
 J..)6T`?p&<!J9%_[umg"B7/Z7KNXbN'S+,*Q/&"OLT'FLIDK#!n`$"<Atdi`\Vn%b%)&'cA*VnK\CJY(sF>c!Jnl@RM]WM;jjH6Gnc75idkL5]+cPZKEBPWdR>FF(kj1_R%W_d&/jS!;iuad7h?[L-F$+]]0A3Ck*$I0KZ?;<)CJtqi65XbVc3\n5ua:Q/=0$W<#N3U;H,MQKqfg1?:lUpR;6oN[C2E4ZNr8Udn.'p+?#X+1>0Kuk$bCDF/(3fL5]Oq)^kJZ!C2H1'TO]Rl?Q:&'<5&iP!$Rq;BXRecDN[IJB`,)o8XJOSJ9sDS]hQ;Rj@!ND)bD_q&C\g:inYC%)&u#:u,M6Bm%IY!Kb1+":aAa'S`ViJglLb8<W9k6Yl\\0McJQkDeLWdPN?9A'jX*al>iG1p&i;eVoK&juJHs9%;Xomop"5KatWRT"JQ#qYuL,JD?M$0QP)lKn06l1apKDC@\qJ4B!!(5m+j.7F790m(Vj88l8Q:_CZ(Gm1%X\N1&u!FKHMB~>
 
 endstream
 endobj
 | 
	
	
		 |  | 
	
	
		 | Here is the list of standard filters: | 
	
	
		 | 
				ASCIIHexDecode
				ASCII85Decode
				LZWDecode
				FlateDecode
				RunLengthDecode
				CCITTFaxDecode
				JBIG2Decode
				DCTDecode
				JPXDecode
				Crypt | 
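A filter chain can be undone by applying one decoder per entry of the /Filter array, in the listed order. The sketch below covers only three of the standard filters using the Python standard library. Note the substitution: the standard library has no LZW decoder, so the demonstration uses FlateDecode (zlib) in place of the LZWDecode filter from the example above. The names `DECODERS` and `apply_filters` are hypothetical.

```python
import base64
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks the end of data."""
    digits = bytes(c for c in data if c not in b" \t\r\n\f\x00")
    digits = digits.split(b">", 1)[0]
    if len(digits) % 2:  # odd digit count: pad the final digit with 0
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def ascii85_decode(data: bytes) -> bytes:
    """ASCII85Decode: strip the '~>' end-of-data marker, then decode."""
    data = data.strip()
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": zlib.decompress,
}


def apply_filters(data: bytes, filters: list) -> bytes:
    """Apply the decoders named in a /Filter array, first entry first."""
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"no decoder for /{name} in this sketch")
        data = DECODERS[name](data)
    return data


# Round trip: compress, then base-85 encode, then decode in the reverse order.
raw = b"SecSavvy " * 20
encoded = base64.a85encode(zlib.compress(raw)) + b"~>"
decoded = apply_filters(encoded, ["ASCII85Decode", "FlateDecode"])
```

Malicious documents often stack several filters on one stream, so a loop of this shape is usually the first step towards reading the embedded content.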
	
		 
			|  | 
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | A PDF file consists of four main elements: | 
	
	
		 | 
				A header identifying the version of the PDF specification
				A body containing the objects that make up the 
				document contained in the file
				A cross-reference table containing information about 
				the indirect objects in the file
				A trailer giving the location of the cross-reference 
				table and of certain special objects within the body of the 
				file | 
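The four elements can be seen together by assembling a tiny file by hand. This is a rough sketch under stated assumptions: `build_pdf` is a hypothetical helper, the version in the header (1.4) is an arbitrary choice, and the /Size and /Root trailer entries, the startxref keyword and the %%EOF marker come from the PDF specification rather than from this article. The result is a structurally minimal one-page document and has not been tuned for any particular viewer.

```python
def build_pdf(objects: list) -> bytes:
    """Assemble header, body, cross-reference table and trailer into one file."""
    out = bytearray(b"%PDF-1.4\n")                      # 1. header
    offsets = []
    for number, body in enumerate(objects, start=1):   # 2. body
        offsets.append(len(out))                        # byte offset of this object
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    size = len(objects) + 1                             # objects plus free entry 0
    out += f"xref\n0 {size}\n".encode("ascii")          # 3. cross-reference table
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (                                            # 4. trailer
        f"trailer\n<< /Size {size} /Root 1 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode("ascii")
    return bytes(out)


pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
])
```

Because the header line is nine bytes long, the first object starts at byte offset 9, which is why a cross-reference entry of 0000000009 is so common for object 1.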
	
	 |  | 
	
		 
			|  | 
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | The cross-reference table contains 
			information that permits random access to indirect objects within 
			the file so that the entire file need not be read to locate any 
			particular object. The table contains a one-line entry for each 
			indirect object, specifying the location of that object within the 
			body of the file. 
 Each cross-reference section begins with a 
			line containing the keyword xref. Following this line are one or 
			more cross-reference subsections, which may appear in any order.
 
 Each cross-reference subsection contains entries for a 
			contiguous range of object numbers. The subsection begins with a 
			line containing two numbers separated by a space: the object number 
			of the first object in this subsection and the number of entries in 
			the subsection. For example, the line
 
 0 8
 
 introduces a subsection containing eight objects numbered 
			consecutively from 0 to 7.
 | 
	
	
		 |  | 
	
	
		 
			| xref
 0 8
 0000000000 65535 f
 0000000009 00000 n
 0000000074 00000 n
 0000000120 00000 n
 0000000179 00000 n
 0000000364 00000 n
 0000000466 00000 n
 0000000496 00000 n
 | 
	
	
		 |  | 
	
	
		 | In an in-use entry (marked n), the first field, such as 0000000009, 
			is a 10-digit byte offset giving the number of bytes from the beginning 
			of the file to the beginning of the object. In a free entry (marked f), 
			the first field, here 0000000000, is the 
			10-digit object number of the next free object. The second field of 
			each entry is a 5-digit generation number.
 | 
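The table above can be read mechanically. The following sketch parses a classic cross-reference section given as text; `parse_xref_section` is a hypothetical helper, and it does not handle the cross-reference streams used by some newer files.

```python
def parse_xref_section(text: str) -> dict:
    """Map object number to (first field, generation number, 'n' or 'f')."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("cross-reference section must start with 'xref'")
    entries = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for number in range(first, first + count):
            field, generation, kind = lines[i].split()
            entries[number] = (int(field), int(generation), kind)
            i += 1
    return entries


table = parse_xref_section("""\
xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000364 00000 n
0000000466 00000 n
0000000496 00000 n
""")
```

For this table, object 1 is in use at byte offset 9, and object 0 is the head of the free list. Comparing these offsets with where the objects actually sit in the file is a simple consistency check when examining a suspicious document.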
	
	
		 |  | 
	
		 |  | 
	
		 |  | 
	
	 
			  
	
	
		 |  | 
	
	
		 | Here is a series of screenshots 
			showing different parts of a sample PDF document. | 
	
	
		 
			|  | 
	
		 
			|  | 
	
	
		 
			|  | 
	
		 
			|  | 
	
	
		 |  | 
	
	
		 
			|  | 
	
	
		 |  | 
	
	
		 |  | 
	
	
		 |  | 
	
	
	
		 
			  
	
	
		 |  | 
	
	
		 |  | 
	
	
	 |  | 
	
	
		 |  | 
	
	
		 
			  
	
	
		 |  | 
	
	
		 | This article briefly explains the internals 
			of a PDF document, its structure and its components, with examples and 
			detailed screenshots.  I hope it will help you in 
			malware research work revolving around PDF documents. 
 It should be enough for beginners, but advanced users are advised to 
			read through the reference white paper for more granular details.
 | 
	
	
		 
			|  | 
	
	