{"title": "Learnable Visual Markers", "book": "Advances in Neural Information Processing Systems", "page_first": 4143, "page_last": 4151, "abstract": "We propose a new approach to designing visual markers (analogous to QR-codes, markers for augmented reality, and robotic fiducial tags) based on the advances in deep generative networks. In our approach, the markers are obtained as color images synthesized by a deep network from input bit strings, whereas another deep network is trained to recover the bit strings back from the photos of these markers. The two networks are trained simultaneously in a joint backpropagation process that takes characteristic photometric and geometric distortions associated with marker fabrication and capture into account. Additionally, a stylization loss based on statistics of activations in a pretrained classification network can be inserted into the learning in order to shift the marker appearance towards some texture prototype. In the experiments, we demonstrate that the markers obtained using our approach are capable of retaining bit strings that are long enough to be practical. The ability to automatically adapt markers according to the usage scenario and the desired capacity as well as the ability to combine information encoding with artistic stylization are the unique properties of our approach. As a byproduct, our approach provides an insight on the structure of patterns that are most suitable for recognition by ConvNets and on their ability to distinguish composite patterns.", "full_text": "Learnable Visual Markers\n\nOleg Grinchuk1, Vadim Lebedev1,2, and Victor Lempitsky1\n\n1Skolkovo Institute of Science and Technology, Moscow, Russia\n\n2Yandex, Moscow, Russia\n\nAbstract\n\nWe propose a new approach to designing visual markers (analogous to QR-codes,\nmarkers for augmented reality, and robotic \ufb01ducial tags) based on the advances\nin deep generative networks. In our approach, the markers are obtained as color\nimages synthesized by a deep network from input bit strings, whereas another\ndeep network is trained to recover the bit strings back from the photos of these\nmarkers. The two networks are trained simultaneously in a joint backpropagation\nprocess that takes characteristic photometric and geometric distortions associated\nwith marker fabrication and marker scanning into account. Additionally, a styl-\nization loss based on statistics of activations in a pretrained classi\ufb01cation network\ncan be inserted into the learning in order to shift the marker appearance towards\nsome texture prototype. In the experiments, we demonstrate that the markers ob-\ntained using our approach are capable of retaining bit strings that are long enough\nto be practical. The ability to automatically adapt markers according to the usage\nscenario and the desired capacity as well as the ability to combine information\nencoding with artistic stylization are the unique properties of our approach. As\na byproduct, our approach provides an insight on the structure of patterns that\nare most suitable for recognition by ConvNets and on their ability to distinguish\ncomposite patterns.\n\n1\n\nIntroduction\n\nVisual markers (also known as visual \ufb01ducials or visual codes) are used to facilitate human-\nenvironment and robot-environment interaction, and to aid computer vision in resource-constrained\nand/or accuracy-critical scenarios. 
Examples of such markers include simple 1D (linear) bar codes [31] and their 2D (matrix) counterparts such as QR-codes [9] or Aztec codes [18], which are used to embed chunks of information into objects and scenes. In robotics, AprilTags [23] and similar methods [3, 4, 26] are a popular way to make locations, objects, and agents easily identifiable for robots. Within the realm of augmented reality (AR), ARCodes [6] and similar marker systems [13, 21] are used to enable real-time camera pose estimation with high accuracy, low latency, and on low-end devices. Overall, such markers can embed information into the environment in a more compact and language-independent way than traditional human-readable text signs, and they can also be recognized and used by autonomous and human-operated devices in a robust way.\nExisting visual markers are designed \u201cmanually\u201d based on considerations of the ease of processing by computer vision algorithms, the information capacity, and, less frequently, aesthetics. Once a marker family is designed, a computer vision-based approach (a marker recognizer) has to be engineered and tuned in order to achieve reliable marker localization and interpretation [1, 17, 25]. The two processes of visual marker design on one hand and marker recognizer design on the other hand are thus separated into two subsequent steps, and we argue that such separation makes the corresponding design choices inherently suboptimal. In particular, the third aspect (aesthetics) is usually overlooked, which leads to visually-intrusive markers that in many circumstances might not fit the style of a certain environment and make this environment \u201ccomputer-friendly\u201d at the cost of \u201chuman-friendliness\u201d.\nIn this work, we propose a new general approach to constructing visual markers that leverages recent advances in deep generative learning. To this end, we suggest embedding the two tasks of visual marker design and marker recognizer design into a single end-to-end learning framework. Within our approach, the learning process produces markers and marker recognizers that are adapted to each other \u201cby design\u201d. While our idea is more general, we investigate the case where the markers are synthesized by a deep neural network (the synthesizer network), and where they are recognized by another deep network (the recognizer network). In this case, we demonstrate how these two networks can both be learned by a joint stochastic optimization process.\nThe benefits of the new approach are thus several-fold:\n\n1. As we demonstrate, the learning process can take into account the adversarial effects that complicate recognition of the markers, such as perspective distortion, confusion with background, low resolution, motion blur, etc. All such effects can be modeled at training time as piecewise-differentiable transforms. In this way, they can be embedded into the learning process, which will adapt the synthesizer and the recognizer to be robust with respect to such effects.\n\n2. It is easy to control the trade-offs between the complexity of the recognizer network, the information capacity of the codes, and the robustness of the recognition towards different adversarial effects. 
In particular, one can set the recognizer to have a certain architecture, fix the variability and the strength of the adversarial effects that need to be handled, and then the synthesizer will adapt so that the most \u201clegible\u201d codes for such circumstances can be computed.\n\n3. Last but not least, the aesthetics of the neural codes can be brought into the optimization. Towards this end, we show that we can augment the learning objective with a special stylization loss inspired by [7, 8, 29]. Including such a loss facilitates the emergence of stylized neural markers that look like instances of a designer-provided stochastic texture. While such a modification of the learning process can reduce the information capacity of the markers, it can greatly increase the \u201chuman-friendliness\u201d of the resulting markers.\n\nBelow, we introduce our approach and then briefly discuss its relation to prior art. We then demonstrate several examples of learned marker families.\n\n2 Learnable visual markers\n\nWe now detail our approach (Figure 1). Our goal is to build a synthesizer network S(b; \u03b8S) with learnable parameters \u03b8S that can encode a bit sequence b = {b1, b2, . . . , bn} containing n bits into an image M of size m-by-m (a marker). For notational simplicity in further derivations, we assume that bi \u2208 {\u22121, +1}.\nTo recognize the markers produced by the synthesizer, a recognizer network R(I; \u03b8R) with learnable parameters \u03b8R is created. The recognizer takes an image I containing a marker and infers the real-valued sequence r = {r1, r2, . . . , rn}. The recognizer is paired to the synthesizer to ensure that sign ri = bi, i.e. that the signs of the numbers inferred by the recognizer correspond to the bits encoded by the synthesizer. In particular, we can measure the success of the recognition using a simple loss function based on the element-wise sigmoid:\n\nL(b, r) = \u2212(1/n) \u2211_{i=1}^{n} \u03c3(bi ri) = \u2212(1/n) \u2211_{i=1}^{n} 1 / (1 + exp(\u2212bi ri)),   (1)\n\nwhere the loss is distributed between \u22121 (perfect recognition) and 0.
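For concreteness, the decoding loss (1) translates directly into a few lines of code. The sketch below is our own minimal illustration (the function name and tensor layout are assumptions, not taken from the paper), written for PyTorch tensors where `b` holds the ±1 bits and `r` holds the raw recognizer outputs.

```python
import torch

def decoding_loss(b: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Loss (1): minus the mean sigmoid of b_i * r_i.

    b: (batch, n) tensor with entries in {-1, +1}
    r: (batch, n) tensor of real-valued recognizer outputs
    Returns a scalar in [-1, 0]; -1 corresponds to perfect recognition.
    """
    return -torch.sigmoid(b * r).mean()
```

Minimizing this value pushes sign(ri) towards bi, which is exactly the pairing condition stated above.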
Figure 1: The outline of our approach and the joint learning process. Our core architecture consists of the synthesizer network that converts input bit sequences into visual markers, the rendering network that simulates photometric and geometric distortions associated with marker printing and capturing, and the recognizer network that is designed to recover the input bit sequence from the distorted markers. The whole architecture is trained end-to-end by backpropagation, after which the synthesizer network can be used to generate markers, and the recognizer network to recover the information from the markers placed in the environment. Additionally, we can enforce the visual similarity of markers to a given texture sample by using the mismatch in deep Gram matrix statistics of a pretrained network [7] as a second loss term during learning (right part of the figure).\n\nIn real life, the recognizer network does not get to work with the direct outputs of the synthesizer. Instead, the markers produced by the synthesizer network are somehow embedded into an environment (e.g. via printing or using electronic displays), and later their images are captured by some camera controlled by a human or by a robot. During learning, we model the transformation between a marker produced by the synthesizer and the image of that marker using a special feed-forward network (the renderer network) T(M; \u03c6), where the parameters of the renderer network \u03c6 are sampled during learning and correspond to background variability, lighting variability, perspective slant, blur kernel, color shift/white balance of the camera, etc. In some scenarios, the non-learnable parameters \u03c6 can be called nuisance parameters, although in others we might be interested in recovering some of them (e.g. the perspective transform parameters). During learning, \u03c6 is sampled from some distribution \u03a6 which should model the variability of the above-mentioned effects in the conditions under which the markers are meant to be used.\nWhen our only objective is robust marker recognition, the learning process can be framed as the minimization of the following functional:\n\nf(\u03b8S, \u03b8R) = E_{b\u223cU(n), \u03c6\u223c\u03a6} L(b, R(T(S(b; \u03b8S); \u03c6); \u03b8R)).   (2)\n\nHere, the bit sequences b are sampled uniformly from U(n) = {\u22121; +1}^n, passed through the synthesizer, the renderer, and the recognizer, with the (minus) loss (1) being used to measure the success of the recognition. The parameters of the synthesizer and the recognizer are thus optimized to maximize the success rate.\nThe minimization of (2) can then be accomplished using a stochastic gradient descent algorithm, e.g. ADAM [14]. Each iteration of the algorithm samples a mini-batch of different bit sequences as well as different rendering layer parameter sets, and updates the parameters of the synthesizer and the recognizer networks in order to minimize the loss (1) for these samples.\nPractical implementation. As mentioned above, the components of the architecture, namely the synthesizer, the renderer, and the recognizer, can be implemented as feed-forward networks. The recognizer network can be implemented as a feed-forward convolutional network [16] with n output units. The synthesizer can use multiplicative and up-convolutional [5, 34] layers, as well as element-wise non-linearities.\nImplementing the renderer T(M; \u03c6) (Figure 2) requires non-standard layers. We have implemented the renderer as a chain of layers, each introducing some \u201cnuisance\u201d transformation. We have implemented a special layer that superimposes an input over a bigger background patch drawn from a random pool of images. We use the spatial transformer layer [11] to implement the geometric distortion in a differentiable manner. Color shifts and intensity changes can be implemented using differentiable element-wise transformations (linear, multiplicative, gamma). Blurring associated with lens effects or motion can be implemented using a convolutional layer. The nuisance transformation layers can be chained, resulting in a renderer that can model complex geometric and photometric transformations (Figure 2).
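To make the renderer concrete, here is a simplified sketch (our own illustration, not the authors' code) of such a chain of piecewise-differentiable nuisance layers in PyTorch: superimposition over a background patch, an affine warp around the identity via a spatial transformer, an element-wise color/contrast perturbation, and a small convolutional blur. The sampling ranges and the blur kernel are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def render(marker, backgrounds, sigma=0.1, delta=0.2, blur_kernel=None):
    """A simplified nuisance chain T(M; phi): superimpose -> affine warp -> color -> blur.

    marker:      (B, 3, m, m) synthesizer output in [0, 1]
    backgrounds: (B, 3, H, W) random background patches with H, W >= m
    """
    B, C, m, _ = marker.shape
    _, _, H, W = backgrounds.shape
    # 1) superimpose the marker over a larger background patch (centered here for brevity)
    canvas = backgrounds.clone()
    top, left = (H - m) // 2, (W - m) // 2
    canvas[:, :, top:top + m, left:left + m] = marker
    # 2) random affine warp around the identity transform (spatial transformer)
    theta = torch.tensor([[1., 0., 0.], [0., 1., 0.]]).repeat(B, 1, 1)
    theta = theta + sigma * torch.randn(B, 2, 3)
    grid = F.affine_grid(theta, canvas.size(), align_corners=False)
    warped = F.grid_sample(canvas, grid, align_corners=False)
    # 3) element-wise color shift and contrast reduction
    c1 = 1.0 + delta * (2 * torch.rand(B, 1, 1, 1) - 1)
    c3 = delta * (2 * torch.rand(B, 1, 1, 1) - 1)
    k = torch.rand(B, 1, 1, 1)
    colored = k * (c1 * warped + c3) + (1 - k) * 0.5
    # 4) blur with a small averaging kernel (stand-in for lens or motion blur)
    if blur_kernel is None:
        blur_kernel = torch.ones(C, 1, 3, 3) / 9.0
    return F.conv2d(colored, blur_kernel, padding=1, groups=C)
```

Because every stage is (piecewise) differentiable, gradients of the decoding loss can flow back through T into the synthesizer.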
Figure 2: Visualizations of the rendering network T(M; \u03c6). For the input marker M on the left, the output of the network is obtained through several stages (which are all piecewise-differentiable w.r.t. their inputs); on the right, the outputs T(M; \u03c6) for several random nuisance parameters \u03c6 are shown. The use of piecewise-differentiable transforms within T makes it possible to backpropagate through T.\n\nControlling the visual appearance. Interestingly, we observed that under variable conditions, the optimization of (2) results in markers that have a consistent and interesting visual texture (Figure 3). Despite such style consistency, it might be desirable to control the appearance of the resulting markers more explicitly, e.g. using some artistic prototypes. Recently, [7] have achieved remarkable results in texture generation by measuring the statistics of textures using Gram matrices of convolutional maps inside deep convolutional networks trained to classify natural images. Texture synthesis can then be achieved by minimizing the deviation between such statistics of generated images and of style prototypes. Based on their approach, [12, 29] have suggested including such a deviation as a loss in the training process for deep feed-forward generative neural networks. In particular, the feed-forward networks in [29] are trained to convert noise vectors into textures.\nWe follow this line of work and augment our learning objective (2) with the texture loss of [7]. Thus, we consider a feed-forward network C(M; \u03b3) that computes the result of the t-th convolutional layer of a network trained for large-scale natural image classification, such as the VGGNet [28]. For an image M, the output C(M; \u03b3) thus contains k 2D channels (maps). The network C uses the parameters \u03b3 that are pre-trained on a large-scale dataset and that are not part of our learning process. The style of an image M is then defined using the following k-by-k Gram matrix G(M; \u03b3), with each element defined as:\n\nGij(M; \u03b3) = \u27e8Ci(M; \u03b3), Cj(M; \u03b3)\u27e9,   (3)\n\nwhere Ci and Cj are the i-th and the j-th maps and the inner product is taken over all spatial locations. Given a prototype texture M0, the learning objective can be augmented with the term:\n\nfstyle(\u03b8S) = E_{b\u223cU(n)} \u2016G(S(b; \u03b8S); \u03b3) \u2212 G(M0; \u03b3)\u2016\u00b2.   (4)\n\nThe incorporation of the term (4) forces the markers S(b; \u03b8S) produced by the synthesizer to have a visual appearance similar to instances of the texture defined by the prototype M0 [7].
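The style statistics (3) and the mismatch term (4) can be sketched as follows; this is an illustrative snippet under our own assumptions, where the pretrained layer C(M; γ) (e.g. a fixed VGG slice) is assumed to be computed elsewhere and its feature maps are passed in.

```python
import torch

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Gram matrix (3): feats has shape (B, k, H, W); returns (B, k, k).

    Entry (i, j) is the inner product of maps i and j over all spatial locations.
    """
    B, k, H, W = feats.shape
    flat = feats.view(B, k, H * W)
    return torch.bmm(flat, flat.transpose(1, 2))

def style_loss(marker_feats: torch.Tensor, prototype_feats: torch.Tensor) -> torch.Tensor:
    """Style term (4): squared Frobenius distance between the two Gram matrices."""
    diff = gram_matrix(marker_feats) - gram_matrix(prototype_feats)
    return (diff ** 2).sum(dim=(1, 2)).mean()
```

In practice the Gram matrices are often normalized by the number of maps and spatial locations; we omit such scaling here for brevity.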
3 Related Work\n\nWe now discuss the classes of deep learning methods that, to the best of our understanding, are most related to our approach.\nOur work is partially motivated by the recent approaches that analyze and visualize pretrained deep networks by synthesizing color images evoking certain responses in these networks. Towards this end, [27] generate examples that maximize probabilities of certain classes according to the network, [33] generate visual illusions that maximize such probabilities while retaining similarity to a predefined image of a potentially different class, and [22] also investigate ways of generating highly-abstract and structured color images that maximize probabilities of a certain class. Finally, [20] synthesize color images that evoke a predefined vector of responses at a certain level of the network for the purpose of network inversion. Our approach is related to these approaches, since our markers can be regarded as stimuli invoking certain responses in the recognizer network. Unlike these approaches, our recognizer network is not kept fixed but is updated together with the synthesizer network that generates the marker images.\nAnother obvious connection is to autoencoders [2], which are models trained to (1) encode inputs into a compact intermediate representation through the encoder network and (2) recover the original input by passing the compact representation through the decoder network. Our system can be regarded as a special kind of autoencoder with a certain format of the intermediate representation (a color image). Our decoder is trained to be robust to a certain class of transformations of the intermediate representations that are modeled by the rendering network. In this respect, our approach is related to variational autoencoders [15] that are trained with stochastic intermediate representations and to denoising autoencoders [30] that are trained to be robust to noise.\nFinally, our approach for creating textured markers can be related to steganography [24], which aims at hiding a signal in a carrier image. Unlike steganography, we do not aim to conceal information, but just to minimize its \u201cintrusiveness\u201d, while keeping the information machine-readable in the presence of distortions associated with printing and scanning.\n\nFigure 3: Visualization of the markers learned by our approach under different circumstances shown in the captions (see text for details). The captions also show the bit length, the capacity of the resulting encoding (in bits), as well as the accuracy achieved during training: 64 bits, default params, C=59.9, p=99.3%; 96 bits, low affine, C=90.2, p=99.3%; 64 bits, low affine \u03c3 = 0.05, C=61.2, p=99.5%; 8 bits, high blur, C=7.91, p=99.9%; 32 bits, grayscale, C=27.9, p=98.3%; 64 bits, nonlinear encoder, C=58.4, p=98.9%; 64 bits, thin network, C=40.1, p=93.2%; 64 bits, 16 pixel marker, C=56.8, p=98.5%. In each case we show six markers: (1) the marker corresponding to a bit sequence consisting of \u22121, (2) the marker corresponding to a bit sequence consisting of +1, (3) and (4) markers for two random bit sequences that differ by a single bit, (5) and (6) two markers corresponding to two more random bit sequences. Under many conditions a characteristic grid pattern emerges.\n\n4 Experiments\n\nBelow, we present a qualitative and quantitative evaluation of our approach. For longer bit sequences, the approach might not be able to train a perfect pair of a synthesizer and a recognizer, and therefore, similarly to other visual marker systems, it makes sense to use error-correcting encoding of the signal. Since the recognizer network returns the odds for each bit in the recovered signal, our approach is suitable for any probabilistic error-correction coding [19].
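If one reads σ(ri) as the probability that bit i equals +1 (a modeling assumption on our part), turning the raw recognizer outputs into soft inputs for a probabilistic decoder is straightforward; the snippet below is an illustrative sketch, not a component described in the paper.

```python
import torch

def soft_bits(r: torch.Tensor):
    """Convert raw recognizer outputs r into soft and hard bit decisions.

    p[i]   = sigmoid(r[i])        -- probability that bit i is +1
    llr[i] = log(p[i] / (1-p[i])) -- log-likelihood ratio, which simplifies to r[i]
    hard   = thresholding at zero, giving values in {-1, +1}
    """
    p = torch.sigmoid(r)
    llr = r  # log(sigmoid(r) / (1 - sigmoid(r))) == r
    hard = torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))
    return p, llr, hard
```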
Synthesizer architectures. For the experiments without the texture loss, we use the simplest synthesizer network, which consists of a single linear layer (with a 3m\u00b2 \u00d7 n matrix and a bias vector) that is followed by an element-wise sigmoid. For the experiments with the texture loss, we started with the synthesizer used in [29], but found out that it can be greatly simplified for our task. Our final architecture takes a binary code as input and transforms it with a single fully-connected layer and a series of 3 \u00d7 3 convolutions with 2\u00d7 upsamplings in between.\nRecognizer architectures. Unless reported otherwise, the recognizer network was implemented as a ConvNet with three convolutional layers (96 5 \u00d7 5 filters followed by max-pooling and ReLU) and two fully-connected layers with 192 and n output units respectively (where n is the length of the code). We find this architecture sufficient to successfully deal with marker encoding. In some experiments we have also considered a much smaller network with 24 maps in the convolutional layers and 48 units in the penultimate layer (\u201cthin network\u201d). In general, the convergence of the training stage greatly benefits from adding Batch Normalization [10] after every convolutional layer. During our experiments with the texture loss, we used a VGGNet-like architecture with 3 blocks, each consisting of two 3 \u00d7 3 convolutions and max-pooling, followed by two dense layers.\n\nFigure 4: Examples of textured 64-bit marker families. The texture prototype is shown in the first column, while the five remaining columns show markers for the following sequences: all \u22121, all +1, 32 consecutive \u22121 followed by 32 +1, and, finally, two random bit sequences that differ by a single bit.\n\nRendering settings. We perform the spatial transform as an affine transformation, where the 6 affine parameters are sampled from [1, 0, 0, 0, 1, 0] + N(0, \u03c3) (assuming the origin at the center of the marker). The example for \u03c3 = 0.1 is shown in Fig. 2. We leave more complex spatial transforms (e.g. thin plate splines [11]) that can make markers more robust to bending for future work. Some resilience to bending can still be observed in our qualitative results.\nGiven an image x, we implement the color transformation layer as c1 x^c2 + c3, where the parameters are sampled from the uniform distribution U[\u2212\u03b4, \u03b4]. As we find that printed markers tend to reduce the color contrast, we add a contrast reduction layer that transforms each value to kx + (1 \u2212 k)[0.5] for a random k.\nQuantitative measurements. To quantify the performance of our markers under different circumstances, we report the accuracy p to which our system converges during learning under different settings (to evaluate the accuracy, we threshold recognizer predictions at zero). Whenever we vary the signal length n, we also report the capacity of the code, which is defined as C = n(1 \u2212 H(p)), where H(p) = \u2212p log p \u2212 (1 \u2212 p) log(1 \u2212 p) is the coding entropy. Unless specified otherwise, we use the rendering network settings visualized in Figure 2, which give an impression of the variability and the difficulty of the recovery problem, as the recognizer network is applied to the outputs of this rendering network.
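The capacity numbers quoted in the figure captions can be reproduced with the formula above; a small helper is shown below (base-2 logarithms are assumed here so that the capacity comes out in bits, which the paper does not state explicitly).

```python
import math

def capacity(n_bits: int, p: float) -> float:
    """Capacity C = n * (1 - H(p)), with H the binary entropy of the per-bit accuracy p."""
    if p <= 0.0 or p >= 1.0:
        return float(n_bits)
    h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return n_bits * (1.0 - h)

# 64 bits at 99.3% per-bit accuracy gives roughly 60 bits of capacity,
# close to the C = 59.9 reported for the "64 bits, default params" setting
# (the reported accuracy is rounded).
print(capacity(64, 0.993))
```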
Experiments without texture loss. The bulk of the experiments without the texture loss has been performed with m = 32, i.e. 32 \u00d7 32 patches (we used bilinear interpolation when printing or visualizing). The learned marker families with the base architecture as well as with its variations are shown in Figure 3. It is curious to see the emergence of lattice structures (even though our synthesizer network in this case was a simple single-layer multiplicative network). Apparently, such lattices are most efficient in terms of storing information for later recovery with a ConvNet. It can also be seen how the system can adapt the markers to varying bit lengths or to varying robustness demands (e.g. to increasing blur or geometric distortions). We have further plotted how the quantitative performance depends on the bit length and on the marker size in Figure 6.\n\nFigure 5: Screenshots of the marker recognition process (the black box is a part of the user interface and corresponds to perfect alignment). The captions are in (number of correctly recovered bits / total sequence length) format; the counts for the screenshots shown are 64/64, 63/64, 124/128, 32/32, 64/64, 59/64, 62/64, 122/128, 31/32, 56/64, 64/64, 64/64, 126/128, 32/32, 64/64, 56/64, 59/64, 115/128, 31/32, and 60/64. The rightmost two columns correspond to stylized markers. These marker families were trained with spatial variances \u03c3 = 0.1, 0.05, 0.1, 0.05, 0.05 respectively. A larger \u03c3 leads to code recovery that is more robust with respect to affine transformations.\n\nExperiments with texture loss. An interesting effect we have encountered while training the synthesizer with the texture loss and a small output marker size is that it often ended up producing very similar patterns. We tried to tweak the architecture to handle this problem, but eventually found out that the effect goes away for larger markers.\nPerformance of real markers. We also show some qualitative results that include printing (on a laser printer, using various backgrounds) and capturing (with a webcam) of the markers. Characteristic results in Figure 5 demonstrate that our system can successfully recover the encoded signals with a small number of mistakes. The number of mistakes can be further reduced by applying the system with jitter and averaging the odds (not implemented here).\nHere, we aid the system by roughly aligning the marker with a pre-defined square (shown as part of the user interface). As can be seen, the degradation of the results with increasing alignment error is graceful (due to the use of affine transforms inside the rendering network at train time). In a more advanced system, such alignment can be bypassed altogether, using a pipeline that detects marker instances in a video stream and localizes their corners. Here, one can either use existing quad detection algorithms as in [23], or make the localization process a deep feed-forward network and include it into the joint learning in our system. In the latter case, the synthesizer would adapt to produce markers that are distinguishable from backgrounds and have easily identifiable corners.
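As a rough illustration of this decoding path (our own sketch; the paper does not prescribe a specific preprocessing pipeline), recovering bits from a roughly aligned photo amounts to cropping the alignment square, resizing to the training resolution, running the recognizer, and thresholding at zero.

```python
import torch
import torch.nn.functional as F

def decode_photo(photo, recognizer, box, marker_size=32):
    """photo: (3, H, W) tensor in [0, 1]; box: (top, left, height, width) of the rough alignment square."""
    top, left, h, w = box
    crop = photo[:, top:top + h, left:left + w].unsqueeze(0)          # (1, 3, h, w)
    crop = F.interpolate(crop, size=(marker_size, marker_size),
                         mode='bilinear', align_corners=False)        # match training resolution
    r = recognizer(crop)                                              # (1, n) raw outputs
    bits = (r >= 0).to(torch.int8) * 2 - 1                            # threshold at zero -> {-1, +1}
    return bits.squeeze(0)
```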
Figure 6: Left \u2013 dependence of the recognition accuracy on the size of the bit string for two variants with the default networks and one with a reduced number of maps in each convolutional layer (the curves correspond to the \u201cless affine\u201d, \u201cdefault\u201d, and \u201cthin network\u201d settings; the axes show accuracy in % versus the number of bits). Reducing the capacity of the network hurts the performance a lot, while reducing the spatial variation in the rendering network (to \u03c3 = 0.05) increases the capacity very considerably. Right \u2013 dependence of the recognition accuracy on the marker size in pixels (with otherwise default settings). The capacity of the coding quickly saturates as markers grow bigger.\n\nIn such qualitative experiments (Figure 5), we observe error rates that are roughly comparable with our quantitative experiments.\nRecognizer networks for QR-codes. We have also experimented with replacing the synthesizer network with a standard QR-encoder. While we tried different settings (such as the error-correction level and the input bit sequence representation), the highest recognition rate we could achieve with our architecture of the recognizer network was only 85%. Apparently, the recognizer network cannot reverse the combination of error-correction encoding and rendering transformations well. We also tried to replace both the synthesizer and the recognizer with a QR-encoder and a QR-decoder. Here we found that standard QR-decoders cannot decode QR-markers processed by our renderer network at the typical level of blur in our experiments (though special-purpose blind deblurring algorithms such as [32] are likely to succeed).\n\n5 Discussion\n\nIn this work, we have proposed a new approach to marker design, where the markers and their recognizer are learned jointly. Additionally, an aesthetics-related term can be added into the objective. To the best of our knowledge, we are the first to approach visual marker design using optimization.\nOne curious side aspect of our work is the fact that the learned markers can provide an insight into the architecture of ConvNets (or whatever architecture is used in the recognizer network). In more detail, they represent patterns that are most suitable for recognition with ConvNets. Unlike other approaches that e.g. visualize patterns for networks trained to classify natural images, our method decouples geometric and topological factors on one hand from the natural image statistics on the other, as we obtain these markers in a \u201ccontent-free\u201d manner1.\nAs discussed above, one further extension to the system might be to include a marker localizer into the learning as another deep feed-forward network. We note that in some scenarios (e.g. generating augmented reality tags for real-time camera localization), one can train the recognizer to estimate the parameters of the geometric transformation in addition to, or even instead of, recovering the input bit string. This would make it possible to create visual markers that are particularly suitable for accurate pose estimation.\n\n1The only exception is the background images used by the rendering layer. In our experience, their statistics have negligible influence on the emerging patterns.\n\nReferences\n
[1] L. F. Belussi and N. S. Hirata. Fast component-based QR code detection in arbitrarily acquired images. Journal of Mathematical Imaging and Vision, 45(3):277\u2013292, 2013.\n[2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009.\n[3] F. Bergamasco, A. Albarelli, and A. Torsello. Pi-Tag: a fast image-space marker design based on projective invariants. Machine Vision and Applications, 24(6):1295\u20131310, 2013.\n[4] D. Claus and A. W. Fitzgibbon. Reliable fiducial detection in natural scenes. Computer Vision \u2013 ECCV 2004, pp. 469\u2013480. Springer, 2004.\n[5] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.\n[6] M. Fiala. ARTag, a fiducial marker system using digital techniques. Conf. on Computer Vision and Pattern Recognition (CVPR), v. 2, pp. 590\u2013596, 2005.\n[7] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 262\u2013270, 2015.\n[8] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n[9] M. Hara, M. Watabe, T. Nojiri, T. Nagaya, and Y. Uchiyama. Optically readable two-dimensional code and method and apparatus using the same, 1998. US Patent 5,726,435.\n[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conference on Machine Learning (ICML), pp. 448\u2013456, 2015.\n[11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems, pp. 2008\u20132016, 2015.\n[12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision (ECCV), pp. 694\u2013711, 2016.\n[13] M. Kaltenbrunner and R. Bencina. Reactivision: a computer-vision framework for table-based tangible interaction. Proc. of the 1st International Conf. on Tangible and Embedded Interaction, pp. 69\u201374, 2007.\n[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.\n[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.\n[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541\u2013551, 1989.\n[17] C.-C. Lo and C. A. Chang. Neural networks for bar code positioning in automated material handling. Industrial Automation and Control: Emerging Technologies, pp. 485\u2013491. IEEE, 1995.\n[18] A. Longacre Jr and R. Hussey. Two dimensional data encoding structure and symbology for use with optical readers, 1997. US Patent 5,591,956.\n[19] D. J. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.\n[20] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.\n[21] J. Mooser, S. You, and U. Neumann. Tricodes: A barcode-like fiducial design for augmented reality media. IEEE Multimedia and Expo, pp. 
1301\u20131304, 2006.\n\n[22] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High con\ufb01dence predictions\n\nfor unrecognizable images. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[23] E. Olson. Apriltag: A robust and \ufb02exible visual \ufb01ducial system. Robotics and Automation (ICRA), 2011\n\nIEEE International Conference on, pp. 3400\u20133407. IEEE, 2011.\n\n[24] F. A. Petitcolas, R. J. Anderson, and M. G. Kuhn. Information hiding-a survey. Proceedings of the IEEE,\n\n[25] A. Richardson and E. Olson. Learning convolutional \ufb01lters for interest point detection. Conf. on Robotics\n\n[26] D. Scharstein and A. J. Briggs. Real-time recognition of self-similar landmarks.\n\nImage and Vision\n\n87(7):1062\u20131078, 1999.\n\nand Automation (ICRA), pp. 631\u2013637, 2013.\n\nComputing, 19(11):763\u2013772, 2001.\n\n[27] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image\n\nclassi\ufb01cation models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.\n\n[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[29] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of\n\ntextures and stylized images. Int. Conf. on Machine Learning (ICML), 2016.\n\n[30] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with\n\ndenoising autoencoders. Int. Conf. on Machine learning (ICML), 2008.\n\n[31] N. J. Woodland and S. Bernard. Classifying apparatus and method, 1952. US Patent 2,612,994.\n[32] S. Yahyanejad and J. Str\u00a8om. Removing motion blur from barcode images. 2010 IEEE Computer Society\n\nConference on Computer Vision and Pattern Recognition-Workshops, pp. 41\u201346. IEEE, 2010.\n\n[33] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Computer vision\u2013\n\nECCV 2014, pp. 818\u2013833. Springer, 2014.\n\n[34] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level\n\nfeature learning. Int. Conf. on Computer Vision (ICCV), pp. 2018\u20132025, 2011.\n\n9\n\n\f", "award": [], "sourceid": 2062, "authors": [{"given_name": "Oleg", "family_name": "Grinchuk", "institution": "Skolkovo Institute of Science and Technology"}, {"given_name": "Vadim", "family_name": "Lebedev", "institution": "Skolkovo Institute of Science and Technology"}, {"given_name": "Victor", "family_name": "Lempitsky", "institution": "Skolkovo Institute of Science and Technology (Skoltech)"}]}